Training exits with no error message - Max iterations not reached

If training is failing to start, and you are not receiving an error message telling you what to do, tell us about it here



mgolvach
Posts: 3
Joined: Sun May 17, 2020 2:01 am
Has thanked: 1 time

Training exits with no error message - Max iterations not reached

Post by mgolvach »

Hi,

I looked around and couldn't find this anywhere. Probably nothing to worry about, but I was curious.

Last night (I train overnight), my training stopped for no reason I can figure out. The end of the faceswap.log looks like the end of the output of my training:


06/21/2021 06:38:52 MainProcess     _training_0                    _base           _save                          INFO     [Saved models] - Average loss since last save: face_a: 0.06228, face_b: 0.07137
06/21/2021 06:49:00 MainProcess     _training_0                    _base           _save                          INFO     [Saved models] - Average loss since last save: face_a: 0.06181, face_b: 0.07024

Except at the end of my training it says:

Process exited <--- Time was 06:49:00 - on 06/21/2021 - exactly

As it normally would if I stopped it. There is no crash log.
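For anyone wanting to pin down when training last saved, the log lines above can be read with a few lines of Python. This is only a sketch: the `MM/DD/YYYY HH:MM:SS` layout is inferred from the two lines quoted in this post, not from any faceswap documentation, and the helper names `parse_timestamp` and `last_save_time` are made up for illustration.

```python
from datetime import datetime

# The two log lines quoted in the post above, copied verbatim.
LOG_LINES = [
    "06/21/2021 06:38:52 MainProcess     _training_0                    _base           _save                          INFO     [Saved models] - Average loss since last save: face_a: 0.06228, face_b: 0.07137",
    "06/21/2021 06:49:00 MainProcess     _training_0                    _base           _save                          INFO     [Saved models] - Average loss since last save: face_a: 0.06181, face_b: 0.07024",
]


def parse_timestamp(line):
    """Return the timestamp at the start of a log line.

    Assumes the first two whitespace-separated fields are the date
    (month/day/year) and the time, as in the lines quoted above.
    """
    date_part, time_part = line.split()[:2]
    return datetime.strptime(f"{date_part} {time_part}", "%m/%d/%Y %H:%M:%S")


def last_save_time(lines):
    """Return the timestamp of the latest '[Saved models]' line, or None."""
    saves = [parse_timestamp(line) for line in lines if "[Saved models]" in line]
    return max(saves) if saves else None


stamps = [parse_timestamp(line) for line in LOG_LINES]
print("Last save:", last_save_time(LOG_LINES))
print("Interval between the two saves:", stamps[1] - stamps[0])
```

The last save time can then be compared against anything recorded in the Windows Event Viewer around the same moment.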

Is this just something that happens once in a blue moon? It's the first time it has ever happened to me. I thought it might be because I'm training dfl-sae-df at a batch size of 8 (the highest I can go using only the GPU on my RTX-2060 6GB), or because I've just been running the program for too long. I never reached the 1000000 maximum iterations, but I had trained the day before, done some manual editing of alignments, cranked out some pillow pngs to test, and then gone back to training, so the program had been running for almost 48 hours. I was curious whether there is a time limit on how long faceswap will run non-stop, or whether my iteration count is getting too high and making the model too heavy (it takes a while to load).

Again, no error messages and nothing in the Windows Event Viewer logs for the time of the stop.

Thanks to everyone for their great contributions to this forum. I've learnt a lot, and hopefully I didn't miss someone with this same situation when I searched.

NOTE: I am at around 760K iterations, but I'm assuming that my numbers are still fairly high based on the examples I've seen here with decent results on this model around 600k on 64 and 100 batch sizes (mine's 8) and assume I have a good ways to go.

Note that I was able to just hit the train button again and training resumed. Just lost a few hours of training while I slept. No worries, just curious.

Thanks :)

Mike

Last edited by mgolvach on Tue Jun 22, 2021 1:21 pm, edited 1 time in total.
mgolvach
Posts: 3
Joined: Sun May 17, 2020 2:01 am
Has thanked: 1 time

Re: Training exits with no error message - Max iterations not reached

Post by mgolvach »

I feel silly for having written, but hopefully this will help someone out.

I'll have to reboot or just stop and start faceswap more often on my setup, I think. It ended up accumulating too much memory and not releasing it over time.

The main issue was Python resource exhaustion: Windows ran low on virtual memory, with the python processes competing against the Windows 10 Desktop Window Manager process (dwm.exe).

Thanks,

Mike

From my event viewer log:

Log Name: System
Source: Microsoft-Windows-Resource-Exhaustion-Detector
Date: 21/06/2021 06:56:27
Event ID: 2004
Task Category: Resource Exhaustion Diagnosis Events
Level: Warning
Keywords: Events related to exhaustion of system commit limit (virtual memory).
User: SYSTEM
Computer: xxxxxxx (Marked out intentionally)
Description:
Windows successfully diagnosed a low virtual memory condition. The following programs consumed the most virtual memory: dwm.exe (1460) consumed 45532336128 bytes, python.exe (4088) consumed 9649139712 bytes, and python.exe (13512) consumed 918192128 bytes.
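The byte counts in that description are hard to read at a glance. As a rough sketch (plain unit conversion only, with the figures copied from the event above and the helper names `to_gib` and `summarize` invented for illustration), they can be converted to GiB and ranked like this:

```python
# Byte counts copied from the Event ID 2004 description above.
consumers = {
    "dwm.exe (1460)": 45532336128,
    "python.exe (4088)": 9649139712,
    "python.exe (13512)": 918192128,
}

GIB = 1024 ** 3  # bytes per gibibyte


def to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / GIB


def summarize(usage):
    """Return (name, GiB) pairs sorted from largest to smallest consumer."""
    rows = sorted(usage.items(), key=lambda item: item[1], reverse=True)
    return [(name, to_gib(num_bytes)) for name, num_bytes in rows]


for name, gib in summarize(consumers):
    print(f"{name}: {gib:.2f} GiB")
```

Read this way, the event lists dwm.exe at a little over 42 GiB of virtual memory and the larger python.exe process at just under 9 GiB.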

torzdf
Posts: 2651
Joined: Fri Jul 12, 2019 12:53 am
Answers: 159
Has thanked: 129 times
Been thanked: 622 times

Re: Training exits with no error message - Max iterations not reached

Post by torzdf »

Thanks for getting back with your findings.

Faceswap shouldn't use up all your RAM, so it's something I'll need to look into (at least, I know I have left models training for weeks on end with no issues like this).

My word is final
