
Training exits with no error message - Max iterations not reached

Posted: Mon Jun 21, 2021 1:33 pm
by mgolvach

Hi,

I looked around and couldn't find this anywhere. Probably nothing to worry about, but I was curious.

Last night (I train overnight), my training stopped for no reason I can figure out. The end of faceswap.log matches the end of my training output:


06/21/2021 06:38:52 MainProcess     _training_0                    _base           _save                          INFO     [Saved models] - Average loss since last save: face_a: 0.06228, face_b: 0.07137
06/21/2021 06:49:00 MainProcess     _training_0                    _base           _save                          INFO     [Saved models] - Average loss since last save: face_a: 0.06181, face_b: 0.07024

Except at the end of my training it says:

Process exited <--- Time was 06:49:00 - on 06/21/2021 - exactly

As it normally would if I stopped it. There is no crash log.

Is this just something that happens once in a blue moon? It's the first time it has ever happened to me. I thought it might be because I'm training dfl-sae-df with a batch size of 8 (the highest I can go using only the GPU on my RTX 2060 6GB), or because I've been running the program for too long. I never reached the maximum of 1000000 iterations, but I had trained the day before, done some manual editing of alignments, cranked out some Pillow PNGs to test, and then gone back to training, so the program had been running for almost 48 hours. I was curious whether there is a time limit on how long faceswap will run non-stop, or whether my iteration count is getting too high and making the model too heavy (it takes a while to load).

Again, no error messages and nothing in the Windows Event Viewer logs for the time of the stop.

Thanks to everyone for their great contributions to this forum. I've learnt a lot, and hopefully I didn't miss someone with this same situation when I searched.

NOTE: I am at around 760K iterations, but I'm assuming my loss numbers are still fairly high, based on the examples I've seen here with decent results on this model at around 600K iterations with batch sizes of 64 and 100 (mine is 8), so I assume I have a good way to go.

Note that I was able to just hit the train button again and training resumed. Just lost a few hours of training while I slept. No worries, just curious.

Thanks :)

Mike


Re: Training exits with no error message - Max iterations not reached

Posted: Mon Jun 21, 2021 1:45 pm
by mgolvach

I feel silly for having written, but hopefully this will help someone out.

I'll have to reboot, or just stop and start faceswap more often on my setup, I think. It ended up accumulating too much memory over time and not releasing it.

The main issue was virtual memory exhaustion: the Python processes were competing for virtual memory with the Windows 10 Desktop Window Manager process (dwm.exe).

Thanks,

Mike

From my event viewer log:

Log Name: System
Source: Microsoft-Windows-Resource-Exhaustion-Detector
Date: 21/06/2021 06:56:27
Event ID: 2004
Task Category: Resource Exhaustion Diagnosis Events
Level: Warning
Keywords: Events related to exhaustion of system commit limit (virtual memory).
User: SYSTEM
Computer: xxxxxxx (Marked out intentionally)
Description:
Windows successfully diagnosed a low virtual memory condition. The following programs consumed the most virtual memory: dwm.exe (1460) consumed 45532336128 bytes, python.exe (4088) consumed 9649139712 bytes, and python.exe (13512) consumed 918192128 bytes.
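
For anyone who wants to read those byte counts more easily, here is a minimal Python sketch that pulls the process names, PIDs and sizes out of the description text above and converts them to GiB. The regular expression and the helper names (`parse_consumers`, `to_gib`) are my own assumptions for illustration, based only on the wording of this one event entry; other entries may be worded differently.

```python
import re

# Description text copied from the Event ID 2004 entry above.
DESCRIPTION = (
    "Windows successfully diagnosed a low virtual memory condition. "
    "The following programs consumed the most virtual memory: "
    "dwm.exe (1460) consumed 45532336128 bytes, "
    "python.exe (4088) consumed 9649139712 bytes, "
    "and python.exe (13512) consumed 918192128 bytes."
)

# Assumed pattern: "<name>.exe (<pid>) consumed <n> bytes".
PATTERN = re.compile(r"(\S+\.exe) \((\d+)\) consumed (\d+) bytes")


def parse_consumers(text):
    """Return a list of (process name, pid, size in bytes) found in the text."""
    return [(name, int(pid), int(size)) for name, pid, size in PATTERN.findall(text)]


def to_gib(num_bytes):
    """Convert a byte count to GiB (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30


if __name__ == "__main__":
    for name, pid, size in parse_consumers(DESCRIPTION):
        print(f"{name:<12} pid {pid:<6} {to_gib(size):6.2f} GiB")
```

By these numbers, dwm.exe accounts for roughly 42.4 GiB of virtual memory and the larger python.exe process for roughly 9.0 GiB. The sketch only reformats what the event entry reports; it does not show which process was actually growing over time.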


Re: Training exits with no error message - Max iterations not reached

Posted: Sat Jun 26, 2021 9:04 am
by torzdf

Thanks for getting back with your findings.

Faceswap shouldn't use up all your RAM, so it's something I'll need to look into (at least, I know I have left models training for weeks on end with no issues like this).