Training exits with no error message - Max iterations not reached
Hi,
I looked around and couldn't find this anywhere. Probably nothing to worry about, but I was curious.
Last night (I train overnight), my training stopped for no reason I can figure out. The end of the faceswap.log looks like the end of the output of my training:
06/21/2021 06:38:52 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.06228, face_b: 0.07137
06/21/2021 06:49:00 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.06181, face_b: 0.07024
Except at the end of my training it says:
Process exited <--- Time was 06:49:00 - on 06/21/2021 - exactly
As it normally would if I stopped it. There is no crash log.
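In case it helps anyone checking the same thing, here is a rough sketch of how the save timestamps could be pulled out of faceswap.log and compared against the exit time. The line layout is assumed purely from the two lines quoted above, and the helper names (`save_times`, `save_gaps`) are made up for illustration; this is not part of faceswap.

```python
import re
from datetime import datetime

# Assumed layout, inferred only from the two log lines quoted in this post:
# "MM/DD/YYYY HH:MM:SS <process> <thread> <module> <function> <LEVEL> <message>"
TIMESTAMP_RE = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) ")


def save_times(lines):
    """Return the timestamp of every '[Saved models]' line (hypothetical helper)."""
    times = []
    for line in lines:
        match = TIMESTAMP_RE.match(line)
        if match and "[Saved models]" in line:
            times.append(datetime.strptime(match.group(1), "%m/%d/%Y %H:%M:%S"))
    return times


def save_gaps(times):
    """Return the gap in seconds between each pair of consecutive saves."""
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


sample = [
    "06/21/2021 06:38:52 MainProcess _training_0 _base _save INFO "
    "[Saved models] - Average loss since last save: face_a: 0.06228, face_b: 0.07137",
    "06/21/2021 06:49:00 MainProcess _training_0 _base _save INFO "
    "[Saved models] - Average loss since last save: face_a: 0.06181, face_b: 0.07024",
]

times = save_times(sample)
print("Last save:", times[-1])
print("Gaps between saves (seconds):", save_gaps(times))
```

On the two lines above, the last save is 06:49:00 and the gap between saves is a little over ten minutes, so the exit happened at the same second as the final save rather than partway through an interval.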
Is this just something that happens every once in a blue moon? It's the first time it has ever happened to me. I thought it might be because I'm training dfl-sae-df at a batch size of 8 (the highest I can go on GPU only with my RTX-2060 6GB), or because I had been running the program for too long. I never reached the maximum of 1,000,000 iterations, but I had trained the day before, done some manual editing of alignments, cranked out some pillow pngs to test, and then gone back to training, so the program had been running for almost 48 hours. I was curious whether there is a limit on how long faceswap will run non-stop, or whether my iteration count is getting too high and making the model too heavy (it takes a while to load).
Again, no error messages and nothing in the Windows Event Viewer logs for the time of the stop.
Thanks to everyone for their great contributions to this forum. I've learnt a lot, and hopefully I didn't miss someone with this same situation when I searched for it.
NOTE: I am at around 760K iterations. I'm assuming my numbers are still fairly high, based on the examples I've seen here with decent results on this model at around 600k iterations with batch sizes of 64 and 100 (mine is 8), so I assume I have a good way to go.
Note that I was able to just hit the train button again and training resumed. Just lost a few hours of training while I slept. No worries, just curious.
Thanks,
Mike