[00:10:31] [#131623] Loss A: 0.02657, Loss B: 0.02045
12/10/2021 00:10:34 CRITICAL NaN Detected. Loss: [0.022320855408906937, nan]
12/10/2021 00:10:34 CRITICAL Error caught! Exiting...
12/10/2021 00:10:34 ERROR Caught exception in thread: '_training_0'
12/10/2021 00:10:34 ERROR A NaN was detected and you have NaN protection enabled. Training has been terminated.
Process exited.
Hi all, I get this error message whenever I train for more than 125,000 iterations, and it happens with every video. If I start training again, it stops again after around 2-3 hours.
Dlight is prone to NaN errors. It's just part of its design that it's more likely to hit a NaN. You can try reducing the Learning Rate to reduce the chance of NaNs. There are also other things you might try to reduce them, but you won't completely eliminate them. When you hit a NaN, the recommended step is to roll back to a previous snapshot or backup and continue from there.
I would add to this... roll back significantly (i.e. at least 50k iterations). NaNs will exist in the model for quite some time before they appear in the Loss.
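If it helps, here is a rough Python sketch of how you might pick which snapshot to roll back to using the "at least 50k" rule of thumb. The function name and the list of snapshot iterations are made up for illustration and are not part of the training tool; only the NaN iteration (131623) comes from the log above.

```python
from typing import Optional, Sequence


def pick_rollback_snapshot(
    snapshot_iterations: Sequence[int],
    nan_iteration: int,
    min_rollback: int = 50_000,
) -> Optional[int]:
    """Return the newest snapshot that is at least `min_rollback`
    iterations older than the iteration where the NaN was reported.

    Hypothetical helper: it only does the arithmetic, it does not
    touch any model files.
    """
    cutoff = nan_iteration - min_rollback
    candidates = [it for it in snapshot_iterations if it <= cutoff]
    # None means no snapshot is old enough to satisfy the rule of thumb.
    return max(candidates) if candidates else None


# Assumed example: snapshots saved every 25k iterations.
snapshots = [25_000, 50_000, 75_000, 100_000, 125_000]
print(pick_rollback_snapshot(snapshots, nan_iteration=131_623))  # 75000
```

With the NaN reported at iteration 131623, the 100k and 125k snapshots are too recent under this rule, so the sketch picks 75k. If nothing is old enough it returns None, in which case you would have to fall back to the oldest backup you have.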