CRITICAL NaN Detected and cannot train

Posted: Thu Dec 09, 2021 2:32 pm
by wongca

CPU: AMD Ryzen 5 3600
GPU: Nvidia RTX 2060
training type: Dlight 256

[00:10:31] [#131623] Loss A: 0.02657, Loss B: 0.02045
12/10/2021 00:10:34 CRITICAL NaN Detected. Loss: [0.022320855408906937, nan]
12/10/2021 00:10:34 CRITICAL Error caught! Exiting...
12/10/2021 00:10:34 ERROR Caught exception in thread: '_training_0'
12/10/2021 00:10:34 ERROR A NaN was detected and you have NaN protection enabled. Training has been terminated.
Process exited.

Hi all, I get this error message whenever I train past about 125,000 iterations, and it happens with every video. If I restart training, it stops again after around 2-3 hours.

Thanks


Re: CRITICAL NaN Detected and cannot train

Posted: Thu Dec 09, 2021 3:56 pm
by bryanlyon

Dlight is prone to NAN errors. It's just part of its design that it's more likely to hit a NAN. You can try reducing the Learning Rate to lower the chance of NANs. There are also other things you might try to reduce them, but you won't completely eliminate them. When you hit a NAN, the recommended step is to roll back to a previous snapshot or backup and continue from there.
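
For anyone wondering what the NaN protection in the log above amounts to, here is a minimal sketch of the idea in plain Python. It is not the actual Faceswap code; the function name `find_non_finite` is made up for illustration, and the loss values are copied from the log line in the first post.

```python
import math


def find_non_finite(losses):
    """Return the indices of loss values that are NaN or infinite."""
    return [i for i, value in enumerate(losses) if not math.isfinite(value)]


# Values from the log line above: side A is still fine, side B has gone NaN.
losses = [0.022320855408906937, float("nan")]

bad = find_non_finite(losses)
if bad:
    # A real trainer would stop here rather than keep updating the weights.
    print(f"Non-finite loss at index {bad}; stop training and roll back")
```

The point of stopping at once is that continuing to train on a NaN loss only spreads the bad values further through the model.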


Re: CRITICAL NaN Detected and cannot train

Posted: Fri Dec 10, 2021 2:25 pm
by torzdf
bryanlyon wrote: Thu Dec 09, 2021 3:56 pm

When you hit a NAN the recommended step is to roll back to a previous snapshot or backup and continue from there.

I would add to this: roll back significantly (i.e. at least 50k iterations). NaNs will exist in the model for quite some time before they appear in the Loss.
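
As a rough illustration of that rule of thumb, the sketch below picks the newest snapshot that is at least 50k iterations older than the point where the NaN appeared. The helper `pick_rollback_snapshot` and the 25k snapshot interval are hypothetical; substitute the iteration counts of whatever snapshots or backups you actually have.

```python
def pick_rollback_snapshot(snapshot_iterations, nan_iteration, min_rollback=50_000):
    """Return the newest snapshot at least `min_rollback` iterations before the NaN.

    Returns None when no snapshot is old enough.
    """
    cutoff = nan_iteration - min_rollback
    candidates = [it for it in snapshot_iterations if it <= cutoff]
    return max(candidates) if candidates else None


# Hypothetical snapshots saved every 25k iterations (an assumption, not from the post).
snapshots = [25_000, 50_000, 75_000, 100_000, 125_000]

# The NaN in the first post showed up at iteration 131623.
print(pick_rollback_snapshot(snapshots, 131_623))  # prints 75000
```

With these example numbers the 100k and 125k snapshots are skipped because they fall inside the 50k window before the NaN, so they may already carry bad values.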