CRITICAL NaN Detected and cannot train

If training is failing to start, and you are not receiving an error message telling you what to do, tell us about it here


Forum rules

Read the FAQs and search the forum before posting a new topic.

This forum is for reporting errors with the Training process. If you want to get tips, or better understand the Training process, then you should look in the Training Discussion forum.

Please mark any answers that fixed your problems so others can find the solutions.

wongca
Posts: 10
Joined: Thu Aug 20, 2020 9:52 pm
Has thanked: 1 time

CRITICAL NaN Detected and cannot train

Post by wongca »

CPU: AMD Ryzen 5 3600
GPU: Nvidia RTX 2060
training type: Dlight 256

[00:10:31] [#131623] Loss A: 0.02657, Loss B: 0.02045
12/10/2021 00:10:34 CRITICAL NaN Detected. Loss: [0.022320855408906937, nan]
12/10/2021 00:10:34 CRITICAL Error caught! Exiting...
12/10/2021 00:10:34 ERROR Caught exception in thread: '_training_0'
12/10/2021 00:10:34 ERROR A NaN was detected and you have NaN protection enabled. Training has been terminated.
Process exited.

Hi all, I get this error message whenever training passes about 125,000 iterations, and it happens with every video. If I restart training, it stops again after around 2-3 hours.

Thanks

bryanlyon
Site Admin
Posts: 793
Joined: Fri Jul 12, 2019 12:49 am
Answers: 44
Location: San Francisco
Has thanked: 4 times
Been thanked: 218 times
Contact:

Re: CRITICAL NaN Detected and cannot train

Post by bryanlyon »

Dlight is prone to NaN errors; it is simply part of its design that it is more likely to hit a NaN. You can try reducing the learning rate to lower the chance of NaNs. There are other things you might try to reduce them as well, but you won't completely eliminate them. When you hit a NaN, the recommended step is to roll back to a previous snapshot or backup and continue from there.
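To illustrate what the "NaN protection" in the log above is doing, here is a minimal Python sketch. This is not Faceswap's actual code; `fake_losses` is a made-up stand-in for real per-side training losses, used only to show how a loop halts the moment either side's loss becomes NaN:

```python
import math

def train_with_nan_protection(loss_fn, max_iterations):
    """Run a training loop that halts as soon as any loss goes NaN,
    mirroring the 'CRITICAL NaN Detected' behaviour in the log."""
    losses = []
    for iteration in range(1, max_iterations + 1):
        losses = loss_fn(iteration)  # e.g. [loss_a, loss_b]
        if any(math.isnan(loss) for loss in losses):
            # This is the point where training would be terminated
            # and the NaN losses reported.
            return iteration, losses
    return max_iterations, losses

# Hypothetical losses: side B diverges to NaN at iteration 5.
def fake_losses(iteration):
    return [0.022, float("nan") if iteration >= 5 else 0.020]

stopped_at, final = train_with_nan_protection(fake_losses, 100)
print(stopped_at, final)  # stops at iteration 5 with a NaN in side B
```

Lowering the learning rate makes the weight updates smaller, which reduces the chance that an update overshoots into a non-finite value in the first place.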

torzdf
Posts: 2649
Joined: Fri Jul 12, 2019 12:53 am
Answers: 159
Has thanked: 128 times
Been thanked: 623 times

Re: CRITICAL NaN Detected and cannot train

Post by torzdf »

bryanlyon wrote: Thu Dec 09, 2021 3:56 pm

When you hit a NAN the recommended step is to roll back to a previous snapshot or backup and continue from there.

I would add to this: roll back significantly (i.e. at least 50k iterations). NaNs will exist in the model for quite some time before they appear in the loss.

My word is final
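Since NaNs can lurk in the weights long before they show up in the loss, it can help to check a snapshot for non-finite values before resuming from it. A small NumPy-based sketch (the layer names and the dict-of-arrays snapshot format are hypothetical; a real Faceswap model would be loaded from its saved Keras weights):

```python
import numpy as np

def find_non_finite_weights(weights):
    """Return the names of weight arrays containing NaN or Inf.

    `weights` is assumed to be a mapping of layer name -> NumPy array.
    A clean snapshot returns an empty list.
    """
    bad = []
    for name, array in weights.items():
        if not np.isfinite(array).all():
            bad.append(name)
    return bad

# Hypothetical snapshot: one decoder layer is already poisoned.
snapshot = {
    "encoder/conv1": np.ones((3, 3)),
    "decoder_b/conv2": np.array([0.5, np.nan, 0.1]),
}
print(find_non_finite_weights(snapshot))  # ['decoder_b/conv2']
```

If a snapshot already contains non-finite weights, rolling back further (per the advice above, at least 50k iterations) is the safer option.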
