
[Discussion] How to fix Mixed Precision causing NaNs

Posted: Thu Aug 18, 2022 6:36 pm
by Icarus

Mixed Precision:
Last but not least, Mixed Precision. You love it and you hate it. It makes a huge difference in training speed and VRAM usage, but it is the frequent culprit behind NaNs. I did some research on Nvidia's website regarding this and I found the holy grail of hidden information that has cured me of the downside of using it. It all comes down to the epsilon. Nvidia recommends increasing your epsilon by a factor of 1e3 when training with Mixed Precision. So instead of the default 1e-07, I use 1e-04, and this has made a world of difference, with zero downside in terms of the model's ability to learn and, most importantly, no more NaNs.

Copied from my larger Phaze A notes post.
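
For anyone wanting to try the same thing outside of Faceswap, here is a minimal standalone Keras sketch of the idea (the learning rate is just an example value; in Faceswap itself you would simply change the Epsilon Exponent setting from -7 to -4 instead):

Code: Select all

import tensorflow as tf

# Enable mixed precision globally (compute in fp16, keep variables in fp32).
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Raise Adam's epsilon from the Keras default of 1e-07 to 1e-04,
# per the Nvidia recommendation described above.
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5, epsilon=1e-4)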


Re: How to fix Mixed Precision causing NaNs

Posted: Thu Aug 18, 2022 11:03 pm
by torzdf

I have to admit, I have fallen badly out of love with Mixed Precision. Lowering the size of the epsilon exponent (taking it from -7 towards zero) certainly does help, and it's good to know the epsilon can be taken fairly high with no real detrimental effect.

I basically added the ability to switch Mixed Precision on and off (for the same model) to try to work around the NaN issue a bit, but any guidance on how to reduce the likelihood of NaNs is appreciated.

I am currently looking at other potential VRAM saving tricks (although sadly with associated slow down, rather than the speed up that mixed precision gets us).


Re: How to fix Mixed Precision causing NaNs

Posted: Sat Aug 20, 2022 10:35 pm
by Icarus
torzdf wrote: ↑Thu Aug 18, 2022 11:03 pm

I have to admit, I have fallen badly out of love with Mixed Precision. Lowering the size of the epsilon exponent (taking it from -7 towards zero) certainly does help, and it's good to know the epsilon can be taken fairly high with no real detrimental effect.

I think it has something to do with FP16's representable range. This is what Nvidia has to say:

..."shifting by three exponent values (multiply by 8) was sufficient to match the accuracy achieved with FP32 training by recovering the relevant values lost to 0. Shifting by 15 exponent values (multiplying by 32K) would recover all but 0.1% of values lost to 0 when converting to FP16 and still avoid overflow. In other words, FP16 dynamic range is sufficient for training, but gradients may have to be scaled to move them into the range to keep them from becoming zeros in FP16."
https://docs.nvidia.com/deeplearning/pe ... index.html
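
To make the quoted passage a little more concrete, here is a tiny numpy sketch of the loss-scaling idea it describes (illustrative only; the gradient values are made up and the ×8 scale comes from Nvidia's example, not from anything Faceswap uses):

Code: Select all

import numpy as np

# Hypothetical tiny gradient values; the smallest fp16 subnormal is ~6e-08,
# so anything much below that is lost to zero on a straight cast.
grads = np.array([2e-8, 5e-8, 1e-7], dtype=np.float32)
print(grads.astype(np.float16))            # the first value flushes to 0

# Scale by 8 (shift 3 exponent values) before casting, as in the quote...
scaled = (grads * 8.0).astype(np.float16)
print(scaled)                              # all values survive in fp16

# ...then unscale back in fp32 before the optimizer applies the update.
print(scaled.astype(np.float32) / 8.0)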

Then there was this post about Adam, Mixed Precision and NaNs, which basically says that changing their epsilon from 7e-05 to 3e-05 worked for them:
https://discuss.pytorch.org/t/adam-half ... ans/1765/4

Also, thank you for making it possible to turn Mixed Precision on/off. This has been a tremendous help!


Re: [Discussion] How to fix Mixed Precision causing NaNs

Posted: Thu Oct 20, 2022 6:46 pm
by MaxHunter

Please forgive me if this is a dumb question; my brain hasn't done this type of math in 25 years.

I know Torz just said "lowering can help", but just to clarify that I'm understanding correctly...

If shifting 3 is a solution (for mixed precision with Adam), then can lowering the epsilon from -7 to -10 be a good thing? (Or maybe even -16, as is the case for AdaBelief.) Instead of raising -7 to -4? I'm having major NaN problems and haven't found a solution (other than just turning off NaN protection). However, I've found that lowering the epsilon helps. Is this just a placebo, or is it actually helping? The logic in my brain says that adding more zeros to the end of the loss number gives it more room to breathe and work itself out of the NaN. The downside is that it's just going to be slower?

Question #2:

It says in the Nvidia paper to shift by three (multiply by 8). Doesn't this mean an epsilon at -10 with a learning rate of 8e-5? (8 being the multiplier.)

Again, I know this sounds obvious to you, but my math/algebra/statistics is extremely rusty. 😉


Re: [Discussion] How to fix Mixed Precision causing NaNs

Posted: Fri Oct 21, 2022 9:19 am
by torzdf
MaxHunter wrote: ↑Thu Oct 20, 2022 6:46 pm

If shifting 3 is a solution (for mixed precision with Adam), then can lowering the epsilon from -7 to -10 be a good thing? (Or maybe even -16, as is the case for AdaBelief.) Instead of raising -7 to -4?

The epsilon is a small number that is added to the denominator of the optimizer's update calculation to prevent it from reducing to zero. We control the 'exponent' part of the epsilon, so -3 would be 0.001, -5 would be 0.00001, and so on. An epsilon with an exponent closer to zero (so -3 in the given example) should be less likely to NaN than one further from zero, as the number being added in the optimizer is larger; however, the larger the epsilon, the lower the accuracy of the model, so (as with all things ML) it's a balancing act.
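
For reference, this is roughly where the epsilon sits in an Adam-style update (a simplified numpy sketch, not Faceswap's actual optimizer code):

Code: Select all

import numpy as np

def adam_step(param, m_hat, v_hat, learning_rate=5e-5, epsilon=1e-7):
    # m_hat / v_hat are the bias-corrected first and second moment estimates.
    # Epsilon keeps the denominator away from zero: the larger it is, the more
    # it damps the update (less likely to blow up, but slightly less accurate).
    return param - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)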

One thing to bear in mind, though, is the numeric range of the floating point format being used. An epsilon exponent below -7 will fall outside the numeric range of fp16 (i.e. when using mixed precision) and so will have no effect (it will round to 0). Example:

Code: Select all

>>> np.float16(1e-7)
1e-07
>>> np.float16(1e-8)
0.0

fp32 has a much larger numeric range, so the exponent could be reduced more.

I'm not sure I understand your second question. A link to the reference would be useful.


Re: [Discussion] How to fix Mixed Precision causing NaNs

Posted: Sat Oct 22, 2022 3:52 pm
by MaxHunter

Thanks for the in-depth explanation. I know I'm a pain, but I honestly do research the questions beforehand. I didn't know about the effect that going below -7 would have on performance. It makes sense and answers why my performance slowed to a crawl when I updated it to -10.

As for the second question and the reference to the Nvidia paper (which I fully read and only partially understood 😆) in the link Icarus posted:

"... by three exponent values (multiply by 8) was sufficient to match the accuracy achieved with FP32 training by recovering the relevant values lost to 0. Shifting by 15 exponent values (multiplying by 32K) would recover all but 0.1% of values lost to 0 when converting to FP16 and still avoid overflow. In other words, FP16 dynamic range is sufficient for training, but gradients may have to be scaled to move them into the range to keep them from becoming zeros in FP16."
https://docs.nvidia.com/deeplearning/pe ... index.html

My confusion is in the "multiply by 8". I thought it was referring to the learning rate formula.

I think the answer to my confusion is wrapped up in what the learning rate represents (I know it represents how fast the machine learns, but what do the numbers represent?), how it's calculated, and what the "e" in the learning rate formula represents. I thought the "e" represented the epsilon, so I'm interpreting the Nvidia paper to read "8(-7)-5", where 8 is the multiplier referenced in the paper, -7 is the epsilon, and then subtract 5.

I also thought the "e" could represent the epoch, so a learning rate of 8e-5 would be interpreted as: for every 8 epochs, subtract 5.

Or, does "e" represent the exponent? 8 to the -7 power, subtract 5.

I have spent hours researching this, and every paper I've come across of course assumes I'm a data scientist (not a middle-aged failed writer 😆) and therefore assumes I already know what the learning rate formula represents and how it's calculated.

This is the danger when idiots like me read scientific papers well above their head. 😆

If you can answer this question, I promise not to ask another question for one week's time. 🤪😆


Re: [Discussion] How to fix Mixed Precision causing NaNs

Posted: Sun Oct 23, 2022 5:48 pm
by torzdf
MaxHunter wrote: ↑Sat Oct 22, 2022 3:52 pm

"... by three exponent values (multiply by 8) was sufficient to match the accuracy achieved with FP32 training by recovering the relevant values lost to 0. Shifting by 15 exponent values (multiplying by 32K) would recover all but 0.1% of values lost to 0 when converting to FP16 and still avoid overflow. In other words, FP16 dynamic range is sufficient for training, but gradients may have to be scaled to move them into the range to keep them from becoming zeros in FP16."
https://docs.nvidia.com/deeplearning/pe ... index.html

That's loss scaling. Faceswap handles that automatically for you.
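
(For anyone curious what "automatically" looks like: in plain Keras, dynamic loss scaling is just a wrapper around the optimizer, roughly as sketched below. This is an illustration of the general mechanism, not Faceswap's actual code.)

Code: Select all

import tensorflow as tf

base = tf.keras.optimizers.Adam(learning_rate=5e-5, epsilon=1e-4)

# The wrapper scales the loss up before the backward pass, un-scales the
# gradients before they are applied, skips any step whose gradients contain
# inf/NaN, and adjusts the scale dynamically as training progresses.
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(base)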

I thought the "e" represented the epsilon, so I'm interpreting the Nvidia paper to read "8(-7)-5", where 8 is the multiplier referenced in the paper, -7 is the epsilon, and then subtract 5.

I also thought the "e" could represent the epoch, so a learning rate of 8e-5 would be interpreted as: for every 8 epochs, subtract 5.

e is common mathematical notation meaning "times 10 to the power of" (scientific notation). You will see it commonly in Excel, on calculators, etc. It is generally used to represent very big or very small numbers that would not normally fit on a display, so:

Code: Select all

>>> 1e1   # 1 * 10
10.0
>>> 1e2  # 1 * 100
100.0
>>> 1e3  # 1 * 1000
1000.0

>>> 1e-1  # 1 * 0.1
0.1
>>> 1e-2  # 1 * 0.01
0.01
>>> 1e-3  # 1 * 0.001
0.001

In layman's terms it basically means "move the decimal place". The Epsilon Exponent in Faceswap is the number that appears after the e in the examples above.

Better explained here:
https://chortle.ccsu.edu/java5/notes/chap11/ch11_5.html


Re: [Discussion] How to fix Mixed Precision causing NaNs

Posted: Sun Oct 23, 2022 8:53 pm
by MaxHunter

Ohhhhhhhhhhhhhhhhhhhhhhhhhh.
🤔😎😂 Believe it or not, you answered two questions that I've been obsessing over since before the original post, and have therefore also answered a multitude of other related questions.

As promised. No more questions.

...for a week. 😁.

Thank you, my brain got a little less foggy.


Re: [Discussion] How to fix Mixed Precision causing NaNs

Posted: Tue Oct 25, 2022 11:03 pm
by martinf

I just had a NaN with NaN protection on, and the model was still toast for some reason. I turned mixed precision off, adjusted the epsilon and tried to resume. The next preview update was garbage...

Is NaN protection supposed to save the model, or just save electricity by shutting down?


Re: [Discussion] How to fix Mixed Precision causing NaNs

Posted: Thu Oct 27, 2022 10:24 am
by torzdf

It just instantly stops training, without saving the model, if a NaN is seen.

You should always roll back (I normally go back about 50k-100k iterations) and lower the learning rate when resuming: whilst the model has been stopped before a NaN gets into the weights, it is already well on its way to a NaN by the time one appears in the loss output (see Bryan's earlier oscillating bridge example).