Hello all, I've been toying with faceswap for a few weeks now. Many thanks to the dev team. I own a couple of SW integration shops and appreciate the hard work (and IQ) that goes into this sort of project. That said, I'm basically experimenting with different methods and the performance of each (both in speed and quality). I have four sets of source and target faces I'm using to test. The targets are mostly the same: mid-quality photos downloaded in bulk from Shutterstock. The sources range from 15K low-quality faces ripped from random videos to 5K high-quality faces from some poor couple's wedding pictures (that's what you get for posting stuff on the interwebs!). I have two dedicated PCs running processes 24/7, but only check them once or twice daily.
Rig 1 - Intel i9 (9th gen) with NVIDIA RTX 2070 8GB
Rig 2 - Intel (12th gen) with NVIDIA RTX 2070 12GB
Both have 32GB RAM and Gen 4 NVMe SSDs (it's good to own IT companies!)
My problem is that I always hit NaN errors that shut down the tests before I reach 1M iterations. I try to roll back, sometimes by 200K iterations, and still the NaNs show up in the same rough area, between 500K and 600K iterations.
That's a long lead-up to my actual question: what are the best methods to avoid NaNs? I've tried reducing the learning rate, batch size, etc. based on some posts I've read, but nothing seems to totally solve the issue. I'm not so worried about processing speed, since I can leave these rigs cranking for a week if I want to. What I need is reliability, so I can compare quality across different models, applications, etc.
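For what it's worth, my understanding from reading around is that these NaNs usually come from exploding gradients, and the standard mitigation beyond lowering the learning rate is clipping the global gradient norm before each optimizer step (I believe faceswap's model settings expose something along these lines, but check the GUI; the names below are just my own illustration, not faceswap's actual code):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradient arrays so their combined L2 norm never
    exceeds max_norm. This caps the size of any single update step,
    which is the usual defence against gradients exploding into NaNs.
    Toy NumPy sketch only -- real frameworks do this for you."""
    total_norm = np.sqrt(sum(np.sum(np.square(g)) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

# Example: a gradient of norm 5.0 gets scaled down to norm 1.0
clipped, norm = clip_by_global_norm([np.array([3.0, 4.0])], max_norm=1.0)
```

The nice property of clipping the *global* norm (rather than each tensor separately) is that the update direction is preserved; only its magnitude is capped.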
Any advice is appreciated. I mean other than "find a better use of my time."
FYI, since I've never had a batch complete 1M iterations, I'm splitting the work: Rig 1 is running Dfaker models and Rig 2 is running Villain (because of GPU constraints).
Cheers!