
NaN Question I haven't seen yet

Posted: Fri Nov 25, 2022 8:52 pm
by MaxHunter

The bottom line of my question is, does training for shorter periods prevent NaNs?

When a NaN warning occurs, it's usually after a few to several hours of training. If I were to limit training to, say, one hour at a time across 24 separate sessions, would that avoid NaNs while still giving the benefit of 24 hours' worth of training?

If the NaN warning shows up after 6 hours of training, why doesn't it show up again immediately after restarting? If a NaN was detected before, it should still be there soon after restarting, right? The NaN should already be stored in the backups, so by my (presumably faulty) logic this means it's tied to long training runs, and therefore shorter training sessions mean less likelihood of NaNs.

Except for this article (which doesn't address image learning)

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7597167/

I can't find any articles relating to image machine learning and training time that answer this question.

If training time does affect the appearance of NaNs (as the above abstract suggests), would it behoove the program to have a training time/length option to address this issue, where training is automatically stopped and restarted after a certain amount of time or number of iterations?

Again, forgive my ignorance if this is an obvious/stupid question. 😁


Re: NaN Question I haven't seen yet

Posted: Fri Nov 25, 2022 10:51 pm
by bryanlyon

The short answer is "probably not but maybe".

One thing we do in our training process is not save out the optimizer weights. The optimizer is part of the model, but saving its weights can triple the size of the model. Because of that, each time training starts the optimizer weights are reset and the optimizer begins again.
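
For illustration, here's a minimal Keras-style sketch of that trade-off. This is an assumption, not Faceswap's actual save code; the point is just that the optimizer (Adam here) carries extra tensors per trainable weight, and leaving them out of the save file shrinks it and means the optimizer starts fresh on the next run.

Code: Select all

# Minimal sketch (not Faceswap's actual code): saving a Keras model
# without its optimizer state.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(128,)),
    tf.keras.layers.Dense(128),
])
model.compile(optimizer="adam", loss="mse")

# Adam keeps extra slot tensors (momentum and variance) for every trainable
# weight, so including them greatly inflates what gets written to disk.
model.save("model_no_opt.h5", include_optimizer=False)

# When this file is loaded later, the optimizer is built from scratch,
# which is the same effect as the restart behaviour described above.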

It's possible that the optimizer being reset helps avoid NaNs for a while. Or it could actually make them happen sooner. It's hard to say exactly.

Faceswap doesn't have a time cutoff, but it does have an iteration cutoff. While Faceswap does not have a way to reset the optimizer weights on a schedule or anything like that, you could simulate the same process by setting the number of iterations low enough that training stops after an hour or two. You could then queue up multiple runs by making a batch script with multiple copies of the generated command (a rough sketch of the idea is below). This would effectively restart training every few hours, and you can see if that helps.
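
The simplest version is just a batch script that repeats the generated training command, one copy per run. A small Python loop does the same job; the sketch below is an assumption rather than anything built into Faceswap, and the command in it is a placeholder you would swap for the exact command the GUI generates for you (with the iteration limit set low).

Code: Select all

# Sketch of queueing several short training runs back to back.
import subprocess

# Placeholder - replace with the full training command generated by the
# Faceswap GUI, with its iteration limit set low enough to stop after
# an hour or two.
TRAIN_CMD = ["python", "faceswap.py", "train"]

RUNS = 24  # e.g. 24 short runs instead of one long one

for run in range(RUNS):
    print(f"Starting run {run + 1} of {RUNS}")
    # Each call is a fresh process, so the optimizer state is rebuilt from
    # scratch, just as it is after a manual restart.
    subprocess.run(TRAIN_CMD, check=True)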

If you find good results with that, let us know and maybe we will implement a way to reset the optimizer on a schedule.


Re: NaN Question I haven't seen yet

Posted: Sat Nov 26, 2022 2:18 am
by MaxHunter

That's a better answer than "no"! 🤪

I've written screenplays, short stories, musicals and plays, but never a batch script. I'll figure it out! 😉

May the Nerd-Force be with me.

I'll let you know. 🙂


Re: NaN Question I haven't seen yet

Posted: Tue Nov 29, 2022 4:07 am
by MaxHunter

Just wanted to give you gents an update.

I totally screwed up writing a batch script. I'm not sure what I did, but it was a disaster. 😆🤦 The Nerd-Force was definitely not with me.

So, I've gone back to the old-fashioned way of setting an alarm and restarting training manually, capping each run at 10,000 iterations. So far I've upped the learning rate, and almost two hundred thousand iterations in there hasn't been one NaN or NaN warning. I'm not saying it's not coming, but I'm starting to think there's something to this.

I've been keeping meticulous notes, and once it hits a million I'm calling it.


Re: NaN Question I haven't seen yet

Posted: Wed Nov 30, 2022 5:04 pm
by torzdf

Any information and hard data you can report back on this will be appreciated.


Re: NaN Question I haven't seen yet

Posted: Wed Dec 07, 2022 7:12 pm
by MaxHunter

Okay, a quick update. My short-training-time/small-iteration-count experiment is going super well so far: absolutely no NaNs or warnings, and I'm hoping to see 900k today. Since I couldn't get a batch script going, I'm having to restart manually, which is causing some delay.

Ultimately I want to see this reach 1.1 million iterations, which will prove a point I'll bring up at the conclusion, but I'll be happy at 1 million. If this collapses from a NaN, I predict it will be between 900k and 1.1 million. I have a 3090 being delivered sometime between now and Friday, at which point I'll end the experiment, so I'm unsure if I'll see those numbers. I have a hypothesis and a lot of notes/hard data, and I believe this little experiment at the very least proves a need for more research, if not an actual time/iteration cap and restart-schedule button. I'm also wondering if this is the workaround for NaNs with mixed precision.

(Perhaps @ianstephens @bryanlyon @torzdf or someone more computer/machine learning literate can do further experiments along the same lines.)

Anyway, a question for you two: Bryan noted that when the model shuts down and restarts, the optimizer weights are dropped. Does that mean the model has to take a couple thousand iterations to catch back up?


Re: NaN Question I haven't seen yet

Posted: Thu Dec 08, 2022 12:07 pm
by torzdf

MaxHunter wrote: Wed Dec 07, 2022 7:12 pm

Anyway, a question for you two: Bryan noted that when the model shuts down and restarts, the optimizer weights are dropped. Does that mean the model has to take a couple thousand iterations to catch back up?

Pretty much that, yes. It's why you will tend to see the loss graph spike a bit when you restart training.