Nan Question I haven't seen yet

MaxHunter
Posts: 193
Joined: Thu May 26, 2022 6:02 am
Has thanked: 176 times
Been thanked: 13 times

Nan Question I haven't seen yet

Post by MaxHunter »

The bottom line of my question is, does training for shorter periods prevent NaNs?

When a NaN warning occurs, it's usually after a few to several hours of training. If I were to limit training to, say, one hour at a time across 24 separate sessions, would that avoid NaNs while still giving me 24 hours' worth of training?

If the NaN warning is given after 6 hours of training, why doesn't it appear again immediately after restarting? If it was detected before, it should still be there soon after a restart, right? Whatever caused the NaN should already be stored in the backups, so by my (presumably faulty) logic this means it's tied to very long training runs, and therefore shorter training sessions mean a lower likelihood of NaNs.

Except for this article (which doesn't address image learning)

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7597167/

I can't find any other articles that address this question as it relates to image machine learning and training time.

If training time does affect NaN appearance (as the above abstract suggests), would it behoove the program to have a training time/length option to address this issue, i.e. one that automatically stops and restarts training after a certain amount of time or number of iterations?

Again, forgive my ignorance if this is an obvious/stupid question. 😁

Last edited by MaxHunter on Sat Nov 26, 2022 1:13 am, edited 2 times in total.
bryanlyon
Site Admin
Posts: 793
Joined: Fri Jul 12, 2019 12:49 am
Answers: 44
Location: San Francisco
Has thanked: 4 times
Been thanked: 218 times

Re: Nan Question I haven't seen yet

Post by bryanlyon »

The short answer is "probably not but maybe".

One thing we do in our training process is not save out the optimizer weights. The optimizer is part of the model, but saving its weights can triple the size of the model file. Because of that, each time you start training, the optimizer weights are reset and the optimizer begins again from scratch.

It's possible that resetting the optimizer helps avoid NaNs for a while. Or it could actually make them happen sooner. It's hard to say exactly.

Faceswap doesn't have a time cutoff, but it does have an iteration cutoff. While Faceswap has no built-in way to reset the optimizer weights on a schedule, you can simulate the same process: set the iteration count low enough that training stops after an hour or two, then queue up multiple runs by making a batch script containing several copies of the generated command. That would effectively restart training every few hours, and you can see if that helps.
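As a rough, untested sketch of what I mean on Windows (the paths and the iteration figure are just placeholders; swap in the exact training command Faceswap generates for you):

```bat
@echo off
rem Rough sketch only -- replace the line below with the exact training
rem command Faceswap generates for your model. Paths here are placeholders.
rem "-it 10000" caps each run at 10,000 iterations (the iteration cutoff),
rem so ten passes of the loop give ~100,000 iterations total, with the
rem optimizer reset between runs.
set TRAIN_CMD=python faceswap.py train -A "C:\data\faces_A" -B "C:\data\faces_B" -m "C:\models\my_model" -it 10000

for /L %%i in (1,1,10) do (
    echo Starting training run %%i of 10
    %TRAIN_CMD%
)
```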

If you find good results with that, let us know and maybe we will implement a way to reset the optimizer on a schedule.

MaxHunter
Posts: 193
Joined: Thu May 26, 2022 6:02 am
Has thanked: 176 times
Been thanked: 13 times

Re: Nan Question I haven't seen yet

Post by MaxHunter »

That's a better answer than "no"! 🤪

I've written screenplays, short stories, musicals and plays, but never a batch script. I'll figure it out! 😉

May the Nerd-Force be with me.

I'll let you know. 🙂

MaxHunter
Posts: 193
Joined: Thu May 26, 2022 6:02 am
Has thanked: 176 times
Been thanked: 13 times

Re: Nan Question I haven't seen yet

Post by MaxHunter »

Just wanted to give you gents an update.

I totally screwed up writing a batch script. I'm not sure what I did but it was a disaster. 😆🤦. The Nerd-Force was definitely not with me.

So, I've gone back to the old-fashioned way of setting an alarm and restarting training manually, capping each run at 10,000 iterations. So far I've upped the learning rate, and almost two hundred thousand iterations in there's not been one NaN or NaN warning. I'm not saying it's not coming, but I'm starting to think there's something to this.

I've been keeping meticulous notes, and once it hits a million I'm calling it.

torzdf
Posts: 2651
Joined: Fri Jul 12, 2019 12:53 am
Answers: 159
Has thanked: 129 times
Been thanked: 622 times

Re: Nan Question I haven't seen yet

Post by torzdf »

Any information and hard data you can report back on this will be appreciated.

My word is final

MaxHunter
Posts: 193
Joined: Thu May 26, 2022 6:02 am
Has thanked: 176 times
Been thanked: 13 times

Re: Nan Question I haven't seen yet

Post by MaxHunter »

Okay, a quick update. My short-training-time / small-iteration-count experiment is going super well so far. Absolutely no NaNs or warnings yet, and I'm hoping to see 900k today. Since I couldn't get a batch script going, I'm having to restart manually, which is causing some delay.

Ultimately I want to see this reach 1.1 million iterations, which will prove a point I'll bring up at the conclusion, but I'll be happy at 1 million. If this collapses with a NaN, I predict it will happen between 900k and 1.1 million. I have a 3090 being delivered sometime between now and Friday, at which point I'll end the experiment, so I'm unsure if I'll see those numbers. I have a hypothesis and a lot of notes/hard data, and I believe this little experiment at the very least shows a need for more research, if not an actual time/iteration cap and restart-schedule button. I'm also wondering if this is the workaround for the NaNs that come with mixed precision.

(Perhaps @ianstephens @bryanlyon @torzdf or someone more computer/machine learning literate can do further experiments along the same lines.)

Anyway, a question for you two: Bryan noted that when the model shuts down and restarts, the optimizer weights are dropped. Does that mean the model has to take a couple thousand iterations to catch back up?

torzdf
Posts: 2651
Joined: Fri Jul 12, 2019 12:53 am
Answers: 159
Has thanked: 129 times
Been thanked: 622 times

Re: Nan Question I haven't seen yet

Post by torzdf »

MaxHunter wrote: Wed Dec 07, 2022 7:12 pm

Anyway, a question for you two: Bryan noted that when the model shuts down and restarts, the optimizer weights are dropped. Does that mean the model has to take a couple thousand iterations to catch back up?

Pretty much that, yes. It's why you will tend to see the loss graph spike a bit when you restart training.

My word is final
