Shorter Training=Smaller Chances of NaNs; A hypothesis by an Uneducated Idiot


Shorter Training=Smaller Chances of NaNs; A hypothesis by an Uneducated Idiot

Post by MaxHunter »

This is a continuation of viewtopic.php?t=2380.

In the original post I started an experiment with shorter training cycles, because I had found that long training cycles lasting hours at a time were resulting in NaN warnings and, eventually, a full NaN collapse.

For this experiment I used a model, which we'll call Angie 2, that was failing continually. Angie 2 had every opportunity to fail, and I believe it would have failed had it not been for the shorter training cycles. To begin with, Angie 2 was pre-loaded with the weights of a model (we'll call it Angie 1) that completely collapsed after a little over 1 million iterations. Angie 1 was never rolled back when a warning appeared; instead I would just plow ahead, wondering how far I could take it before a complete collapse. I was also continually messing with it (including the architecture, which I found out later can't be changed).

An excerpt from Angie 1's (the pre-weight model's) notes:

10-12-22

Trained: Batch 3; 9hrs/89K/.09+
Trained: Batch 3; 3hrs/28.5+K/904K+/.09

10-22-22

Because of NaNs this model hasn't been very stable and training it has been inconsistent. I plan to take this to 1M iterations and then start a new one.

Currently we are at 978K/A: .0816 B: .0744

10-23-22

Trained - NaN'd out. Start new model. No saving this one.

1.025M/A: .0874; B: .0795

As you can see, it collapsed around 1 million iterations. I used the 900K snapshot to pre-load the weights into a new model (Angie 2), knowing that doing so could lead to another NaN infection.

Angie 2 wasn't just weighted down with bad data from a collapsed model; I further complicated the architecture and loss weights to the point where I couldn't get more than a batch size of 1 (compared to Angie 1's batch size of 3), even with central storage. Both were based on DNY512, but Angie 2 was a modified version.

Angie 2 Model:

[model.phaze_a]
output_size: 512
shared_fc: none
enable_gblock: True
split_fc: True
split_gblock: False
split_decoders: True
enc_architecture: fs_original
enc_scaling: 50
enc_load_weights: False
bottleneck_type: dense
bottleneck_norm: none
bottleneck_size: 512
bottleneck_in_encoder: True
fc_depth: 1
fc_min_filters: 512
fc_max_filters: 512
fc_dimensions: 1
fc_filter_slope: 0.0
fc_dropout: 0.0
fc_upsampler: upsample2d
fc_upsamples: 2
fc_upsample_filters: 128
fc_gblock_depth: 3
fc_gblock_min_nodes: 512
fc_gblock_max_nodes: 512
fc_gblock_filter_slope: -0.05
fc_gblock_dropout: 0.0
dec_upscale_method: upscale_dny
dec_upscales_in_fc: 2
dec_norm: none
dec_min_filters: 16
dec_max_filters: 512
dec_slope_mode: cap_max
dec_filter_slope: 0.5
dec_res_blocks: 0
dec_output_kernel: 1
dec_gaussian: False
dec_skip_last_residual: False
freeze_layers: encoder
load_layers: encoder
fs_original_depth: 8
fs_original_min_filters: 16
fs_original_max_filters: 512
fs_original_use_alt: True
mobilenet_width: 1.0
mobilenet_depth: 1
mobilenet_dropout: 0.001
mobilenet_minimalistic: False

--------- train.ini ---------

[global]
centering: face
coverage: 87.5
icnr_init: False
conv_aware_init: True
optimizer: adam
learning_rate: 1e-05
epsilon_exponent: -7
autoclip: True
reflect_padding: False
allow_growth: False
mixed_precision: True
nan_protection: True
convert_batchsize: 9

[global.loss]
loss_function: ms_ssim
loss_function_2: logcosh
loss_weight_2: 50
loss_function_3: lpips_vgg16
loss_weight_3: 35
loss_function_4: ffl
loss_weight_4: 100
mask_loss_function: mse
eye_multiplier: 2
mouth_multiplier: 4
penalized_mask_loss: True
mask_type: bisenet-fp_face
mask_blur_kernel: 3
mask_threshold: 4
learn_mask: False
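
For anyone puzzling over that loss stack: my reading (an assumption on my part, so check the training guide rather than taking my word for it) is that each extra loss function is scaled by its weight as a percentage and added to the primary loss. In toy form:

--------- loss_stack_toy.py (my reading of the weights, not faceswap code) ---------

# Toy arithmetic for the [global.loss] section above, assuming the weights are
# percentages (100 = full strength).  The per-loss values are made up.
ms_ssim, logcosh, lpips_vgg16, ffl = 0.08, 0.02, 0.30, 0.01

total = (
    ms_ssim                   # primary loss, full weight
    + 0.50 * logcosh          # loss_weight_2: 50
    + 0.35 * lpips_vgg16      # loss_weight_3: 35
    + 1.00 * ffl              # loss_weight_4: 100
)
print(f"combined loss: {total:.4f}")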

This model was doomed to fail: bad pre-weights, a large complex model, a batch size of 1, etc. In fact, it did fail:

Trained: Freeze Weights (old Angela NaN'd 900K snapshot as a "booster"); 2e-5/-7; 6+hrs/57K/57K (Failed NaN); face_a: 0.14257, face_b: 0.09948

Trained: 1e-5/-7; 4.5+hrs/38.5+K/96+K; face_a: 0.11570, face_b: 0.08441 (Failed NaN)

11.21.22

Trained: 8e-6/-7; 4+hrs/35+K; face_a: 0.11147, face_b: 0.07751 (Failed NaN)

Trained: 7e-6/-7; 2+hrs/17.5K/181K; face_a: 0.10135, face_b: 0.07386

11.22.22

Trained: 7e-6/-7; 4hrs/34K/239K ; face_a: 0.09953, face_b: 0.07264

11.23.22

Trained: 7e-6/-7; 3hrs/25+K/264+K; .096 (Failed NaN)

Trained: 6e-6/-7 (If this fails roll back to 100K and start at 6e-6. This is the last stable number for Sara Model)
3+hrs/26K/266K ; A: .095+

Aligned and remasked: 0-550
Faces regenerated

11.24.22

Rolled back to 100K
Trained: 6e-6/-7; Failed NaN @130K. Rolled back to 75K, lowered learning rate to 5e-6/-7

11.25.22

Trained: 5e-6/-7; Failed NaN
Went back to 300K, lowered to 1e-6/-7. Play this until you can't any longer. Goal is to hit 1.2M prior to No Warp. Next up is to lower the epsilon to -6.

Trained: 1e-6/-7; 1.5hrs+/14.25K/314K+; face_a: 0.08857, face_b: 0.06214

Trained: 1e-6/-7; 4.5+hrs/39.5+K/353+K (Failed NaN)

I couldn't get a stable model even at the lowest learning rate of 1e-6, and I even lowered the epsilon (which isn't noted above).

That's when it occurred to me to try something different and do shorter training cycles. Instead of one cycle that lasts for hours, I wondered if doing 24 shorter one-hour cycles would give me results as good as a single 24-hour cycle. I raised the learning rate back to 1e-5/-7 and started off with cycles of 6,000 iterations. That worked. I eventually raised it to 11,500 iterations (1 hour 22 minutes) with no NaN warnings and, so far, no image degradation.
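
For anyone who wants to try the same thing without babysitting the GUI, here is a rough sketch of what I mean by short cycles: just re-launch the command-line trainer in a loop with a low iteration limit. This is my own throwaway script, not anything official; the paths, trainer name and iteration count are placeholders, so double-check the flags against "python faceswap.py train -h" on your own install before relying on it.

--------- short_cycles.py (rough sketch, placeholder paths/flags) ---------

# Run training as a series of short cycles instead of one long session.
import subprocess

CYCLES = 24               # e.g. 24 short cycles instead of one 24-hour run
ITERS_PER_CYCLE = 11500   # roughly two epochs per side for my ~5,700-image sets

cmd = [
    "python", "faceswap.py", "train",
    "-A", "/path/to/faces_a",        # placeholder input folders
    "-B", "/path/to/faces_b",
    "-m", "/path/to/model_dir",      # placeholder model folder
    "-t", "phaze-a",
    "-it", str(ITERS_PER_CYCLE),     # stop each session after this many iterations
]

for cycle in range(CYCLES):
    print(f"--- cycle {cycle + 1} of {CYCLES} ---")
    if subprocess.run(cmd).returncode != 0:
        print("Trainer exited with an error (possibly NaN protection); stopping.")
        break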

Notes:

11.26.22

Trained x2: 1e-5/-7; 79Min/11K/367K; face_a: 0.09965, face_b: 0.07019

Trained 3x: 1e-5/-7; 1+hrs/10.5K+

11.27.22

Trained 8X: 1e-5/-7; 8+hrs/53.5+K

11.28.22

Trained 3X: 1e-5/-7; 3hrs/25K/465.5K+/face_a: 0.09068, face_b: 0.06945

Trained 3X: 1e-5/-7; 3hrs/23.5+K/489.5K; face_a: 0.09164, face_b: 0.06728

11.30.22

Trained 8X: 1e-5/-7; 6.5+hrs/58.5K/546+K; face_a: 0.09080, face_b: 0.06523

Upped iterations to 11,500 based on the number of pictures (5,707 and 5,650). This should give each side two complete epochs.

12.01.22

Trained 8X: 1e-5/-7; 7.5+hrs/60K/600K; face_a: 0.08676, face_b: 0.06371 (2pm)

12.02.22

Trained 3X: 3hrs+/29K+/618K+; face_a: 0.08618, face_b: 0.06286 (1:11PM)

12.03.22

Trained 5X: 1e-5/-7; 5hrs/35K+/665K; face_a: 0.08594, face_b: 0.06248 (1:09 PM)

12.04.22

Trained 5x: 1e-5/-7; 4.5+hrs/40.5K/704K+; face_a: 0.08507, face_b: 0.06162 (1:10 PM)

12.05.22

Trained 7x: 1e-5/-7; 7hrs/60K/764.5K; face_a: 0.08448, face_b: 0.05999 (12:45 AM)

12.07.22

Trained 8X: 1e-5/-7; 6.5+hrs/57K+/822K; face_a: 0.08247, face_b: 0.05938 (1:37 PM)

Trained 7X: 1e-5/-7; 9.5hrs/79K/901K; face_a: 0.08210, face_b: 0.05969 (11:22PM)

12.09.22

Trained 14x: 1e-5/-7; 16+hrs/130.5+K/1.035M; face_a: 0.07785, face_b: 0.05763

So far the training has surpassed the simpler Angie 1 with no NaNs, warnings, or rollbacks. As I write, training continues at 1.05 million iterations, surpassing Angie 1's iteration count and loss values (currently face_a: 0.07863, face_b: 0.0588).

Why didn't it fail?

I am by no means an expert, and I will fully admit my ignorance when it comes to machine learning or face swapping. However, my hypothesis is that it has to do with the model having to reset its state after every training cycle. As mentioned by @bryanlyon in the previous post, the model has to reset (otherwise it would be too large to store), and this causes it to continually have to catch back up when restarted, as confirmed by @torzdf. Imagine you had a long list of equations to solve, and after every ten equations you were forced to go back three equations to re-check your work. It would probably make your work more accurate, right? I believe that's what's happening: the model is baby-stepping across a wiggly bridge and double-checking its steps every time it restarts. Shorter training cycles = better accuracy.
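
To make the "catching back up" part concrete, here is a toy numpy sketch of what I am assuming gets thrown away on a restart, namely the optimizer's running statistics. This is just the textbook Adam update with made-up numbers, not faceswap's actual code, and whether this is really the mechanism is exactly the open question; all it shows is that the first update of a fresh cycle is computed from the new gradients alone rather than from hours of accumulated history.

--------- adam_restart_toy.py (toy numbers, not faceswap code) ---------

# Adam keeps running estimates of the gradient mean (m) and squared gradient (v)
# and scales every weight update by them.  Starting a new cycle with m = v = 0
# means those estimates get rebuilt from scratch.
import numpy as np

def adam_update(g, m, v, t, lr=1e-5, beta1=0.9, beta2=0.999, eps=1e-7):
    """One textbook Adam step for gradient g; returns (update, new_m, new_v)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)      # bias-corrected mean
    v_hat = v / (1 - beta2 ** t)      # bias-corrected second moment
    return lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Simulate a long session on noisy gradients to build up m and v...
rng = np.random.default_rng(0)
m = v = 0.0
for t in range(1, 100_001):
    _, m, v = adam_update(rng.normal(0.0, 5.0), m, v, t)

g_next = 0.5  # the next gradient either model would see

carried, _, _ = adam_update(g_next, m, v, 100_001)  # session kept running
fresh, _, _ = adam_update(g_next, 0.0, 0.0, 1)      # first step after a restart

print(f"update with carried optimizer state: {carried: .2e}")
print(f"update right after a restart:        {fresh: .2e}")

Note that the fresh-state update is not necessarily smaller, just computed without the history, so take the "more accurate" part above as the hypothesis it is.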

But it's not just this uneducated idiot who's wondering about this:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7597167/

The National Center for Biotechnology Information hosts a paper titled:

"Can Short and Partial Observations Reduce Model Error and Facilitate Machine Learning Prediction?"

Granted, this has nothing to do with image learning, but they came to the same conclusions for machine-learning prediction.

I'm only providing the data, and I'm by no means claiming I'm right, but at the very least I think it shows there's something here to research and contemplate.

But it also brings up other questions, like: how do short training cycles relate to low batch sizes? We already know that smaller batches need smaller learning rates, so why was my model failing at the lowest learning rates, yet when I raised the learning rate and gave it a shorter training cycle there were no problems (as of yet)? Is there more at play than just learning rates? Or is this tied to VRAM? I have a 3080 Ti with 12 GB of memory on a Windows machine, and my model couldn't get more than a batch size of one, even with central storage. Could shorter training cycles be the key to the problems that occur at small batch sizes due to VRAM? Or how about mixed precision? We already know that mixed precision is a love/hate relationship. Is this the answer to the mixed precision problem? Again, I am running mixed precision, with a borderline-large model, at a batch size of one; all of these parameters are a problem, yet I had no problems when using shorter training cycles.
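
On the mixed precision point specifically: the standard trick frameworks use to keep float16 training out of NaN territory is loss scaling. Multiply the loss (and therefore the gradients) by a large factor so tiny gradients don't vanish in float16, divide the factor back out in float32 before the weight update, and skip the update entirely if anything comes back inf/NaN. The sketch below is just that general mechanism with toy numbers; it is not a claim about how faceswap implements its mixed_precision or nan_protection options.

--------- loss_scaling_toy.py (toy numbers, not faceswap internals) ---------

import numpy as np

true_grad = np.float32(1e-8)          # a gradient too small for float16

# Without scaling, the float16 copy flushes it to zero (underflow).
print(np.float16(true_grad))          # -> 0.0

# With loss scaling, the gradient survives float16 and is divided back out
# in float32 before the weight update.
scale = np.float32(2.0 ** 12)
scaled = np.float16(true_grad * scale)
recovered = np.float32(scaled) / scale
print(scaled, recovered)              # -> roughly 4.1e-05 and 1e-08

# Dynamic loss scaling adds one rule: if the scaled gradients ever overflow to
# inf/NaN, skip that weight update and halve the scale instead of letting the
# bad values reach the model.
overflowed = np.float16(70000.0)      # float16 tops out around 65504 -> inf
if not np.isfinite(overflowed):
    scale /= 2
    print("overflow detected: skip the update, new scale =", scale)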

As I said, I am an idiot. I am like a high-school freshman suggesting a solution to a complex problem to grad students. With that being said, based on my data, I am officially submitting a pull request for a training-cycle option, and at the very least I think this is a subject that should be considered for further research.

I have all my .jsons for both Angie 1 and 2, and I believe the logs too (though I may have accidentally deleted Angie 2's early logs), if that type of hard data is desired, but I think it just repeats what I've posted. ;)

Also, please correct me if I am wrong with any information posted.


Re: Shorter Training=Smaller Chances of NaNs; A hypothesis by an Uneducated Idiot

Post by bryanlyon »

Very interesting. If someone else can try and verify these results I'd appreciate it. I don't typically get many NaNs so cannot test this well. This may justify an option in the training script if others can replicate.


Re: Shorter Training=Smaller Chances of NaNs; A hypothesis by an Uneducated Idiot

Post by MaxHunter »

I'll tell you this much: I almost want to go back to Angie 1 and see if shorter training times can cure the NaN at 1.025 million, just for the hell of it. 😉


Re: Shorter Training=Smaller Chances of NaNs; A hypothesis by an Uneducated Idiot

Post by torzdf »

Possibly related (for my reference):
https://paperswithcode.com/method/cosine-annealing
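
For reference, the schedule behind that link (cosine annealing with warm restarts) fits in a few lines. This is a simplified fixed-length version with placeholder numbers, shown only to illustrate the shape: the learning rate decays along a cosine within each cycle and jumps back up when a new cycle starts, a scheduled cousin of the repeated short cycles described above.

--------- cosine_restarts_sketch.py (placeholder numbers) ---------

import math

def cosine_annealing_with_restarts(step, cycle_length, lr_max=1e-5, lr_min=1e-6):
    """Learning rate at a global step, restarting the cosine every cycle_length steps."""
    t_cur = step % cycle_length
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / cycle_length))

# e.g. an 11,500-iteration cycle like the ones in the notes above
for it in (0, 5750, 11499, 11500):
    print(it, f"{cosine_annealing_with_restarts(it, 11500):.2e}")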



Re: Shorter Training=Smaller Chances of NaNs; A hypothesis by an Uneducated Idiot

Post by MaxHunter »

I just wanted to post an update, as I've now reached 1.5 million iterations.

Since the last post I have experienced two NaN warnings, both after I raised the iteration limit. The first time, I raised the iterations to 30k and received a warning around 28k on the second cycle. I then lowered the iterations to 20k and, after numerous cycles, finally got a warning last week at 18k (interesting that both were 2k short of the limit; probably a coincidence, but interesting nonetheless). I have not received one since and have kept the limit around 12k.

I have a theory that these NaN warnings are due to the VRAM possibly fragmenting. Thoughts? Is that possible?

My thinking is that the VRAM is holding on to the equations and dropping certain numbers to make room when it runs out of memory. Kind of like 1+1+1+[dropped number]+1=12, causing the program to insert a number that doesn't make sense. Is this possible?
(And forgive me if the answer to that question is "duh!" 😆)


Re: Shorter Training=Smaller Chances of NaNs; A hypothesis by an Uneducated Idiot

Post by torzdf »

MaxHunter wrote: Tue Dec 20, 2022 7:57 pm

I have theory that these NaN warnings are due to the VRAM possibly fragmenting. Thoughts? Is that possible?

Very unlikely.

NaNs will be numerical over/underflows due to the precision.
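
In concrete terms (a quick numpy illustration, nothing faceswap-specific): float16, which mixed precision uses for much of the maths, tops out around 65504, and once an overflow produces an inf, one more operation is enough to turn it into a NaN that then spreads through everything it touches.

--------- nan_from_precision.py ---------

import numpy as np

x = np.float16(60000.0) * np.float16(2.0)   # overflow: float16 max is ~65504
print(x)                                    # -> inf
print(x - x)                                # inf - inf -> nan
print(np.float16(0.0) / np.float16(0.0))    # 0 / 0 -> nan
# Once a single weight or activation is nan, every value computed from it is
# nan too, which is the collapse that nan_protection is there to catch.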

