AutoClip: Any Feedback From Users?

Post by **ianstephens** » Fri Jul 01, 2022 3:46 pm

Just playing about with the recently introduced AutoClip feature.

How strong an effect does this have on NaNs? Or is it still worth tapering down the EE with mixed-precision training?

Does anyone have any feedback on this feature so far with very complex models while using mixed precision?

I might also ask: has anybody noticed it affecting training accuracy?

Thank you for any feedback!

Post by **torzdf** » Fri Jul 01, 2022 4:06 pm

My experience has been that it has not been the NaN saviour that I had hoped it would be, so my battle to combat NaNs continues.

I have seen no negative impact on accuracy, though my sample set is small.

This is a link to the original paper, if anyone wishes to investigate the claims themselves:

https://arxiv.org/abs/2007.14469

Post by **torzdf** » Sun Jul 03, 2022 10:49 am

FWIW I have been doing a lot of debugging around NaNs. They do happen in ML models, but I wanted to get a better understanding of why they occur.

This has been a time-consuming and laborious process. However, my investigations have led me to conclude that the main issue is within the Keras/TF implementation of Batch Normalization. This appears to be a known issue in shared layers (which is where our BatchNorm exists: in the shared encoder). But the issue appears to have been closed with no action taken:

https://github.com/keras-team/keras/issues/11927

I am now running more tests to confirm this is the issue.

The main challenge I now face is how to mitigate this. As this is an embedded layer in many of the Keras Encoders, this will be non-trivial, if it is even possible at all.

I will update with any further findings.

Post by **torzdf** » Tue Jul 05, 2022 6:34 pm

Just to follow up: I no longer think this is an issue we are facing. I ran some tests, and could see no noticeable difference between shared and unique layers.

I have, however, read this in Nvidia's guide on Mixed Precision:

"While many networks match FP32 training results when all tensors are stored in FP16, some require updating an FP32 copy of weights. Furthermore, values computed by large reductions should be left in FP32. Examples of this include statistics (mean and variance) computed by batch-normalization, SoftMax."

So I am currently investigating forcing BN layers to fp32. However, I do not think that this will solve the issue, as in my recent tests on a NaN model, the NaNs were getting introduced in the Decoder. This was occurring during the forward pass, which AutoClip would not resolve.
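To illustrate the point in the Nvidia quote about keeping large reductions in FP32, here is a minimal sketch in plain Python. It is not Faceswap or Keras code: `round_to`, `fp16`, `fp32` and `variance` are hypothetical helpers, and the activation values are made up. It emulates half and single precision by round-tripping through `struct`, then computes a batch variance the way a BatchNorm layer would need to.

```python
import math
import struct


def round_to(fmt, x):
    """Round x to the nearest value of a struct float format ('<e' FP16, '<f' FP32)."""
    try:
        return struct.unpack(fmt, struct.pack(fmt, x))[0]
    except OverflowError:
        # struct refuses finite values outside the format's range;
        # in real FP16/FP32 arithmetic the result would be +/- infinity.
        return math.copysign(math.inf, x)


def fp16(x):
    return round_to("<e", x)


def fp32(x):
    return round_to("<f", x)


def variance(values, rnd):
    """Population variance with every intermediate result rounded by rnd."""
    n = len(values)
    total = 0.0
    for v in values:
        total = rnd(total + rnd(v))
    mean = rnd(total / n)
    sq_total = 0.0
    for v in values:
        diff = rnd(rnd(v) - mean)
        sq_total = rnd(sq_total + rnd(diff * diff))
    return rnd(sq_total / n)


# Made-up activations: each value fits comfortably in FP16 ...
activations = [300.0, -300.0] * 8

# ... but the squared deviation (90000) exceeds the FP16 maximum of 65504.
var_fp16 = variance(activations, fp16)
var_fp32 = variance(activations, fp32)

print("variance in FP16:", var_fp16)
print("variance in FP32:", var_fp32)
```

Every input is representable in FP16, yet the statistic overflows, which is the kind of failure that keeping BN statistics in FP32 is meant to avoid. It would not help with NaNs that originate elsewhere, such as in the Decoder.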

It may just be that I need to accept that bigger models need lower learning rates, especially when Mixed Precision is used.

Post by **ianstephens** » Sun Jul 17, 2022 10:17 pm

Using mixed precision, I am finding in some tests that even with a very low Learning Rate (3e-05 or less) on large models, the only way to mitigate NaNs is to set the EE to -4 or less. Lowering the learning rate alone does not seem to fix the NaNs, but setting the EE to a low number does seem to help mitigate the issue, albeit at the cost of a less-than-perfect model that is a lot slower to train.
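A rough sketch of why changing the EE can slow training, under two assumptions that are not stated in this thread: that EE is the exponent of the optimizer's epsilon (epsilon = 10^EE), and that the optimizer is Adam. `adam_first_step` is a hypothetical helper and the gradient value is made up. It computes the size of the very first Adam update for a single parameter.

```python
import math


def adam_first_step(grad, lr, epsilon_exponent, beta1=0.9, beta2=0.999):
    """Size of the first Adam update for one parameter (assumed optimizer)."""
    eps = 10.0 ** epsilon_exponent           # assumption: EE is epsilon's exponent
    m = (1.0 - beta1) * grad                 # first-moment estimate after one step
    v = (1.0 - beta2) * grad * grad          # second-moment estimate after one step
    m_hat = m / (1.0 - beta1)                # bias correction at step 1
    v_hat = v / (1.0 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)


tiny_grad = 1e-6      # made-up small gradient
lr = 3e-5             # the learning rate mentioned above

step_small_eps = adam_first_step(tiny_grad, lr, epsilon_exponent=-7)
step_large_eps = adam_first_step(tiny_grad, lr, epsilon_exponent=-4)

print("update with epsilon 1e-7:", step_small_eps)
print("update with epsilon 1e-4:", step_large_eps)
print("ratio:", step_small_eps / step_large_eps)
```

With a larger epsilon the denominator no longer shrinks towards zero for small gradients, so the update is far smaller. That is consistent with fewer numerical blow-ups and with slower training, though it is only an illustration and not a diagnosis of any particular model.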

Post by **MaxHunter** » Sun Jul 31, 2022 10:24 pm

I've tried looking this up and scanned the "paper," but in layman's terms, what does autoclip do, and how is it applicable?

Post by **torzdf** » Mon Aug 01, 2022 12:44 am

Gradient clipping is a mechanism to help prevent exploding/vanishing gradients (that is, numbers that go to +/- infinity or to 0). Both of these will cause a model to NaN. Mixed Precision is more prone to this, as infinity in limited precision space is a smaller number than infinity in full precision space. This doesn't sound like it makes sense, but think of infinity as any number that cannot be represented at a given numerical precision.
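The idea that "infinity" arrives sooner at lower precision can be shown with a minimal sketch in plain Python. `to_fp16` and `to_fp32` are hypothetical helpers that emulate the two formats by round-tripping through `struct`; nothing here is Faceswap code.

```python
import math
import struct


def _round_trip(fmt, x):
    """Round x to a struct float format, mapping out-of-range values to +/- inf."""
    try:
        return struct.unpack(fmt, struct.pack(fmt, x))[0]
    except OverflowError:
        # struct raises for finite values that do not fit; hardware would give inf.
        return math.copysign(math.inf, x)


def to_fp16(x):
    return _round_trip("<e", x)


def to_fp32(x):
    return _round_trip("<f", x)


largest_fp16 = to_fp16(65504.0)     # the largest finite FP16 value
overflow_fp16 = to_fp16(70000.0)    # too big for FP16
same_in_fp32 = to_fp32(70000.0)     # still an ordinary number in FP32
underflow_fp16 = to_fp16(1e-8)      # too small for FP16, flushes to zero

# Once an infinity exists, ordinary arithmetic can turn it into NaN.
nan_from_inf = overflow_fp16 - overflow_fp16

print(largest_fp16, overflow_fp16, same_in_fp32, underflow_fp16, nan_from_inf)
```

The same value, 70000, is an ordinary number in FP32 but already "infinity" in FP16, and subtracting two such infinities is one of the ways a NaN can appear.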

There are several methods to clip gradients. You can clip-max (i.e. clip all numbers at 1.0) or you can clip gradients to an adjusted norm. Most ML libraries expect you to give them a number to clip to, but the right number really is data-dependent. AutoClip is a mechanism for scanning the distribution of gradient norms seen during training, and automatically adjusting the clipping value based on what it sees in the data.
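The two fixed-threshold approaches mentioned above can be sketched in plain Python as follows. This is a generic illustration, not the implementation of any particular library; `clip_by_value` and `clip_by_global_norm` are hypothetical names and the gradient values are made up.

```python
import math


def clip_by_value(grads, limit):
    """Clamp every gradient element into [-limit, +limit]."""
    return [max(-limit, min(limit, g)) for g in grads]


def clip_by_global_norm(grads, max_norm):
    """Rescale the whole gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)          # already small enough, leave untouched
    scale = max_norm / norm
    return [g * scale for g in grads]


grads = [3.0, -4.0]                 # made-up gradient with L2 norm 5.0

by_value = clip_by_value(grads, 1.0)
by_norm = clip_by_global_norm(grads, 1.0)

print("clipped by value:", by_value)
print("clipped by norm: ", by_norm)
```

Clipping by value changes the direction of the gradient (both elements end up at the limit), while clipping by norm keeps the direction and only shortens the step. In both cases the threshold of 1.0 is a number the user has to pick ahead of time.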

This probably still doesn't make a whole lot of sense, but it's the best that I can explain it for now. It's basically adaptive, rather than expecting me/the user to come up with an arbitrary number ahead of time.
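As a rough sketch of the adaptive idea, based on my reading of the linked paper rather than on Faceswap's own code: keep a history of the gradient norms seen so far, and at every step clip to a chosen percentile of that history. `AutoClipper` is a hypothetical class, the percentile of 50 is an arbitrary choice for the demonstration, and the gradients are made up.

```python
import math


class AutoClipper:
    """Clip each gradient to a percentile of all gradient norms seen so far."""

    def __init__(self, percentile=50.0):
        if not 0.0 <= percentile <= 100.0:
            raise ValueError("percentile must be between 0 and 100")
        self.percentile = percentile
        self.history = []           # L2 norm of every gradient observed

    def _threshold(self):
        """Percentile of the history, linearly interpolated between ranks."""
        ordered = sorted(self.history)
        rank = (len(ordered) - 1) * self.percentile / 100.0
        lower = math.floor(rank)
        upper = math.ceil(rank)
        return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

    def clip(self, grads):
        norm = math.sqrt(sum(g * g for g in grads))
        self.history.append(norm)   # the current norm joins the history first
        threshold = self._threshold()
        if norm <= threshold:
            return list(grads)
        scale = threshold / norm
        return [g * scale for g in grads]


clipper = AutoClipper(percentile=50.0)
for step_grads in ([3.0, 4.0], [6.0, 8.0], [1.5, 2.0]):
    clipper.clip(step_grads)        # ordinary steps with norms 5.0, 10.0, 2.5

# A sudden spike with norm 500 is pulled back to the running median norm.
clipped_spike = clipper.clip([300.0, 400.0])
spike_norm = math.sqrt(sum(g * g for g in clipped_spike))

print("clipped spike:", clipped_spike, "norm:", spike_norm)
```

No threshold is supplied by the user: the spike is scaled back to a size typical of what training has produced so far, and the threshold keeps moving as more gradients are observed. Note that this only acts on gradients, so it would not catch values that overflow during the forward pass.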

I may add the other clipping mechanisms into Faceswap, just because it's an easy add, but I would expect autoclip to work better.

Post by **MaxHunter** » Mon Aug 01, 2022 3:25 am

Thanks, man. This is how we learn!

Faceswap Forum
