[Topic split from: viewtopic.php?t=2058]
I have done a not inconsiderable amount of testing around NaNs recently, really hoping I would find a solution, and I have drawn a blank. Where they get introduced in a model is inconsistent, although they nearly always seem to appear at some point in the forward pass, which gradient clipping cannot help with (clipping acts on gradients in the backward pass, after the NaN has already been produced).
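For anyone wanting to do similar hunting, the basic approach is to run the forward pass layer by layer and stop at the first layer whose output goes non-finite. Below is a minimal framework-agnostic sketch of that idea; the layer functions are hypothetical stand-ins, not real model ops. (In TF 2.x, `tf.debugging.enable_check_numerics()` does something similar at the op level.)

```python
import math

# Hypothetical stand-ins for a model's layers: each takes and returns a
# list of floats. Real layers would be TF/Keras ops; these exist purely
# for illustration.
def dense(xs):
    return [x * 2.0 for x in xs]

def bad_activation(xs):
    # Mimics an op that goes non-finite mid-forward-pass:
    # sqrt of a negative input is replaced with NaN here.
    return [math.sqrt(x) if x >= 0 else float("nan") for x in xs]

def find_first_nan(layers, inputs):
    """Run the forward pass layer by layer and return the index of the
    first layer whose output contains a NaN, or None if the pass is clean."""
    xs = inputs
    for i, layer in enumerate(layers):
        xs = layer(xs)
        if any(math.isnan(x) for x in xs):
            return i
    return None

layers = [dense, bad_activation, dense]
print(find_first_nan(layers, [1.0, -3.0]))  # NaN first appears at layer index 1
```

Instrumenting like this at least tells you *where* the NaN enters, even if (as I found) the answer moves around from run to run.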
The sad fact is that Mixed Precision does increase the chance of NaNs; that is just the nature of using a more limited numerical range (float16 tops out at 65504, so values that are perfectly fine in float32 can overflow to Inf and then propagate as NaN). My focus, more recently, has been on looking at other ways to reduce VRAM usage, to make Full Precision training easier to enable, but Tensorflow is making this particularly hard for me. Methods I could easily introduce in TF1.x have been totally disposed of since TF2.x, and implementing any of them now is somewhere between very difficult and impossible. If I could start this project again I would, undoubtedly, use PyTorch, where implementing these kinds of features is a LOT easier. TF devs just don't appear to care (about this and a myriad of other issues).
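The standard countermeasure for this limited range is dynamic loss scaling, which is what `tf.keras.mixed_precision.LossScaleOptimizer` does under the hood: multiply the loss before backprop so small gradients survive float16, divide the gradients back afterwards, shrink the scale when an overflow is detected, and cautiously grow it after a run of clean steps. A minimal pure-Python sketch of that logic (class name and defaults are my own, not a real TF API):

```python
FP16_MAX = 65504.0  # largest finite float16 value

class DynamicLossScaler:
    """Sketch of dynamic loss scaling as used in mixed-precision training.
    Gradients passed to step() are assumed to already include the loss
    scale factor; step() either returns unscaled gradients or signals a
    skipped update on overflow."""

    def __init__(self, scale=2.0 ** 15, growth_interval=2000):
        self.scale = scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def step(self, grads):
        # Overflow check: any scaled gradient outside the fp16 range.
        if any(abs(g) > FP16_MAX for g in grads):
            self.scale /= 2.0      # back off and skip this update
            self._good_steps = 0
            return None
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= 2.0      # probe a larger scale again
            self._good_steps = 0
        return [g / self.scale for g in grads]  # unscaled gradients

scaler = DynamicLossScaler(scale=4.0)
print(scaler.step([8.0]))   # clean step: returns unscaled gradients [2.0]
print(scaler.step([1e6]))   # overflow: returns None, scale halves to 2.0
```

Note that loss scaling only defends against gradient under/overflow in the backward pass; it does nothing for NaNs born in the forward pass, which is exactly where I keep seeing them.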
The other factor is that bigger and more complex models need lower learning rates. The latest models I'm developing need to start at between 1e-5 and 3e-5. That is also just a matter of fact.
Yet another factor feeding into this is that learning rate should scale with batch size: if you lower the batch size, you should lower the learning rate. This is fairly logical, as smaller batch sizes mean that outliers will have a larger effect on the gradients. As models get larger and more complex, lower batch sizes are unavoidable. There has been some recent research around this, which I have not yet had an opportunity to fully digest: https://arxiv.org/abs/2006.09092
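The rule of thumb above is usually applied as linear scaling: keep the learning rate proportional to the batch size relative to whatever it was tuned at. A one-liner makes the arithmetic concrete (some of the literature argues for a square-root rule instead; this sketch uses the linear form):

```python
def scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling rule: learning rate proportional to batch size,
    so halving the batch halves the learning rate."""
    return base_lr * (new_batch / base_batch)

# If 3e-5 was tuned at batch size 16, dropping to batch size 4 suggests:
print(scaled_lr(3e-5, 16, 4))  # 7.5e-06
```

The base values here are just examples; the point is the ratio, not the specific numbers.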