Comparing loss functions with descriptions and examples

Post by **abigflea** » Fri Sep 18, 2020 2:03 am

As I've begun to learn Faceswap and its parts, I started paying attention to this "loss function" mentioned in the training options.

I will try to characterize below what I learned.

TL;DR: I like Smoothed and SSIM. Most functions gave roughly the same quality, though the results were not identical.

A big disclaimer: I did plagiarize mercilessly from the internet. Why waste time recreating what other people have already done to explain this? I'm not in college or getting graded.

The loss function is basically how the neural network grades itself on whether it's doing a good job, whether it's learning or not. Imagine doing a math test but never being told whether your answers were right or wrong. The neural network needs feedback.
Example of a broken loss function.

The graph below visually depicts how this is done. What you want is for the neural network model to reach the global minimum, meaning it won't learn any more.

As your neural network trains/learns it tries to go lower and lower on the graph using these functions.
Imagine it's like a ball rolling down a hill. At the lowest point, it has found the best solution: the best swap.

In reality the graph is not all that smooth. There are bumps and valleys along the way, places that are not quite perfect.
The graph below shows what is actually being dealt with while training (a bit exaggerated here).

There is no absolute perfect. This is subjective, based on perception, and requires a clear degree of artistry.
To get as close to perfect as possible, your loss has to reach a minuscule number, the global minimum, which you would have to zoom in on to see, and the 'ball' does jump around a bit randomly.

As that ball goes down the slope, the gradient, it can get stuck in places that are not quite the best.

A higher learning rate gets the ball to the bottom quicker, but it won't settle down completely. It might even jump back up into a less than optimal position for a while.
A lower learning rate slides down the slope/gradient more slowly, maybe even meticulously, but can get stuck along the way and never make it to the global minimum loss.
For Faceswap, just leave learning rate at default. Only adjust it in specific circumstances, like if your model keeps crashing.
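
To make the ball-on-a-hill picture concrete, here is a minimal sketch of plain gradient descent on two made-up one-dimensional curves. Nothing here is Faceswap code: the curves, the starting points, and the learning rates are assumptions chosen only so the effects are easy to see.

```python
def descend(grad, x, learning_rate, steps):
    """Plain gradient descent: repeatedly step against the slope."""
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x

# Smooth bowl (hypothetical): loss = x**2, slope = 2*x, minimum at x = 0.
def bowl_grad(x):
    return 2.0 * x

# Bumpy curve (hypothetical): a shallow dip near x = 0.93 and a deeper,
# global dip near x = -1.06.
def bumpy_loss(x):
    return x**4 - 2.0 * x**2 + 0.5 * x

def bumpy_grad(x):
    return 4.0 * x**3 - 4.0 * x + 0.5

slow = descend(bowl_grad, 5.0, 0.01, 50)   # low rate: still far from 0
good = descend(bowl_grad, 5.0, 0.1, 50)    # moderate rate: very close to 0
wild = descend(bowl_grad, 5.0, 1.1, 50)    # too high: overshoots more each step

stuck = descend(bumpy_grad, 2.0, 0.01, 500)   # rolls into the shallow dip
best = descend(bumpy_grad, -2.0, 0.01, 500)   # rolls into the deep dip

print(slow, good, wild)
print(stuck, bumpy_loss(stuck))
print(best, bumpy_loss(best))
```

On the bowl, the low rate has not arrived after 50 steps, the moderate rate has, and the too-high rate ends up further away than where it started. On the bumpy curve, the same low rate ends in a different dip depending on where the ball starts, and the shallow dip has the higher loss.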

The loss functions are different methodologies of getting that ball to the best lowest point, the global minimum, and the best solution.

All models were trained with DFL-SAE for the balanced A<-->B training.
128 in/out : Eye/Mouth multiplier set to 1 : Extended
I did not try to make the swaps look good, so that all results were on an even keel. I was checking the loss function results, not my ability to make it work.

MSE (mean squared error) is the average of the squared differences between the target and predicted values. This is the most commonly used loss function in neural network training in general, not just for networks that work with images.

MAE (mean absolute error) is the average of the absolute differences between the target and predicted values. It measures the average magnitude of errors in a set of predictions, without considering their direction.

MAE loss is more robust to outliers, but its derivative is not continuous (it jumps at zero error), making it less efficient at finding the solution.
MSE loss is sensitive to outliers, but gives a more stable solution.
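
A small NumPy sketch of the two definitions, using made-up numbers rather than real face images, shows how differently they react to a single badly wrong value. The arrays and names are illustrative assumptions.

```python
import numpy as np

def mse(target, predicted):
    # Mean of the squared differences.
    return np.mean((target - predicted) ** 2)

def mae(target, predicted):
    # Mean of the absolute differences.
    return np.mean(np.abs(target - predicted))

target = np.array([0.0, 0.0, 0.0, 0.0])
close = np.array([0.1, 0.1, 0.1, 0.1])     # every value slightly off
outlier = np.array([0.0, 0.0, 0.0, 2.0])   # one value wildly off

print(mse(target, close), mae(target, close))
print(mse(target, outlier), mae(target, outlier))
```

Going from the "slightly off everywhere" case to the "one outlier" case, MSE rises from about 0.01 to 1.0 (a factor of 100), while MAE rises from about 0.1 to 0.5 (a factor of 5).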

Smoothed Mean Absolute Error (or Huber Loss)
With MAE, the gradient keeps the same magnitude no matter how small the error is, which can leave training bouncing around near the best solution.
With MSE, the gradient shrinks as the loss approaches its best solution, which makes it more precise, but it can get stuck.

Smoothed (Huber) loss can be really helpful in such cases, as it curves around the minimum, which decreases the gradient there. It is also more robust to outliers than MSE, so it combines good properties of both MSE and MAE, although it has its flaws too.
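
Here is a minimal sketch of the textbook Huber loss in NumPy. The `delta` threshold and the helper names are assumptions for illustration; the exact formula Faceswap uses for its "Smoothed" option may differ.

```python
import numpy as np

def huber(target, predicted, delta=1.0):
    # Quadratic (MSE-like) for small errors, linear (MAE-like) for large ones.
    error = np.abs(target - predicted)
    quadratic = 0.5 * error ** 2
    linear = delta * (error - 0.5 * delta)
    return np.mean(np.where(error <= delta, quadratic, linear))

def huber_slope(error, delta=1.0):
    # Slope of the per-value loss with respect to the error:
    # proportional to the error near zero, capped at +/- delta further out.
    return np.clip(error, -delta, delta)

print(huber(np.array([0.0]), np.array([0.5])))   # small error: 0.5 * 0.5**2
print(huber(np.array([0.0]), np.array([3.0])))   # large error: 1.0 * (3.0 - 0.5)
print(huber_slope(np.array([0.1, 5.0, -5.0])))
```

The slope shrinks toward zero near the minimum like MSE, but stays capped for large errors like MAE, so a single outlier cannot dominate the update.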

Log Cosh works mostly like MSE, but is not so strongly affected by the occasional wildly incorrect prediction. It has all the advantages of Huber loss. Log-cosh loss isn't perfect, though: for very large off-target predictions, its gradient and Hessian are still constant.
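
As a rough illustration, log-cosh can be written in a few lines of NumPy. This is a sketch of the general formula, not Faceswap's implementation; the rearranged form is only there to avoid overflow for large errors.

```python
import numpy as np

def log_cosh(target, predicted):
    # log(cosh(e)) rewritten as |e| + log(1 + exp(-2|e|)) - log(2),
    # which is the same value but does not overflow for large |e|.
    error = np.abs(predicted - target)
    return np.mean(error + np.log1p(np.exp(-2.0 * error)) - np.log(2.0))

zero = np.array([0.0])
small = log_cosh(zero, np.array([0.1]))    # close to 0.1**2 / 2, like MSE
large = log_cosh(zero, np.array([10.0]))   # close to 10 - log(2), like MAE

print(small, large)
```

For small errors the value is roughly half the squared error, and for large errors it grows only linearly, which is why one wild prediction hurts less than it would with MSE.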

SSIM is a perception-based model that considers image degradation as perceived change in structural information. Structural information is the idea that the pixels have strong inter-dependencies especially when they are spatially close. These dependencies carry important information about the structure of the objects in the visual scene.
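
To give a feel for what "structural" means, here is a heavily simplified sketch that computes a single SSIM value over a whole image. Real SSIM implementations slide a small window over the image and average the local scores, and a training loss is usually derived from the score (for example 1 - SSIM). The test images and names below are made up for illustration.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    # Standard SSIM formula with the usual constants, but applied once to
    # the whole image instead of to local windows (a simplification).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    structure = (2 * cov + c2) / (var_x + var_y + c2)
    return luminance * structure

# A left-to-right brightness ramp as a stand-in "image".
image = np.tile(np.linspace(0.2, 0.8, 32), (32, 1))
brighter = image + 0.1    # same structure, slightly brighter
inverted = 1.0 - image    # same brightness range, structure reversed

print(global_ssim(image, image))
print(global_ssim(image, brighter))
print(global_ssim(image, inverted))
```

The identical image scores 1, the brighter copy stays close to 1 because its structure is unchanged, and the inverted copy scores below zero because the relationship between neighboring pixels has been reversed.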

Pixel-Grad-diff: I couldn't find much on this.
It seemed to learn very slowly and jump around a lot. I ran this model for another 150K iterations and it didn't improve dramatically. It learns slowly, and I was too impatient to train for another week. It may be useful in some data situations, although I feel the losses above do a better job with images.
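
Going only by the name, one plausible reading is a loss that compares the differences between neighboring pixels rather than the pixel values themselves. The sketch below is that guess written in NumPy. It is an assumption for illustration, not Faceswap's actual implementation, and the function name is hypothetical.

```python
import numpy as np

def gradient_difference(target, predicted):
    # Differences between neighboring pixels, vertically and horizontally.
    t_dy, t_dx = np.diff(target, axis=0), np.diff(target, axis=1)
    p_dy, p_dx = np.diff(predicted, axis=0), np.diff(predicted, axis=1)
    # Penalize mismatched edges, not mismatched pixel values.
    return np.mean(np.abs(t_dy - p_dy)) + np.mean(np.abs(t_dx - p_dx))

image = np.tile(np.linspace(0.2, 0.8, 32), (32, 1))   # brightness ramp
brighter = image + 0.1                                # same edges, brighter
flat = np.full_like(image, 0.5)                       # no edges at all

print(gradient_difference(image, brighter))
print(gradient_difference(image, flat))
```

In this sketch a uniform brightness offset costs essentially nothing, since every pixel-to-pixel difference is unchanged, while a prediction with no edges is penalized.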

Further Reading:
https://youtu.be/QBbC3Cjsnjg
https://juinerurkar.com/understanding-gradient-descent/
https://machinelearningmastery.com/loss ... -networks/
https://towardsdatascience.com/my-first ... f9b01c6925