
[Guide] Training in Faceswap

Posted: Sun Sep 29, 2019 10:38 pm
by torzdf
NB: This guide was correct at time of writing, but things do change. I will try to keep it updated.


Introduction

A lot of people get overwhelmed when they start out Faceswapping, and many mistakes are made. Mistakes are good. It's how we learn, but it can sometimes help to have a bit of an understanding of the processes involved before diving in.

In this post I will detail how we train a model. There are several models with many options. I won't cover everything, but hopefully this will give you enough to make informed decisions of your own. If you have not already generated your face sets for training, then stop right now and head over to the Extract Guide to generate them.

There is quite a lot of background information in this guide. I advise that you familiarize yourself with it all. Machine Learning is a complicated concept, but I have tried to break it down to be as simple to understand as possible. Having a basic understanding of how the Neural Network works, and the kind of data it benefits from seeing, will vastly improve your chances of achieving a successful swap.

I will be using the GUI for this guide, but the premise is exactly the same for the cli (all the options present in the GUI are available in the cli).

What is Training?
Overview
At a high level, training is teaching our Neural Network (NN) how to recreate a face. Most of the models are largely made up of 2 parts:
  1. Encoder - This has the job of taking a load of faces as an input and "encoding" them into a representation in the form of a "vector". It is important to note that it is not learning an exact representation of every face you feed into it, rather, it is trying to create an algorithm that can be used to later reconstruct faces as closely as possible to the input images.
  2. Decoder - This has the job of taking the vectors created by the encoder and attempting to turn this representation back into faces, as closely matching the input images as possible.
[Image: why1.png]
Some models are constructed slightly differently, but the basic premise remains the same.
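
To make this concrete, below is a minimal encoder/decoder sketch in Keras. The layer sizes are assumptions for illustration only; this is not Faceswap's actual architecture:

    from tensorflow.keras import layers, models

    IMG = 64  # 64px square faces, as used by the smaller models

    encoder = models.Sequential([
        layers.Input(shape=(IMG, IMG, 3)),
        layers.Conv2D(128, 5, strides=2, padding="same", activation="relu"),
        layers.Conv2D(256, 5, strides=2, padding="same", activation="relu"),
        layers.Flatten(),
        layers.Dense(512),                    # the encoded "vector"
    ], name="encoder")

    decoder = models.Sequential([
        layers.Input(shape=(512,)),
        layers.Dense(16 * 16 * 256, activation="relu"),
        layers.Reshape((16, 16, 256)),
        layers.UpSampling2D(),                # 16px -> 32px
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.UpSampling2D(),                # 32px -> 64px
        layers.Conv2D(3, 3, padding="same", activation="sigmoid"),
    ], name="decoder")

    autoencoder = models.Sequential([encoder, decoder])
    autoencoder.compile(optimizer="adam", loss="mae")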

The NN needs to know how well it is doing encoding and decoding faces. It uses 2 main tools to do this:
  1. Loss - For every batch of faces fed into the model, the NN will look at the face it has attempted to recreate by its current encoding and decoding algorithm and compare it to the actual face that was fed in. Based on how well it thinks it has done, it will give itself a score (the loss value) and will update its weights accordingly.
  2. Weights - Once the model has evaluated how well it has recreated a face it updates its weights. These feed into the Encoder/Decoder algorithms. If it has adjusted its weights in one direction, but feels it has done a worse job of reconstructing the face than previously, then it knows that the weights are moving in the wrong direction, so it will adjust them the other way. If it feels it has improved, then it knows to keep adjusting the weights in the direction that it is going.
[Image: why2.png]
The model then repeats this action many, many times, constantly updating its weights based on its loss values, theoretically improving over time, until it reaches a point where you feel it has learned enough to effectively recreate a face, or the loss values stop falling.
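
As a toy illustration of that "score, then adjust" loop (real networks compute exact weight updates with gradients via backpropagation, rather than by trial and error):

    import numpy as np

    rng = np.random.default_rng(0)
    target = rng.random(8)                  # stand-in for the face to recreate
    weights = np.zeros(8)                   # the model's starting weights

    def loss(w):
        return np.mean(np.abs(w - target))  # how far off is the recreation?

    for iteration in range(500):
        trial = weights + 0.05 * rng.standard_normal(8)  # nudge the weights
        if loss(trial) < loss(weights):     # better score? keep the adjustment
            weights = trial                 # otherwise try a different direction
    print(f"final loss: {loss(weights):.4f}")            # trends towards zero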

Now that we have the basics of what a Neural Network does and how it learns to create faces, how does this apply to face swapping? You may have noticed in the above breakdown that this NN learns how to take a load of faces of a person and then reconstruct those faces. This isn't what we want though... we want to take a load of faces and reconstruct someone else's face. To achieve this, our NN does a couple of things:
  • Shared Encoder - When we train our model, we are feeding it 2 sets of faces. The A set (the original faces that we want to replace) and the B set (the swap faces that we wish to place in a scene). The first step to achieve this is sharing the Encoder for both the A and B set. This way our encoder is learning a single algorithm for 2 different people. This is extremely important, as we will ultimately be telling our Neural Network to take the encodings of one face and decode it to another face. The encoder therefore needs to see, and learn, both sets of faces that we require for the swap.
  • Switched Decoders - When training the model, we train 2 decoders. Decoder A is taking the encoding vectors and attempting to recreate Face A. Decoder B is taking the encoding vectors and attempting to recreate Face B. When it comes to finally swapping faces, we switch the decoders around, therefore we feed the model Face A, but pass it through Decoder B. As the Encoder has been trained on both sets of faces, the model will encode the input Face of A, but then attempt to reconstruct it from Decoder B, resulting in a swapped face being output from our model.
[Image: why3.png]
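Expressed in code, the shared encoder and switched decoders might look something like the sketch below (simplified, assumed shapes; not Faceswap's actual plugin code):

    from tensorflow.keras import layers, models

    IMG = 64

    encoder = models.Sequential([
        layers.Input(shape=(IMG, IMG, 3)),
        layers.Conv2D(128, 5, strides=2, padding="same", activation="relu"),
        layers.Flatten(),
        layers.Dense(512),
    ], name="shared_encoder")

    def make_decoder(name):
        return models.Sequential([
            layers.Input(shape=(512,)),
            layers.Dense(32 * 32 * 64, activation="relu"),
            layers.Reshape((32, 32, 64)),
            layers.UpSampling2D(),                        # 32px -> 64px
            layers.Conv2D(3, 3, padding="same", activation="sigmoid"),
        ], name=name)

    decoder_a = make_decoder("decoder_a")
    decoder_b = make_decoder("decoder_b")

    face = layers.Input(shape=(IMG, IMG, 3))
    autoencoder_a = models.Model(face, decoder_a(encoder(face)))  # rebuilds A faces
    autoencoder_b = models.Model(face, decoder_b(encoder(face)))  # rebuilds B faces

    # Training alternates batches of A faces through autoencoder_a and B faces
    # through autoencoder_b, so the single encoder learns both identities.
    # At convert time the decoders are switched: feed an A face, decode as B.
    swap_a_to_b = models.Model(face, decoder_b(encoder(face)))
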
Terminology
There is some common Machine Learning terminology that you will see when using Faceswap. To make life simpler, a glossary of terms is given here:
  • Batch - A batch is a group of faces that are fed through the Neural Network at the same time.
  • Batch Size - Batch Size is the size of the batch that is fed through the Neural Network at the same time. A batch size of 64 would mean that 64 faces are fed through the Neural Network at once, with the loss and weight update calculated over this batch of images. Higher batch sizes will train faster, but will generalize more, smoothing over differences between faces. Lower batch sizes will train slower, but will distinguish the differences between faces better. Adjusting the batch size at various stages of training can help.
  • Epoch - An epoch is one complete pass of the data through the Neural Network, e.g. if you have a folder of 5000 faces, 1 Epoch will be when the model has seen all 5000 faces; 2 Epochs will be when the model has seen all 5000 faces twice, and so on. In terms of Faceswap, Epoch is not actually a useful measure: a model is trained on 2 datasets (side A and side B), and unless these are exactly the same size (very unlikely), an Epoch will be different for each side.
  • Example - An example, in terms of Faceswap, is another name for "Face". It is basically a single face that is passed through the Neural Network. If the model has seen 10 examples, then it has seen 10 faces.
  • EG/s - This is the number of examples that the Neural Network sees per second, or in terms of Faceswap, the number of faces that the model is processing every second.
  • Iteration - An iteration is one complete batch processed through the Neural Network. So 10 iterations at batch size 64 would mean that 640 (64 * 10) faces have been seen by the model.
  • NN - An abbreviation for Neural Network.
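
A quick worked example tying these terms together (the numbers are hypothetical):

    batch_size = 64
    iterations = 10_000
    examples_seen = batch_size * iterations    # 640,000 faces shown to the model

    faces_side_a, faces_side_b = 5_000, 8_000  # differently sized sets, as is typical
    epochs_a = examples_seen / faces_side_a    # 128 passes through side A
    epochs_b = examples_seen / faces_side_b    # 80 passes through side B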

Training Data

It cannot be overstated how important the quality of data is to your model. A smaller model can perform very well with decent data; conversely, no model will perform well with poor data. At an absolute minimum there should be 500 varied images in each side of your model, however the more data, and the more varied, the better... up to a point. A sane number of images to use is anywhere between 1,000 and 10,000. Adding many more images than this can actually hurt training.

Too many similar images will not help your model. You want as many different angles, expressions and lighting conditions as possible. It is a common misconception that a model is trained for a specific scene. This is "memorization" and is not what you are trying to achieve. You are trying to train the model to understand a face at all angles, with all expressions in all conditions, and swap it with another face at all angles, with all expressions in all conditions. You therefore want to build a training set from as many different sources as possible for both the A and B set.

Varied angles for each side are highly important. A NN can only learn what it sees. If 95% of the faces are looking straight at the camera and 5% are side on, then it will take a very long time for the model to learn how to create side-on faces. It may not be able to create them at all, as it sees side-on faces so infrequently. Ideally you want as even a distribution as possible of face angles, expressions and lighting conditions.

Similarly, it is also important that you have as many matching angles/expressions/lighting conditions as possible between both the A and B sides. If you have a lot of profile images for the A side, and no profile images for the B side, then the model will never be able to perform swaps in profile, as Decoder B will lack the information required to create profile shots.

Training data should generally be unobscured and of high quality (sharp and detailed). However, it is fine to have some images in the training set that are blurry/partially obscured. Ultimately in the final swap some faces will be blurry/low resolution/obscured, so it is important for the NN to see these types of images too so that it can produce a faithful recreation.

More detailed information about creating training sets can be found in the Extract guide.


Choosing a model

There are several models available in Faceswap, and more will be added over time. The quality of each can be highly subjective, so this will provide a brief overview of each model currently available. Ultimately the model which works best for you can come down to many factors, so there is no definitive answer. There are pros and cons to each, however as stated above, the single most important factor is the quality of data. No model will fix data issues.

You will see mention below of input and output sizes (e.g. 64px input, 64px output). This is the size of the face image that is fed to the model (input) and the size of the face that is generated from the model (output). All faces fed to the models are square, so a 64px image will be 64 pixels wide by 64 pixels high. It is a common misconception that higher resolution inputs will lead to better swaps. Whilst it can help, this is not always the case. The NN is learning how to encode the face into an algorithm and then decode that algorithm again. It only needs enough data to be able to create a solid algorithm. Input resolution and output quality are not directly linked.

It is worth noting that the larger the model, the longer it will take to train. The original model can take anywhere from 12-48 hours to train on an Nvidia GTX 1080. Villain can take over a week on the same hardware. It is commonly thought that a model with double the input size will take twice as long. This is incorrect: it will take at least 4 times as long, and probably longer. A 64px image has 4,096 pixels, but a 128px image has 16,384 pixels, which is 4 times as many. Add to this that the model needs to be scaled up to handle the increased volume of data, and training time quickly stacks up.
  • Lightweight (64px input, 64px output) - This is an extremely stripped down model, designed to run on GPUs with <=2GB of VRAM. It is not what you would call "production ready", but it enables users with lower end hardware to train a model. On higher end GPUs it will train very quickly, so it can be useful for quickly seeing how well a swap might work prior to moving to a more taxing model.
  • Original (64px input, 64px output) - The model that started it all. Still can provide great results and useful to understand how your dataset quality is really one of the biggest drivers of swap quality.
  • IAE (64px input, 64px output) - A model with a slightly different structure to the other models. It has a shared Encoder and a shared Decoder, but 3 Intermediate layers (one for A, one for B and one shared) which sit between the Encoder and Decoder. It is structured this way to try to better separate identity. More can be read about this model here: https://github.com/deepfakes/faceswap/pull/251
  • Dfaker (64px input, 128px output) - This model leverages some different techniques from the original model, and also focuses on upscaling an input to a higher resolution output. Despite being around for a while, this model still achieves great results, whilst its lack of customization options makes it an easy 'fire and forget' model.
  • Unbalanced (64-512px input, 64-512px output) - This is a powerful model that has a lot of ways to customize and improve the model but requires a bit more expertise and know-how to get good results. Has arguably been superseded by "RealFace". It is worth noting that this model puts more emphasis on the B Decoder, so that reversing a swap (i.e. swapping B>A rather than A>B) will lead to less satisfactory results.
  • DFL-H128 (128px input, 128px output) - This model actually uses the exact same encoder and decoder as Original but then uses a 128px input instead of 64px and then tries to compress the image into a representation of the face half as large as Original. The smaller 'latent space' has some downsides in quality vs. Original that negate the larger input size.
  • DFL-SAE (64-256px input, 64-256px output) - This model contains two different network structures within it, one based on the Original shared encoder/split decoders model and one based on the IAE Model (shared intermediate layers). Has numerous customization options. Gives good detail, but can lead to some identity bleed (that is, some features of A may still be visible in B).
  • Villain (128px input, 128px output) - Villain is likely the most detailed model but very VRAM intensive and can give sub-par color matching when training on limited sources. Is the source of the viral Steve Buscemi/Jennifer Lawrence Deepfake. As this model does not have any customization options (beyond a low memory variant) it is a decent choice if you want a higher resolution model without having to adjust any settings.
  • Realface (64-128px input, 64-256px output) - The successor to the Unbalanced model. Takes learnings from that model and Dfaker, whilst looking to develop them further. This model is highly customizable, but it's best to tweak the options when you have some idea of what you are doing and what impact the settings will have. As with the unbalanced model this model puts more emphasis on the B Decoder, so that reversing a swap (i.e. swapping B>A rather than A>B) will lead to less satisfactory results.

Model configuration settings

Ok, you've chosen your model, so let's get training! Well, hold up there. I admire your eagerness, but you are probably going to want to set some model specific options first. I will be using the GUI for this, but the config file (if using the command line) can be found in your faceswap folder at the location faceswap/config/train.ini.

I won't go into each individual model's options as these are varied and keeping it updated for new models will be tough, but I will give an overview of some of the more common options. We will focus more on the global options which apply to all models. All of the options have tooltips, so hover over an option to get more information about what it does.

To access the model configuration panel, go to Settings > Configure Train Plugins...:
[Image: train1.png]
  • Global
    These are options that apply to all models:
    [Image: Train2.jpg]
    All of the options on this page, with the exception of Learning Rate, only take effect when creating new models. Once you start training a model, the settings chosen here are "locked" to that model, and will be reloaded whenever you resume training, regardless of what is set here.
    • Face
      Options that apply to the face being fed into the model
      [Image: Face0.jpg]
      • Coverage - This is the amount of the source image that will be fed into the model. A percentage of the image is cropped from the center by the amount given. The higher the coverage percentage, the more of the face will be fed in. An illustration of the amount of image that is cropped is shown below:
        [Image: coverage.jpg]
        Whilst, intuitively, it may seem that higher coverage is always better, this is not actually the case; it is a trade-off. Whilst a higher coverage will mean more of the face gets swapped, the input size of the model always remains the same, so the resulting swap will likely be less detailed, as more information needs to be packed into the same size image. To illustrate, below is an extreme example of the same image with 62.5% coverage and 100% coverage, both sized to 32px. As you can see, the 100% coverage image contains far less detail than the 62.5% version. Ultimately the right choice for this option is up to you (a short cropping sketch follows the illustration below):
        [Image: coverage2.jpg]
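        As promised above, here is a sketch of how a coverage percentage maps to a centre crop (illustrative only; Faceswap's own cropping code may differ, and face.png is a hypothetical aligned, square face image):

          import cv2

          def centre_crop(face, coverage=0.625, model_input_size=64):
              """Crop the central `coverage` fraction of a square face image."""
              size = face.shape[0]
              half_crop = int(size * coverage) // 2
              centre = size // 2
              cropped = face[centre - half_crop:centre + half_crop,
                             centre - half_crop:centre + half_crop]
              return cv2.resize(cropped, (model_input_size, model_input_size))

          face = cv2.imread("face.png")
          model_input = centre_crop(face, coverage=0.625)
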
    • Mask
      Options that apply to training with a mask.
      [Image: Mask1.jpg]
      Setting a mask is a way of indicating which area of an image is important. In the below example, the red area is "masked out" (i.e. it is considered unimportant), whilst the clear area is "masked in" (it is the face, so the area we are interested in):
      [Image: Mask0.png]
      Training with a mask can serve two purposes:
      1. It focuses the training on the face area, forcing the model to give less importance to the background. This can help the model learn faster, whilst also ensuring that it is not taking up space learning background details that are not important.
      2. The learned mask can be used in the conversion stage. It is arguable whether learned masks, in the current implementation, offer any benefit over using a standard mask when converting, but training with a mask ensures you have the option to use it.
      NB: If you are training with a mask, then an alignments file must be provided in each of the input folders for the A and B faces.
      • Mask Type - The type of mask to be used for training. There are currently several masks available, but I will just list the 3 that I think are worth using:
        1. None - i.e. train without a mask. If you are training with low coverage, a mask will bring minimal benefit whilst taking up VRAM which could be better utilized elsewhere. There is no downside to training with a mask at low coverage beyond the extra VRAM required, but its benefits will be limited.
        2. Components - This is a multi-part mask built around the outside points of the face as generated from the alignments data.
          [Image: Mask1.png]
        3. Extended - One of the main problems with the components mask is that it only goes to the top of the eyebrows. The Extended Mask is based off the components mask, but attempts to go further up the forehead to avoid the "double eyebrow" issue (i.e. when both the source and destination eyebrows appear in the final swap). This is the mask I tend to use.
          [Image: Mask2.png]
      • Mask Blur - This applies a slight blur to the edge of the mask. In effect it removes the hard edges of the mask and blends it more gradually from face to background. This can help with poorly calculated masks. It's up to you whether you want to enable this or not.
    • Initialization
      Options that apply to initializing your model.
      [Image: Init0.jpg]
      As discussed in the Training overview the model has weights which get updated at the end of each iteration. Initialization is the process by which these weights are first set. You can see this as giving the model a helping hand getting started. As we know what our NN is going to be used for, we can set these weights at values which will help the model have a jump start in life.

      The default method for initialization is "he_uniform". This draws samples from a uniform distribution. It is not this guide's goal to go into different initialization methods, and what they mean, but this default can be overridden by the options exposed within this section.

      It should be noted that some models set their initialization method internally for some layers of the model, so these layers will not be impacted by this setting. However, for layers and models where this has not been explicitly set, the initializer will be changed to the selected option.

      Both of the existing initializers can be used together (i.e. they can both be enabled with no ill effects). I tend to enable them both.
      • ICNR Init - This initializer is only applied to upscale layers. Standard initialization can lead to "checkerboard" artefacts in the output images when they get upscaled within a NN. This initializer seeks to prevent these artefacts (a sketch of the idea follows the notes below). More information can be read about this method in this paper: https://arxiv.org/abs/1707.02937
      • Conv Aware Init - Convolutional Aware Initialization is applied to all convolutional layers within the model. The premise behind this initializer is that it takes into account the purposes of the convolutional network and initializes weights accordingly. Theoretically this leads to higher accuracy, lower loss and faster convergence. More about this initializer can be read in the following paper: https://arxiv.org/abs/1702.06295

        NB: This initializer can take more VRAM when it starts, so it is advised to start with a lower batch size, start the model, then restart the model with your desired batch size.
        NB: This initializer won't run with multi-gpu mode enabled, so if training with multiple GPUs, you should commence training on 1 GPU, stop the model, then continue with multi-gpu enabled.
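      As a rough sketch of the idea behind ICNR (see the paper linked above): initialize one sub-kernel and tile it across the pixel-shuffle groups, so that neighbouring upscaled pixels start out identical and no checkerboard pattern is imprinted. The shapes and ordering here are simplified assumptions:

        import numpy as np

        def icnr(shape, scale=2, initializer=np.random.standard_normal):
            kh, kw, in_ch, out_ch = shape               # kernel feeding a pixel shuffle
            sub = initializer((kh, kw, in_ch, out_ch // scale ** 2))
            return np.repeat(sub, scale ** 2, axis=-1)  # duplicate across shuffle groups

        weights = icnr((3, 3, 64, 256), scale=2)        # e.g. for a 2x upscale layer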
    • Network
      Options that apply to layers within the model
      [Image: Network0.jpg]
      The options here apply to some of the layers used within the model.
      • Subpixel Upscaling - This is an alternative method for upscaling images within a Neural Network. It does exactly the same job as the default Pixel Shuffler layer, just using different TensorFlow operations. I would recommend leaving this off, as it makes no difference (and will probably be removed in the future).
      • Reflect Padding - Some models, notably Villain, and to a lesser extent DFL-SAE, have a notable 'gray box' around the edge of the swap area in the final swap. This option changes the type of padding used in the convolutional layers to help mitigate this artefact. I only recommend enabling it for those two models, otherwise I would leave this off.
    • Loss
      The loss function to use.
      [Image: Loss0.jpg]
      There are various different ways to calculate Loss, or for a NN to discern how well it is doing at training a model. I won't go into details of each of the functions available, as it would be a somewhat lengthy process and there is plenty of information about each of these functions on the internet.
      • Loss Function - The most popular loss methods to use are MAE (Mean Absolute Error) and SSIM (Structural Similarity). My personal preference is to use SSIM.
      • Penalized Mask Loss - This option dictates whether areas of the image that fall outside of the face area should be given less importance than areas that fall inside the face area. This option should always be enabled. A sketch of the idea follows below.
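      A sketch of the idea behind penalizing loss outside the mask (the 0.1 weighting is an assumption for illustration, not Faceswap's actual figure):

        import numpy as np

        def penalized_mae(y_true, y_pred, mask, outside_weight=0.1):
            """mask is 1.0 inside the face area, 0.0 outside it."""
            weights = mask + outside_weight * (1.0 - mask)     # down-weight background
            return np.mean(weights * np.abs(y_true - y_pred))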
    • Optimizer
      Options that pertain to the optimizer.
      [Image: Optimizer0.jpg]
      The optimizer controls the rate of learning for the Neural Network.
      • Learning Rate - This should generally be left alone, except in instances where the model is collapsing (all the images change to a solid color block, and the loss spikes to a high level and never recovers). Unlike the other parameters on this page, this value can be adjusted for existing models.

        The learning rate dictates how far weights can be adjusted up or down at each iteration. Intuition would say a higher learning rate is better, but this is not the case. The model is trying to get to the lowest loss value possible. A learning rate set too high will constantly swing above and below the lowest value and never learn anything. Set the learning rate too low and the model may hit a trough, think it has reached its lowest point, and stop improving.

        Think of it as walking down a mountain. You want to get to the bottom, so you should always be going down. However, the way down the mountain is not always downhill, there are smaller hills and valleys on the way. The learning rate needs to be high enough to be able to get out of these smaller valleys, but not so high that you end up on top of the next mountain.
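
        The mountain analogy in miniature: gradient descent on the simple curve y = x**2, with three different learning rates:

          def descend(learning_rate, steps=20):
              x = 5.0                       # start partway up the mountain
              for _ in range(steps):
                  gradient = 2 * x          # slope of y = x**2 at the current point
                  x -= learning_rate * gradient
              return x

          print(descend(0.01))   # too low: barely moves from the starting point
          print(descend(0.1))    # sensible: settles close to the minimum at 0
          print(descend(1.1))    # too high: overshoots further each step and diverges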
  • Model
    These are settings that are specific to each model plugin:
    [Image: Model1.png]
    As mentioned before, I won't cover model specific settings in any detail. These vary from plugin to plugin. However, I will cover some common options that you may see in each of the plugins. As always, each option will have a tooltip which can give you more information.
    • lowmem - Some plugins have a "lowmem" mode. This enables you to run a stripped down version of the model, taking up less VRAM, but at the cost of worse fidelity.
    • input size - Some plugins allow you to adjust the size of the input being fed into the model. Inputs are always square, so this is the size, in pixels, of the width and height of the image that is fed into the model. Don't believe that a larger input always equals better quality. This is not always the case. There are many other factors that determine whether a model will be of a decent quality. Higher input sizes require far more VRAM to process (see the scaling note above).
    • output size - Some plugins allow you to adjust the size of the image being generated by the model. Input size and output size do not have to be the same, so some models contain upscalers that return a larger output image than the input image.
  • Trainer
    The final tab in the configuration settings page is for Trainer, or "Data Augmentation" options:
    [Image: Trainer1.jpg]
    A NN needs to see many, many different images. In order to better learn a face it performs various manipulations on the input images. This is called "data augmentation". As mentioned in the notes, the standard settings will be fine for 99% of use cases, so only change these if you know what impact they will have.
    • Evaluation - Options for evaluating the status of training.
      [Image: Trainer2.jpg]
      • Preview Images - This is the number of faces that are shown in the preview window for each of the A and B sides of the swap.
    • Image Augmentation - These are manipulations that are performed on faces being fed into the model (see the sketch after this list).
      [Image: Trainer3.jpg]
      • Zoom Amount - Percentage amount that the face is zoomed in or out before being fed into the NN. Helps the model to deal with misalignments.
      • Rotation Range - Percentage amount that the face is rotated clockwise or anticlockwise before being fed into the NN. Helps the model to deal with misalignments.
      • Shift Range - Percentage amount that the face is shifted up/down, left/right before being fed into the NN. Helps the model to deal with misalignments.
      • Flip Chance - Chance of flipping the face horizontally. Helps create more angles for the NN to learn from.
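      A sketch of these manipulations using OpenCV (the ranges here are illustrative assumptions, not Faceswap's exact defaults, and the input is assumed to be a square face image):

        import random
        import cv2

        def augment(face, zoom=0.05, rotation=10.0, shift=0.05, flip_chance=0.5):
            size = face.shape[0]
            angle = random.uniform(-rotation, rotation)             # degrees
            scale = 1.0 + random.uniform(-zoom, zoom)               # zoom in/out
            matrix = cv2.getRotationMatrix2D((size / 2, size / 2), angle, scale)
            matrix[:, 2] += [random.uniform(-shift, shift) * size,  # shift x
                             random.uniform(-shift, shift) * size]  # shift y
            warped = cv2.warpAffine(face, matrix, (size, size))
            if random.random() < flip_chance:
                warped = cv2.flip(warped, 1)                        # horizontal flip
            return warped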
    • Color Augmentation - These augmentations manipulate the color/contrast of faces being fed into the model, to make the NN more robust to color differences (see the sketch after this list).
      [Image: Trainer4.jpg]
      This is an illustration of what color augmentation does under the hood (you will not see this in your previews/final output, it is just for demonstration purposes):
      [Image]
      • Color Lightness - Percentage amount that the lightness of the input images is adjusted up and down. Helps to deal with different lighting conditions.
      • Color AB - Percentage amount that the color is adjusted on the A/B scale of the L*a*b color space. Helps the NN to deal with different color conditions.
      • Color CLAHE Chance - The percentage chance that the image will have Contrast Limited Adaptive Histogram Equalization applied to it. CLAHE is a contrast method that attempts to localize contrast changes. This helps the NN deal with differing contrast amounts.
      • Color CLAHE Max Size - The maximum "grid size" of the CLAHE algorithm. This is scaled to the input image. A higher value will lead to higher contrast application. This helps the NN deal with differing contrast amounts.
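      A sketch of the colour options above using OpenCV's L*a*b conversion and CLAHE (the amounts are illustrative assumptions):

        import random
        import cv2
        import numpy as np

        def colour_augment(face, lightness=0.3, colour=0.08, clahe_chance=0.5):
            lab = cv2.cvtColor(face, cv2.COLOR_BGR2LAB).astype(np.float32)
            lab[..., 0] *= 1.0 + random.uniform(-lightness, lightness)  # lightness
            lab[..., 1:] += random.uniform(-colour, colour) * 255       # a/b colour
            out = np.clip(lab, 0, 255).astype(np.uint8)
            if random.random() < clahe_chance:
                grid = random.randint(2, 4)                             # CLAHE grid size
                clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(grid, grid))
                l_channel = np.ascontiguousarray(out[..., 0])
                out[..., 0] = clahe.apply(l_channel)                    # contrast on L only
            return cv2.cvtColor(out, cv2.COLOR_LAB2BGR)
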
Once you have your model settings as you want them, hit OK to save the configuration and close the window.

NB: Hitting OK will save the options on all tabs, so make sure you review them carefully. You can hit Cancel to cancel any of your changes or Reset to revert all values to their default setting.


Setting up

Now you've got your faces in place, you have configured your model, it's time to kick things off!

Head over to the Train tab in the GUI:
[Image: Setup1.jpg]
This is where we will tell Faceswap where everything is stored, what we want to use, and actually start training.
  • Faces
    This is where we will tell Faceswap where the faces are stored, and the location of their respective alignments files (if required)
    [Image: setup_faces.jpg]
    • Input A - This is the location of the folder that contains your "A" faces that you extracted as part of the Extraction process. These are the faces that will be removed from the original scene to be replaced by your swap faces. There should be around 1,000 - 10,000 faces in this folder.
    • Alignments A - If you are training with a mask, or using "Warp to Landmarks", then you will require an alignments file for your faces. This will have been generated as part of the extraction process. If the file exists in your faces folder and is named alignments.json then it will automatically be picked up by the process. It is imperative that every single face in the faces A folder has an entry in the alignments file, otherwise training will fail. You may need to merge several alignments files. You can find more information about preparing an alignments file for training in the Extract Guide.
    • Input B - This is the location of the folder that contains your "B" faces that you extracted as part of the Extraction process. These are the faces that will be swapped onto the scene. There should be around 1,000 - 10,000 faces in this folder.
    • Alignments B - If you are training with a mask, or using "Warp to Landmarks", then you will require an alignments file for your faces. This will have been generated as part of the extraction process. If the file exists in your faces folder and is named alignments.json then it will automatically be picked up by the process. It is imperative that every single face in the faces B folder has an entry in the alignments file, otherwise training will fail. You may need to merge several alignments files. You can find more information about preparing an alignments file for training in the Extract Guide.
  • Model
    Options pertaining to the model that you will be training with:
    [Image: setup_model.jpg]
    • Model Dir - This is where the model files will be saved. You should select an empty folder if you are starting a new model, or an existing folder containing the model files if you are resuming training from a model you have already started.
    • Trainer - This is the model you are going to train for your swap. An overview of the different models is available above.
    • Allow Growth - [NVIDIA ONLY]. Enables the TensorFlow GPU `allow_growth` configuration option. This option prevents TensorFlow from allocating all of your GPU VRAM at launch but can lead to higher VRAM fragmentation and slower performance. It should only be enabled if you are having problems with training (specifically, you are getting cuDNN errors).
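    For reference, this corresponds to TensorFlow's GPU `allow_growth` option, which looks along these lines under the hood (a TF1-era sketch; Faceswap sets this for you when the box is ticked):

      import tensorflow as tf

      config = tf.ConfigProto()
      config.gpu_options.allow_growth = True   # allocate VRAM on demand, not all at once
      session = tf.Session(config=config)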
  • Training
    Training specific settings:
    [Image: setup_training.jpg]
    • Batch Size - As explained above, the batch size is the number of images fed through the model at once. Increasing this figure will increase VRAM usage, and will speed up training up to a point. However, small batch sizes provide a form of regularization that helps the model generalize, so whilst large batches train faster, batch sizes in the 8 to 16 range likely produce better quality. It is still an open question whether other forms of regularization can replace or eliminate this need.
    • Iterations - The number of iterations to perform before automatically stopping training. This is really only here for automation, or to make sure training stops after a certain amount of time. Generally you will manually stop training when you are happy with the quality of the previews.
    • GPUs - [NVIDIA ONLY] - The number of GPUs to train on. If you have multiple GPUs in your system, then you can utilize up to 8 of them to speed up training. Do note that this speed-up is not linear, and the more GPUs you add, the more diminishing returns will kick in. Ultimately it allows you to train bigger batch sizes by splitting them across multiple GPUs. You will always be bottlenecked by the speed and VRAM of your weakest GPU, so this works best when training on identical GPUs. You can read more about Keras multi-gpu here
    • No Logs - Loss and model logging is provided to be able to analyse data in TensorBoard and the GUI. Turning this off will mean you do not have access to this data. Realistically, there is no reason to disable logging, so generally this should not be checked.
    • Warp To Landmarks - As explained earlier the data is warped so that the NN can learn how to recreate faces. Warp to Landmarks is a different warping method, that attempts to randomly warp faces to similar faces from the other side (i.e. for the A set, it finds some similar faces from the B set, and applies a warp, with some randomization). The jury is out on whether this offers any benefit/difference over standard random warping.
    • No Flip - Images are randomly flipped to help increase the amount of data that the NN will see. In most instances this is fine, but faces are not symmetrical so for some targets, this may not be desirable (e.g. a mole on one side of the face). Generally this should be left unchecked, and certainly should be left unchecked when commencing training. Later during the session you may want to disable this for some swaps.
    • No Augment Color - Faceswap performs color augmentation (detailed earlier). This really helps matching color/lighting/contrast between A and B, but sometimes may not be desirable, so it can be disabled here. The impact of color augmentation can be seen below:
      [Image: color-aug.jpg]
  • VRAM Savings
    Optimization settings to save VRAM:
    [Image: setup_vram.jpg]
    Faceswap offers a number of VRAM saving optimizations which can enable users to train models that they may not otherwise be able to train. Unfortunately these options are currently only available for Nvidia users. These should be the last port of call. If you can train with a batch size of at least 6-8 without enabling these options, then you should do that first, as these all come with a speed penalty. All of these options can be enabled with each other, to stack savings.
    • Memory Saving Gradients - [NVIDIA ONLY] - MSG is an optimization method that saves VRAM at a computational cost. In best case circumstances it can halve your VRAM requirements for a 20% increase in training time. This is the first option you should try. You can read more about Memory Saving Gradients here.
    • Optimizer Savings - [NVIDIA ONLY] - This can save a fairly significant amount of VRAM by performing some optimization calculations on the CPU rather than the GPU. It does come at a cost of increasing system RAM usage, and slower training speeds. This should be the second option you try.
    • Ping Pong - [NVIDIA ONLY] - AKA "The last resort". This is by far the worst of the VRAM saving options, but it may be just enough to get what you need. This basically splits the model in two and will train 1 half of the model at a time. This can save up to 40% of VRAM but will take over double the amount of time to train the model. This should be the last option you try.
      NB: TensorBoard logging/graphing will not be available with this option enabled.
      NB: The preview will not display until a training cycle has been completed on both sides of the model.
  • Saving
    Options to schedule saving of model files:
    [Image: setup_saving.jpg]
    • Save Interval - How often the model should be saved out to disk. When a model is being saved, it is not being trained, so you can raise this value to get a slight speed boost on training (i.e. it is not waiting for the model to be written out to disk as often). You probably don't want to raise this too high, as it is basically your "failsafe". If a model crashes during training, then you will only be able to continue from the last save.
      NB: If using the Ping Pong memory saving option, you should not increase this above 100 as it is likely to be detrimental to the final quality.
    • Snapshot Interval - Snapshots are a copy of the model at a point in time. This enables you to rollback to an earlier snapshot if you are not happy with how the model is progressing, or to rollback if your save file has corrupted and there are no backups available. Generally, this should be a high number (the default should be fine in most instances) as creating snapshots can take a while and your model will not be training whilst this process completes.
  • Preview
    Options for displaying the training progress preview window:
    [Image: setup_preview.jpg]
    If you are using the GUI then generally you won't want to use these options. The preview is a window which is popped open that shows the progress of training. The GUI embeds this information in the "Display" panel, so the popped-out window will just show exactly the same information and is redundant. The preview updates at each save iteration.
    • Preview Scale - The popped-out preview is sized to the size of the training images. If your training images are 256px, then the full preview window will be 3072x1792 (12 columns and 7 rows of 256px faces). This will be too big for most displays, so this option scales the preview down by the given percentage.
    • Preview - Enable to pop the preview window, disable to not pop the preview window. For GUI use generally leave this unchecked.
    • Write Image - This will write the preview image to the faceswap folder. Useful if training on a headless system.
  • Timelapse
    Options for generating an optional set of time-lapse images:
    [Image: setup_timelapse.jpg]
    The Time Lapse option is an optional feature that enables you to see the progress of training on a fixed set of faces over time. At each save iteration an image will be saved out showing the progress of training on your selected faces at that point in time. Be aware that the amount of space on disk that Time Lapse images take can stack up over time.
    • Timelapse Input A - A folder containing the faces you want to use for generating the time lapse for the A (original) side. Only the first 14 faces found will be used. You can just point this at your "Input A" folder if you wish to select the first 14 faces from your training set.
    • Timelapse Input B - A folder containing the faces you want to use for generating the time lapse for the B (swap) side. Only the first 14 faces found will be used. You can just point this at your "Input B" folder if you wish to select the first 14 faces from your training set.
    • Timelapse Output - The location where you want the generated time lapse images to be saved. If you have provided sources for A and B but leave this blank, then this will default to your selected model folder.
  • Global
    Global Faceswap Options:
    [Image: setup_global.jpg]
    These options are global to every part of Faceswap, not just training.
    • Configfile - You can specify a custom train.ini file rather than using the file stored in the faceswap/config folder. This can be useful if you have several different known good configurations that you like to switch between.
    • Loglevel - The level that Faceswap will log at. Generally, this should always be set to INFO. You should only set this to TRACE if a developer has asked you to, as it will significantly slow down training and will generate huge log files.
      NB: The console will only ever log up to the level VERBOSE. Log levels of DEBUG and TRACE are only written to the log file.
    • Logfile - By default the log file is stored at faceswap/faceswap.log. You can specify a different location here if you wish.
    Once you have all your settings locked in, review them to make sure you are happy and hit the Train button to start training.

Monitoring Training

Once you have started training the process will take a minute or two to build the model, pre-load the data and start training. Once it has started, the GUI will enter Training mode, placing a status bar at the bottom and opening up some tabs on the right hand side:
  • Status Bar
    This appears at the bottom right and gives an overview of the current training session. It updates every iteration:
    [Image: monitor4.jpg]
    You do not need to pay too close attention to the loss numbers here. For faceswapping they are effectively meaningless. The numbers give an idea of how well the NN thinks it is recreating Face A and how well it is recreating Face B. However, we are interested in how well the model is creating Face B from the encodings of Face A. It is impossible to get a loss value for this as there are no real-world examples of a swapped face for the NN to compare against.
    • Elapsed - The amount of time that has elapsed for this training session.
    • Session Iterations - The number of iterations that have been processed during this training session.
    • Total Iterations - The total number of iterations that have been processed for all sessions for this model.
    • Loss A/Loss B - The loss for the current iteration. NB: There may be multiple loss values (e.g. for face, mask, multi-outputs etc). This value is the sum of all the losses, so the figures here can vary quite widely.
  • Preview Tab
    Visualizes the current state of the model. This is a representation of the model's ability to recreate and swap faces. It updates every time the model saves:
    [Image: monitor1.jpg]
    The best way to know if a model has finished training is to watch the previews. Ultimately these show what the actual swap will look like. When you are happy with the previews then it is time to stop training. Fine details like eye-glare and teeth will be the last things to come through. Once these are defined, it is normally a good indication that training is nearing completion.
    • The preview will show 12 columns. The first 6 are the "A" (Original face) side, the second 6 are the "B" (Swap face) side. Each group of 6 columns is split into 2 groups of 3 columns. Within each group of 3:
      [Image: prev1.png]
      • Column 1 is the unchanged face that is fed into the model
      • Column 2 is the model attempting to recreate that face
      • Column 3 is the model attempting to swap the face
    • These will start out as a solid color, or very blurry, but will improve over time as the NN learns how to recreate and swap faces.
    • The opaque red area indicates the area of the face that is masked out (if training with a mask).
    • If training with coverage less than 100% you will see the edges of a red box. This indicates the "swap area" or the area that the NN is training on.
    • You can save a copy of the current preview image with the save button at the bottom right.
    • The preview can be disabled by unchecking the "Enable Preview" box at the bottom right.
  • Graph Tab - This tab contains a graph that shows loss over time. It updates every time the model saves, but can be refreshed by hitting the "Refresh" button:
    [Image: monitor2.jpg]
    As with the loss numbers on the status bar, you do not need to pay too close attention to the figures here. They give an idea of how well the NN thinks it is recreating Face A and Face B, but there is no loss value for the actual swap, as there are no real-world examples of a swapped face for the NN to compare against.

    The loss graph is still a useful tool. Ultimately as long as loss is dropping, then the model is still learning. The rate that the model learns will decrease over time, so towards the end it may be hard to discern if it is still learning at all. See the Analysis tab in these instances.
    • Depending on the number of outputs, there may be several graphs available (e.g. total loss, mask loss, face loss etc). Each graph shows loss for that particular output.
    • You can save a copy of the current graph with the save button at the bottom right.
    • The graph can be disabled by unchecking the "Enable Preview" box at the bottom right.
  • Analysis Tab - This tab shows some statistics for the currently running and previous training sessions:
    [Image: monitor3.jpg]
    • The columns are as follows:
      • Graphs - Click the blue graph icon to open up a graph for the selected session.
      • Start/End/Elapsed - The start time, end time and total training time for each session respectively.
      • Batch - The batch size for each session
      • Iterations - The total number of iterations processed for each session.
      • EGs/sec - The number of faces processed through the model per second.
    • When a model is not training, you can open up stats for previously trained models by hitting the open button at the bottom right and selecting a model's state.json file inside the model folder.
    • You can save the contents of the analysis tab to a csv file with the save icon at the bottom right.
    As stated above, the loss graph is useful for seeing if loss is dropping, but it can be hard to discern when the model has been training for a long time. The analysis tab can give you a more granular view.
    • Clicking the blue graph icon next to your latest training session will bring up the training graph for the selected session.
      [Image: analysis1.jpg]
    • Select "Show Smoothed", raise the smoothing amount to 0.99, hit the refresh button and then zoom in on last 5,000 - 10,000 iterations or so:
      [Image: analysis2.jpg]
    • Now that the graph is zoomed in, you should be able to tell if the loss is still dropping or whether it has "converged". Convergence is when the model is not learning anything any more. In this example you can see that, whilst at first look it may seem the model has converged, on closer inspection, loss is still dropping:
      [Image: analysis3.jpg]
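    For reference, the "Show Smoothed" option applies exponential smoothing along these lines (a sketch; with the weight at 0.99, each point keeps 99% of the running average):

      def smooth(values, weight=0.99):
          smoothed, running = [], values[0]
          for value in values:
              running = weight * running + (1 - weight) * value
              smoothed.append(running)
          return smoothed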

Stopping and Resuming

It is possible to stop training at any point just by pressing the Terminate button at the bottom left of the GUI. The model will save its current state and exit.

Models can be resumed by selecting the same settings and pointing the "model dir" folder at the same location as the saved folder. This can be made much easier by saving your Faceswap config, either from the GUI File Menu or from the save icon below the options panel:
[Image: save1.jpg]
[Image: save2.jpg]
You can then just reload your config and continue training.

Faces can be added and removed from your Training folders, but make sure that you stop training before making any changes, and then resume again. If you are using Warp to Landmarks or training with a mask, you will need to make sure your alignments file is updated with the new and removed faces.


Recovering a Corrupted Model

Occasionally models corrupt. This can be for any number of reasons, but is evidenced by all of the faces in the preview turning to a solid/garbled color and the loss values spiking to a high number and not recovering.

Faceswap offers a tool for easily recovering a model. Backups are taken at each save interval in which the overall loss values have fallen. These backups can be recovered in the following way:
  • Go to Tools > Restore:
    [Image: restore0.png]
  • Model Dir - The folder containing your corrupted model should be entered here:
    [Image: restore1.png]
Hit the Restore button. Once restored you should be able to carry on training from your last backup.