NB: This guide was correct at time of writing, but things do change. I will try to keep it updated.
A lot of people get overwhelmed when they start out Faceswapping, and many mistakes are made. Mistakes are good. It's how we learn, but it can sometimes help to have a bit of an understanding of the processes involved before diving in.
In this post I will detail how we train a model. There are several models with many options. I won't cover everything off, but hopefully this will give you enough to make informed decisions of your own. If you have not already generated your face sets and alignments files for training, then stop right now and head over to the Extract Guide to generate them now.
There is quite a lot of background information in this guide. I advise that you familiarize yourself with it all. Machine Learning is a complicated concept, but I have tried to break it down to be as simple to understand as possible. Having a basic understanding about how the Neural Network works, and the kind of data it benefits from seeing will vastly improve your chances of achieving a successful swap.
I will be using the GUI for this guide, but the premise is exactly the same for the cli (all the options present in the GUI are available in the cli).
At a high level, training is teaching our Neural Network (NN) how to recreate a face. Most of the models are largely made up of 2 parts:
Encoder - This has the job of taking a load of faces as an input and "encoding" them into a representation in the form of a "vector". It is important to note that it is not learning an exact representation of every face you feed into it, rather, it is trying to create an algorithm that can be used to later reconstruct faces as closely as possible to the input images.
Decoder - This has the job of taking the vectors created by the encoder and attempting to turn this representation back into faces, as closely matching the input images as possible.
Some models are constructed slightly differently, but the basic premise remains the same.
The NN needs to know how well it is doing encoding and decoding faces. It uses 2 main tools to do this:
Loss - For every batch of faces fed into the model, the NN will look at the face it has attempted to recreate by its current encoding and decoding algorithm and compare it to the actual face that was fed in. Based on how well it thinks it has done, it will give itself a score (the loss value) and will update its weights accordingly.
Weights - Once the model has evaluated how well it has recreated a face it updates its weights. These feed into the Encoder/Decoder algorithms. If it has adjusted its weights in one direction, but feels it has done a worse job of reconstructing the face than previously, then it knows that the weights are moving in the wrong direction, so it will adjust them the other way. If it feels it has improved, then it knows to keep adjusting the weights in the direction that it is going.
The model then repeats this action many, many times, constantly updating its weights based on its loss values and theoretically improving over time, until it reaches a point where you feel it has learned enough to effectively recreate a face, or the loss values stop falling.
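To make the loss/weights loop a little more concrete, here is a toy sketch in Python. It is not Faceswap's code. It assumes a "model" with a single weight that tries to reproduce its input, and it uses a squared error loss purely because the maths is simple. All names are made up for illustration.

```python
import random

rng = random.Random(0)
# A stand-in "batch": 64 values playing the role of faces.
batch = [rng.random() for _ in range(64)]


def loss_for(weight):
    """Mean squared difference between the input and its 'reconstruction'."""
    return sum((weight * x - x) ** 2 for x in batch) / len(batch)


def gradient_for(weight):
    """Direction (and size) in which the loss changes as the weight changes."""
    return sum(2 * (weight * x - x) * x for x in batch) / len(batch)


weight = 0.0          # a deliberately bad starting weight
learning_rate = 0.5
losses = []
for iteration in range(20):
    losses.append(loss_for(weight))                  # score the current attempt
    weight -= learning_rate * gradient_for(weight)   # nudge the weight to reduce loss

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}")
```

A perfect reconstruction here would be a weight of exactly 1. The loop never gets told that, it only ever sees the loss, yet the weight drifts towards 1 as the loss falls.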
Now we have the basics of what a Neural Network does and how it learns to create faces, how does this apply to face swapping? You may have noticed in the above breakdown that this NN learns how to take a load of faces of a person and then reconstruct those faces. This isn't what we want though... we want to take a load of faces and reconstruct someone else's face. To achieve this, our NN does a couple of things:
Shared Encoder - When we train our model, we are feeding it 2 sets of faces. The A set (the original faces that we want to replace) and the B set (the swap faces that we wish to place in a scene). The first step to achieve this is sharing the Encoder for both the A and B set. This way our encoder is learning a single algorithm for 2 different people. This is extremely important, as we will ultimately be telling our Neural Network to take the encodings of one face and decode it to another face. The encoder therefore needs to see, and learn, both sets of faces that we require for the swap.
Switched Decoders - When training the model, we train 2 decoders. Decoder A is taking the encoding vectors and attempting to recreate Face A. Decoder B is taking the encoding vectors and attempting to recreate Face B. When it comes to finally swapping faces, we switch the decoders around, therefore we feed the model Face A, but pass it through Decoder B. As the Encoder has been trained on both sets of faces, the model will encode the input Face of A, but then attempt to reconstruct it from Decoder B, resulting in a swapped face being output from our model.
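The shared encoder / switched decoder idea can be sketched in a few lines of Python. This is only an illustration under heavy assumptions: the "layers" are tiny untrained random matrices, the sizes are invented, and none of the names come from Faceswap itself. The point is just the wiring.

```python
import random


def make_layer(n_in, n_out, rng):
    """A toy fully connected layer: n_out rows of n_in random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def apply_layer(weights, vector):
    """Multiply a weight matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


rng = random.Random(0)
PIXELS, LATENT = 16, 4                        # toy sizes, purely illustrative

encoder = make_layer(PIXELS, LATENT, rng)     # ONE encoder, shared by A and B
decoder_a = make_layer(LATENT, PIXELS, rng)   # learns to rebuild face A
decoder_b = make_layer(LATENT, PIXELS, rng)   # learns to rebuild face B


def reconstruct(face, decoder):
    """Encode a face with the shared encoder, then decode with the given decoder."""
    return apply_layer(decoder, apply_layer(encoder, face))


face_a = [rng.random() for _ in range(PIXELS)]

training_output = reconstruct(face_a, decoder_a)  # training: A goes through Decoder A
swapped = reconstruct(face_a, decoder_b)          # convert: A goes through Decoder B
```

During training each side only ever uses its own decoder. The swap is nothing more than choosing the other decoder at convert time.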
There is some common Machine Learning terminology that you will see when using Faceswap. To make life simpler, a glossary of terms is shown here:
Batch - A batch is a group of faces that are fed through the Neural Network at the same time.
Batch Size - Batch Size is the size of the batch that is fed through the Neural Network at the same time. A batch size of 64 would mean that 64 faces are fed through the Neural Network at once, then the loss and weights update is calculated for this batch of images. Higher batch sizes will train faster, but will lead to higher generalization. Lower batch sizes will train slower, but will distinguish differences between faces better. Adjusting the batch size at various stages of training can help.
Epoch - An epoch is one complete pass of the training data through the Neural Network. E.g. if you have a folder of 5000 faces, 1 Epoch will be when the model has seen all 5000 faces. 2 Epochs will be when the model has seen all 5000 faces twice, and so on. In terms of Faceswap, Epoch is not actually a useful measure. As a model is trained on 2 data sets (side A and side B), unless these datasets are of exactly the same size (very unlikely), it is impossible to calculate one Epoch as it will be different for each side.
Example - An example, in terms of Faceswap, is another name for "Face". It is basically a single face that is passed through the Neural Network. If the model has seen 10 examples, then it has seen 10 faces.
EG/s - This is the number of examples that the Neural Network sees per second, or in terms of Faceswap, the number of faces that the model is processing every second.
Iteration - An iteration is one complete batch processed through the Neural Network. So 10 iterations at batch size 64 would mean that 640 (64 * 10) faces have been seen by the model.
NN - An abbreviation for Neural Network.
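The arithmetic behind these terms is simple enough to write down. The helper names below are my own, and the dataset sizes and timing are just example numbers.

```python
def faces_seen(iterations, batch_size):
    """Total examples processed (per side) after a number of iterations."""
    return iterations * batch_size


def epochs_completed(iterations, batch_size, dataset_size):
    """How many times a dataset of the given size has been seen in full."""
    return faces_seen(iterations, batch_size) / dataset_size


def examples_per_second(batch_size, seconds_per_iteration):
    """EG/s for a given batch size and time taken per iteration."""
    return batch_size / seconds_per_iteration


print(faces_seen(10, 64))                  # 10 iterations at batch size 64
print(epochs_completed(1000, 64, 5000))    # side A with 5000 faces
print(epochs_completed(1000, 64, 3000))    # side B with 3000 faces
print(examples_per_second(64, 0.5))        # half a second per iteration
```

The two epoch figures differ for the same number of iterations, which is exactly why Epoch is not a useful measure here and iterations are used instead.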
It cannot be overstated how important the quality of data is to your model. A smaller model can perform very well with decent data. Similarly, no model will perform well with poor data. At an absolute minimum there should be 500 varied images in each side of your model, however the more data, and the more varied, the better... up to a point. A sane number of images to use is anywhere between 1,000 and 10,000. Adding many more images than this can actually hurt training.
Too many similar images will not help your model. You want as many different angles, expressions and lighting conditions as possible. It is a common misconception that a model is trained for a specific scene. This is "memorization" and is not what you are trying to achieve. You are trying to train the model to understand a face at all angles, with all expressions in all conditions, and swap it with another face at all angles, with all expressions in all conditions. You therefore want to build a training set from as many different sources as possible for both the A and B set.
Varied angles for each side are highly important. A NN can only learn what it sees. If 95% of the faces are looking straight at the camera and 5% are side on, then it will take a very long time for the model to learn how to create side on faces. It may not be able to create them at all as it sees side on faces so infrequently. Ideally you want as even a distribution as possible of face angles, expressions and lighting conditions.
Similarly, it is also important that you have as many matching angles/expressions/lighting conditions as possible between both the A and B sides. If you have a lot of profile images for the A side, and no profile images for the B side, then the model will never be able to perform swaps in profile, as Decoder B will lack the information required to create profile shots.
Training data should generally be unobscured and of a high quality (sharp and detailed). However, it is fine to have some images in the training set that are blurry/partially obscured. Ultimately in the final swap some faces will be blurry/low resolution/obscured, so it is important for the NN to see these types of images too so it can do a faithful recreation.
More detailed information about creating training sets can be found in the Extract guide.
There are several models available in Faceswap and more will be added over time. The quality of each one can be highly subjective so this will provide a brief overview of each (currently) available. Ultimately the model which works best for you can come down to many factors, so there is no definitive answer. There are pros and cons of each, however as stated above, the single most important factor is the quality of data. No model will fix data issues.
You will see mention below of input and output sizes (e.g. 64px input, 64px output). This is the size of the face image that is fed to the model (input) and the size of the face that is generated from the model (output). All faces fed to the models are square, so a 64px image will be 64 pixels wide by 64 pixels high. It is a common misconception that higher resolution inputs will lead to better swaps. Whilst it can help, this is not always the case. The NN is learning how to encode the face into an algorithm and then decode that algorithm again. It only needs enough data to be able to create a solid algorithm. Input resolution and output quality are not directly linked.
It is worth noting that the larger the model, the longer it will take to train. The original model can take anywhere from 12-48 hours to train on an Nvidia GTX 1080. Villain can take over a week on the same hardware. It is commonly thought that a model with double the input size will take twice as long. This is incorrect. It will take at least 4 times as long, and probably longer. This is because a 64px image has 4,096 pixels, whereas a 128px image has 16,384 pixels. This is 4 times as many. Add to this that a model needs to be scaled to handle this increased volume of data, and training time can quickly stack up.
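If you want to check the pixel arithmetic for other sizes, a couple of lines of Python will do it. This only counts pixels. It says nothing about how much larger the model itself has to become, which adds to training time on top.

```python
def pixel_count(side_px):
    """Number of pixels in a square image with the given side length."""
    return side_px * side_px


for side in (64, 128, 256):
    ratio = pixel_count(side) // pixel_count(64)
    print(f"{side}px: {pixel_count(side)} pixels ({ratio}x a 64px image)")
```

Each doubling of the side length multiplies the pixel count by 4, so a 256px image carries 16 times the data of a 64px one.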
Lightweight (64px input, 64px output) - This is an extremely stripped down model, designed to run on GPUs with <=2GB of VRAM. It is not what you would call "production ready", but it enables users with lower end hardware to train a model. On higher end GPUs it will train very quickly, so it can be useful for quickly seeing how well a swap might work prior to moving to a more taxing model.
Original (64px input, 64px output) - The model that started it all. It can still provide great results and is useful for understanding how your dataset quality is really one of the biggest drivers of swap quality.
IAE (64px input, 64px output) - A model with a slightly different structure to the other models. It has a shared Encoder and a shared Decoder, but 3 Intermediate layers (one for A, one for B and one shared) which sit between the Encoder and Decoder. It is structured this way to try to better separate identity. More can be read about this model here: https://github.com/deepfakes/faceswap/pull/251
Dfaker (64/128px input, 128/256px output) - This model leverages some different techniques from the original model, and also focuses on upscaling an input to a higher resolution output. Despite being around for a while, this model still achieves great results, whilst its lack of customization options makes it an easy 'fire and forget' model.
Unbalanced (64-512px input, 64-512px output) - This is a powerful model that has a lot of ways to customize and improve the model but requires a bit more expertise and know-how to get good results. Has arguably been superseded by "RealFace". It is worth noting that this model puts more emphasis on the B Decoder, so that reversing a swap (i.e. swapping B>A rather than A>B) will lead to less satisfactory results.
DFL-H128 (128px input, 128px output) - This model actually uses the exact same encoder and decoder as Original but then uses a 128px input instead of 64px and then tries to compress the image into a representation of the face half as large as Original. The smaller 'latent space' has some downsides in quality vs. Original that negate the larger input size.
DFL-SAE (64-256px input, 64-256px output) - This model contains two different network structures within it, one based on the Original shared encoder/split decoders model and one based on the IAE Model (shared intermediate layers). Has numerous customization options. Gives good detail, but can lead to some identity bleed (that is, some features of A may still be visible in B).
Villain (128px input, 128px output) - Villain is likely the most detailed model but very VRAM intensive and can give sub-par color matching when training on limited sources. Is the source of the viral Steve Buscemi/Jennifer Lawrence Deepfake. As this model does not have any customization options (beyond a low memory variant) it is a decent choice if you want a higher resolution model without having to adjust any settings.
Realface (64-128px input, 64-256px output) - The successor to the Unbalanced model. Takes learnings from that model and Dfaker, whilst looking to develop them further. This model is highly customizable, but it's best to tweak the options when you have some idea of what you are doing and what impact the settings will have. As with the unbalanced model this model puts more emphasis on the B Decoder, so that reversing a swap (i.e. swapping B>A rather than A>B) will lead to less satisfactory results.
Dlight (128px input, 128-384px output) - A higher resolution model based on the dfaker variant, focusing on upscaling the faces, with custom upscalers. This is the newest model and is very easily configurable.
Ok, you've chosen your model, let's get Training! Well, hold up there. I admire your eagerness, but you are probably going to want to set some model specific options first. I will be using the GUI for this, but the config file (if using the command line) can be found in your faceswap folder at the location faceswap/config/train.ini.
I won't go into each individual model's options as these are varied and keeping it updated for new models will be tough, but I will give an overview of some of the more common options. We will focus more on the global options which apply to all models. All of the options have tooltips, so hover over an option to get more information about what it does.
To access the model configuration panel, go to Settings > Configure Settings... or select the Train Settings shortcut to be taken straight to the correct place:
These are options that apply to all models and are split into Global model options and Loss options.
The global options can be accessed by selecting the "Model" node. All of the options on this page, with the exception of Learning Rate, Convert Batchsize and Allow Growth only take effect when creating a new model. Once you start training a model, the settings chosen here are "locked" to that model, and will be reloaded whenever you resume training, regardless of what is configured here.
Options that apply to the face being fed into the model
Centering - The face to be trained on will be cropped from your extraction image.
Legacy is the traditional training method, but it tends to crop fairly close into the face (chopping off the forehead).
Face zooms out the face a bit and re-centers closer to the center of the head to better catch more angles. If you are training on a model with an output size of less than 128px, then you should probably select Legacy, otherwise select the centering that will work best for your project.
The below image shows an example of an extracted image. The yellow box shows the location of the Legacy centered crop and the green box shows the location of the Face centered crop.
Coverage - This is the amount of the source image that will be fed into the model. A percentage of the image is cropped from the center by the amount given. The higher the coverage percentage, the more of the face will be fed in. An illustration of the amount of image that is cropped is shown below.
The left-hand image is with Legacy centering and the right-hand image is with Face centering.
Generally you would not want to set Face coverage below 75% as it centers the face differently from Legacy and may lead to issues. Any value above 62.5% is fine for Legacy.
Whilst, intuitively, it may seem that higher coverage is always better, this is not actually the case. It is a trade-off. Whilst a higher coverage will mean more of the face gets swapped, the input size of the model always remains the same, so the resulting swap will likely be less detailed, as more information needs to be packed into the same size image. To illustrate, below is an extreme example of the same image with 62.5% coverage and 100% coverage, both sized to 32px. As you can see the 100% coverage image contains far less detail than the 62.5% version. Ultimately the right choice for this option is up to you:
Face centered faces are already more zoomed out than Legacy centered faces, so will collect less detail. It is best to use Legacy centering for models with less than 128px output.
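The coverage trade-off can be put into numbers. The sketch below assumes a 256px extracted image and a simple crop from the center. Both are assumptions for illustration, and Faceswap's actual cropping code may well differ. The function names are my own.

```python
def coverage_crop_box(image_size, coverage_percent):
    """Return (left, top, right, bottom) of a centered square crop."""
    crop_size = round(image_size * coverage_percent / 100)
    offset = (image_size - crop_size) // 2
    return offset, offset, offset + crop_size, offset + crop_size


def source_pixels_per_input_pixel(image_size, coverage_percent, input_size):
    """How many source pixels get squeezed into each pixel fed to the model."""
    crop_size = round(image_size * coverage_percent / 100)
    return (crop_size / input_size) ** 2


for coverage in (62.5, 100):
    box = coverage_crop_box(256, coverage)
    squeeze = source_pixels_per_input_pixel(256, coverage, 32)
    print(f"{coverage}% coverage: crop box {box}, {squeeze:.0f} source pixels per input pixel")
```

With these example numbers, 100% coverage has to pack more than twice as many source pixels into each input pixel as 62.5% coverage does, which is where the loss of detail comes from.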
Options that pertain to the optimizer.
The optimizer controls how the model learns and the rate it will learn at.
Optimizer - The learning method to use. Leaving this at the default "Adam" is probably the best option, but you can feel free to experiment with different optimizers. It's beyond the scope to describe each method listed, but there is plenty of information on the internet.
Learning Rate - This should generally be left alone, except in instances where the model is collapsing (all the images change to a solid color block, and the loss spikes to a high level and never recovers). Unlike the other parameters on this page, this value can be adjusted for existing models.
The learning rate dictates how far weights can be adjusted up or down at each iteration. Intuition would say a higher learning rate is better, but this is not the case. The model is trying to learn to get to the lowest loss value possible. A learning rate set too high will constantly swing above and below the lowest value and will never learn anything. Set the learning rate too low and the model may hit a trough and think it has reached its lowest point, and will stop improving.
Think of it as walking down a mountain. You want to get to the bottom, so you should always be going down. However, the way down the mountain is not always downhill, there are smaller hills and valleys on the way. The learning rate needs to be high enough to be able to get out of these smaller valleys, but not so high that you end up on top of the next mountain.
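Here is a toy illustration of the "swinging" behaviour in Python. It uses plain gradient descent on a single bowl-shaped loss (loss = x squared, lowest point at 0), which is far simpler than what an optimizer like Adam does on a real model, so treat the specific numbers as illustrative only. Because this toy loss has no small valleys, it cannot show the "stuck in a trough" case, only how slowly a very low rate moves.

```python
def descend(learning_rate, start=5.0, steps=50):
    """Run plain gradient descent on loss(x) = x**2 and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2 * x                 # derivative of x**2
        x -= learning_rate * gradient    # step against the gradient
    return x


print("too high :", descend(1.0))      # jumps from one side of the bowl to the other
print("sensible :", descend(0.1))      # settles very close to the lowest point
print("very low :", descend(0.0001))   # has barely moved after the same number of steps
```

With the rate set too high, every step overshoots the bottom by exactly as much as it started out, so after 50 steps it is no closer than when it began.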
Options for when the model is used for converting
These options only impact the convert process, not the training process. As such, they can be changed for existing models.
Convert Batchsize - The number of faces to feed through the model during conversion. You don't generally need to change this unless you are running out of VRAM when running convert.
Options that apply to initializing your model.
As discussed in the Training overview the model has weights which get updated at the end of each iteration. Initialization is the process by which these weights are first set. You can see this as giving the model a helping hand getting started. As we know what our NN is going to be used for, we can set these weights at values which will help the model have a jump start in life.
The default method for initialization is "he_uniform". This draws samples from a uniform distribution. It is not this guide's goal to go into different initialization methods, and what they mean, but this default can be overridden by the options exposed within this section.
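For the curious, this is roughly what a "he_uniform" draw looks like in plain Python. The limit of sqrt(6 / fan_in) is the standard definition of He uniform initialization. The function name, the layer sizes and the list-of-lists layout are my own choices for the sketch, not how Faceswap or its backend stores weights.

```python
import math
import random


def he_uniform(fan_in, fan_out, rng):
    """Draw a fan_out x fan_in weight matrix from U(-limit, limit)."""
    limit = math.sqrt(6.0 / fan_in)   # fan_in = number of inputs to the layer
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]


rng = random.Random(0)
weights = he_uniform(fan_in=128, fan_out=64, rng=rng)
limit = math.sqrt(6.0 / 128)
print(f"every weight lies between {-limit:.3f} and {limit:.3f}")
```

The more inputs a layer has, the narrower the range the starting weights are drawn from.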
It should be noted that some models set their initialization method internally for some layers of the model, so these layers will not be impacted by this setting. However, for layers and models where this has not been explicitly set, the initializer will be changed to the selected option.
Both of the existing initializers can be used together (i.e. they can both be enabled with no ill effects). I tend to enable them both.
ICNR Init - This initializer is only applied to upscale layers. Standard initialization can lead to "checkerboard" artefacts in the output images when they get upscaled within a NN. This initializer seeks to prevent these artefacts. More information can be read about this method in this paper: https://arxiv.org/abs/1707.02937
Conv Aware Init - Convolutional Aware Initialization is applied to all convolutional layers within the model. The premise behind this initializer is that it takes into account the purposes of the convolutional network and initializes weights accordingly. Theoretically this leads to higher accuracy, lower loss and faster convergence. More about this initializer can be read in the following paper: https://arxiv.org/abs/1702.06295
NB: This initializer can take more VRAM when it starts, so it is advised to start with a lower batch size, start the model, then restart the model with your desired batch size.
NB: This initializer won't run with multi-gpu mode enabled, so if training with multiple GPUs, you should commence training on 1 GPU, stop the model, then continue with multi-gpu enabled.
Options that apply to layers within the model
The options here apply to some of the layers used within the model.
Reflect Padding - Some models, notably Villain, and to a lesser extent DFL-SAE, have a notable 'gray box' around the edge of the swap area in the final swap. This option changes the type of padding used in the convolutional layers to help mitigate this artefact. I only recommend enabling it for those two models, otherwise I would leave this off.
Allow Growth - [Nvidia Only] Only enable this if you receive persistent errors that cuDNN couldn't be initialized. This happens to some people with no real pattern. This option allocates VRAM as it is required, rather than allocating all at once. Whilst this is safer, it can lead to VRAM fragmentation, so leave it off unless you need it. This option can be changed for existing models.
Mixed Precision - [Nvidia Only] Optimizes some calculations to be able to load larger models and use larger batch sizes. On RTX+ cards this can also speed up training. Enable it if you need it.
Loss - Options to control the type and amount of Loss to use. Access these options by expanding the "Global" node and selecting "Loss":
Just like the Global settings, most of these options become "locked" to a model once you start training. This is with the exception of the eye and mouth multipliers, which can be adjusted for existing models.
The loss functions to use.
There are various different ways to calculate Loss, or for a NN to discern how well it is doing at training a model. I won't go into details of each of the functions available, as it would be a somewhat lengthy process and there is plenty of information about each of these functions on the internet.
Loss Function - The most popular loss methods to use are MAE (Mean Absolute Error) and SSIM (Structural Similarity). My personal preference is to use SSIM.
Mask Loss Function - The method to use if you intend to learn a mask. The default Mean Squared Error (MSE) is fine.
L2 Reg Term - If using a structural similarity loss (e.g. SSIM), then an L2 Regularizer should be applied. This option only affects compatible losses and should be left at the default setting unless you know what you are doing.
Eye Multiplier - The amount of "weight" to give to the eye area. This option can only be used if "Penalized Mask Loss" has been enabled. A value of 2 will mean that the eyes are given twice the importance of the rest of the face. This can help learn fine details, but setting this too high can lead to pixelization. As this setting can be adjusted for existing models, issues can generally be trained out.
Mouth Multiplier - The amount of "weight" to give to the mouth area. This option can only be used if "Penalized Mask Loss" has been enabled. A value of 2 will mean that the mouth is given twice the importance of the rest of the face. This can help learn fine details, but setting this too high can lead to pixelization. As this setting can be adjusted for existing models, issues can generally be trained out.
Penalized Mask Loss - This option dictates whether areas of the image that fall outside of the face area should be given less importance than areas that fall inside the face area. This option should always be enabled.
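To give a feel for what "twice the importance" means, here is a toy weighted loss in Python. It assumes the multiplier is simply applied to the per-pixel error inside the eye area before averaging. That is my assumption for illustration. The real loss code may weight things differently, and all the names here are made up.

```python
def weighted_mae(errors, eye_mask, eye_multiplier):
    """Mean absolute error where pixels inside the eye mask count for more."""
    total = 0.0
    count = 0
    for error_row, mask_row in zip(errors, eye_mask):
        for error, in_eye in zip(error_row, mask_row):
            weight = eye_multiplier if in_eye else 1.0
            total += abs(error) * weight
            count += 1
    return total / count


# A 4x4 "face" where every pixel is wrong by the same amount ...
errors = [[1.0] * 4 for _ in range(4)]
# ... and a 2x2 block in the top-left corner marked as the eye area.
eye_mask = [[1, 1, 0, 0],
            [1, 1, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0]]

print(weighted_mae(errors, eye_mask, eye_multiplier=1))  # no extra weight
print(weighted_mae(errors, eye_mask, eye_multiplier=2))  # eyes count double
```

The same mistakes produce a higher loss when they fall in the weighted area, so the model is pushed harder to fix them there.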
Options that apply to training with a mask.
Setting a mask is a way of indicating which area of an image is important. In the below example, the red area is "masked out", (ie: it is considered unimportant), whilst the clear area is "masked in" (It is the face so the area we are interested in):
Training with a mask can serve two purposes:
It focuses the training on the face area, forcing the model to provide less importance to the background. This can help the model learn faster, whilst also ensuring that it is not taking up space learning background details that are not important.
The learned mask can be used in the conversion stage. It is arguable whether learned masks, in the current implementation, offer any benefit over using a standard mask when converting, but training with a mask ensures you have the option to use it.
NB: If you are training with a mask, then an alignments file must be provided in each of the input folders for all of the faces in both A and B.
Mask Type - The type of mask to be used for training. To use a mask you must have added the mask you require to the alignments file. You can add/update masks with the mask tool. See viewtopic.php?f=5&t=27#extract for a thorough description of each of the masks.
Mask Blur Kernel - This applies a slight blur to the edge of the mask. In effect it removes the hard edges of the mask and blends it more gradually from face to background. This can help with poorly calculated masks. It's up to you whether you want to enable this or not and what value to use. Defaults should be fine, but you can use the Mask Tool to experiment.
Mask Threshold - This option will not impact alignments based masks (extended, components) as they are binary (IE the mask is either "on" or "off"). For NN based masks, the mask is not binary and has different levels of opacity. This can lead to the mask being blotchy in some instances. Raising the threshold makes parts of the mask that are near transparent, totally transparent, and parts of the mask that are near solid, totally solid. Again, this will vary on a case by case basis.
Learn Mask - As stated earlier, it is arguable whether there are any benefits to learning a mask. Enabling this option will use more VRAM, so I tend to leave it off, but if you want the predicted mask to be available in convert, then you should enable this option.
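The effect of the mask threshold described above can be sketched like this. I am assuming mask values between 0 (transparent) and 1 (solid) and a threshold given as a percentage. Those units, and the exact cut-off rule, are assumptions for the example rather than a description of the real code.

```python
def apply_threshold(mask_values, threshold_percent):
    """Snap near-transparent values to 0 and near-solid values to 1."""
    low = threshold_percent / 100
    high = 1 - low
    result = []
    for value in mask_values:
        if value < low:
            result.append(0.0)      # nearly transparent -> fully transparent
        elif value > high:
            result.append(1.0)      # nearly solid -> fully solid
        else:
            result.append(value)    # everything in between is left alone
    return result


blotchy = [0.02, 0.5, 0.97]
print(apply_threshold(blotchy, 4))
```

Raising the threshold widens the bands at each end that get snapped, which cleans up blotches at the cost of a harder mask.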
These are settings that are specific to each model plugin. You can access these by clicking on the "Model" node:
As mentioned before, I won't cover model specific settings in any detail. These vary from plugin to plugin. However, I will cover some common options that you may see in each of the plugins. As always, each option will have a tooltip which can give you more information.
lowmem - Some plugins have a "lowmem" mode. This enables you to run a stripped down version of the model, taking up less VRAM, but at the cost of worse fidelity.
input size - Some plugins allow you to adjust the size of the input being fed into the model. Inputs are always square, so this is the size, in pixels, of the width and height of the image that is fed into the model. Don't believe that larger input always equals better quality. This is not always the case. There are many other factors that determine whether a model will be of a decent quality. Larger input sizes take considerably more VRAM to process, as doubling the input size quadruples the number of pixels.
output size - Some plugins allow you to adjust the size of the image being generated by the model. Input size and output size do not have to be the same, so some models contain upscalers that return a larger output image than input image.
The section in the configuration settings page is for Trainer, or "Data Augmentation" options. Expand the "Trainer" node and select "Original":
A NN needs to see many, many different images. In order to better learn a face it performs various manipulations on the input images. This is called "data augmentation". As mentioned in the notes, the standard settings will be fine for 99% of use cases, so only change these if you know what impact they will have.
Evaluation - Options for evaluating the status of training.
Preview Images - This is the number of faces that are shown in the preview window for each of the A and B sides of the swap.
Color Augmentation - These augmentations manipulate the color/contrast of faces being fed into the model, to make the NN more robust to color differences.
This is an illustration of what color augmentation does under the hood (you will not see this in your previews/final output, it is just for demonstration purposes):
Color Lightness - Percentage amount that the lightness of the input images is adjusted up and down. Helps to deal with different lighting conditions.
Color AB - Percentage amount that the color is adjusted on the A/B scale of the L*ab color space. Helps the NN to deal with different color conditions.
Color CLAHE Chance - The percentage chance that the image will have Contrast Limited Adaptive Histogram Equalization applied to it. CLAHE is a contrast method that attempts to localize contrast changes. This helps the NN deal with differing contrast amounts.
Color CLAHE Max Size - The maximum "grid size" of the CLAHE algorithm. This is scaled to the input image. A higher value will lead to higher contrast application. This helps the NN deal with differing contrast amounts.
Image Augmentation - These are manipulations that are performed on faces being fed into the model.
Zoom Amount - Percentage amount that the face is zoomed in or out before being fed into the NN. Helps the model to deal with misalignments.
Rotation Range - Percentage amount that the face is rotated clockwise or anticlockwise before being fed into the NN. Helps the model to deal with misalignments.
Shift Range - Percentage amount that the face is shifted up/down, left/right before being fed into the NN. Helps the model to deal with misalignments.
Flip Chance - Chance of flipping the face horizontally. Helps create more angles for the NN to learn from.
Disable Warp - Warping the image is extremely important for the Neural Network to learn properly. However, when you are coming towards the end of a training session, it can help to turn this off to try to bring out finer details.
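As a flavour of how a "chance" based augmentation works, here is a minimal Python sketch of a random horizontal flip. The face is just a small grid of numbers and the function name is my own. The real augmentation pipeline works on full images and does a good deal more than this.

```python
import random


def maybe_flip(face_rows, flip_chance_percent, rng):
    """Mirror the face left-to-right with the given percentage chance."""
    if rng.random() * 100 < flip_chance_percent:
        return [list(reversed(row)) for row in face_rows]
    return [list(row) for row in face_rows]


rng = random.Random(0)
face = [[1, 2, 3],
        [4, 5, 6]]

print(maybe_flip(face, 50, rng))    # flipped roughly half of the time
print(maybe_flip(face, 100, rng))   # always flipped
print(maybe_flip(face, 0, rng))     # never flipped
```

Because the decision is made afresh every time a face is loaded, the model ends up seeing both the original and the mirrored version of the same face over the course of training.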
Once you have your model settings as you want them, hit OK to save the configuration and close the window.
NB: Hitting OK will save the options on all tabs, so make sure you review them carefully. You can hit Cancel to cancel any of your changes or Reset to revert all values to their default setting.
Now you've got your faces in place, you have configured your model, it's time to kick things off!
Head over to the Train tab in the GUI:
This is where we will tell Faceswap where everything is stored, what we want to use, and actually start training.
This is where we will tell Faceswap where the faces are stored, and the location of their respective alignments files (if required)
Input A - This is the location of the folder that contains your "A" faces that you extracted as part of the Extraction process. These are the faces that will be removed from the original scene to be replaced by your swap faces. There should be around 1,000 - 10,000 faces in this folder.
Alignments A - You will require an alignments file for the faces you intend to train on. This is mandatory and the process will fail without this file. It tells the training process how to crop the Extracted Faces into Training Images as well as containing masks and other related data. If the file exists in your faces folder and is named alignments.fsa then it will automatically be picked up by the process. It is imperative that every single face in the faces A folder has an entry in the alignments file, otherwise training will fail. You may need to merge several alignments files. You can find more information about preparing an alignments file for training in the Extract Guide.
Input B - This is the location of the folder that contains your "B" faces that you extracted as part of the Extraction process. These are the faces that will be swapped onto the scene. There should be around 1,000 - 10,000 faces in this folder.
Alignments B - You will require an alignments file for the faces you intend to train on. This is mandatory and the process will fail without this file. It tells the training process how to crop the Extracted Faces into Training Images as well as containing masks and other related data. If the file exists in your faces folder and is named alignments.fsa then it will automatically be picked up by the process. It is imperative that every single face in the faces B folder has an entry in the alignments file, otherwise training will fail. You may need to merge several alignments files. You can find more information about preparing an alignments file for training in the Extract Guide.
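If you want a quick sanity check that each faces folder sits within the suggested 1,000 - 10,000 range before you start, something like the following sketch would do it. The list of image extensions and both function names are my own assumptions, and it does not read or validate the alignments file in any way.

```python
from pathlib import Path

# Assumption: extracted faces are stored as one of these image types.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def count_faces(folder):
    """Count the image files sitting directly inside a faces folder."""
    return sum(
        1
        for item in Path(folder).iterdir()
        if item.is_file() and item.suffix.lower() in IMAGE_SUFFIXES
    )


def face_count_status(count, low=1_000, high=10_000):
    """Compare a face count against the suggested range."""
    if count < low:
        return "below"
    if count > high:
        return "above"
    return "within"
```

For example, `face_count_status(count_faces("faces_a"))` (where `faces_a` is a placeholder path) would return "within" for a folder holding 4,000 faces.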
Options pertaining to the model that you will be training with:
Model Dir - This is where the model files will be saved. You should select an empty folder if you are starting a new model, or an existing folder containing the model files if you are resuming training from a model you have already started.
Trainer - This is the model you are going to train for your swap. An overview of the different models is available above.
Training specific settings:
Batch Size - As explained above, the batch size is the number of images fed through the model at once. Increasing this figure will increase VRAM usage. Increasing the batch size will speed up training up to a certain point. Small batch sizes provide a form of regularization that helps the model generalize. While large batches train faster, batch sizes in the 8 to 16 range likely produce better quality. It is still an open question whether other forms of regularization can replace or eliminate this need.
Iterations - The number of iterations to perform before automatically stopping training. This is really only here for automation, or to make sure training stops after a certain amount of time. Generally you will manually stop training when you are happy with the quality of the previews.
Distributed - [NVIDIA ONLY] - Enable Tensorflow's Mirrored Distribution strategy for training on Multiple GPUs. If you have multiple GPUs in your system, then you can utilize them to speed up training. Do note that this speed up is not linear, and the more GPUs you add, the more diminishing returns will kick in. Ultimately it allows you to train bigger batch sizes by splitting them across multiple GPUs. You will always be bottlenecked by the speed and VRAM of your weakest GPU, so this works best when training on identical GPUs. You can read more about Tensorflow Distribution Strategies here.
No Logs - Loss and model logging is provided to be able to analyse data in TensorBoard and the GUI. Turning this off will mean you do not have access to this data. Realistically, there is no reason to disable logging, so generally this should not be checked.
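The point about being bottlenecked by your weakest GPU when using the Distributed option can be shown with a little arithmetic. The sketch below assumes the batch is split evenly across GPUs, and the function names and the example batch limits are purely illustrative.

```python
def per_gpu_batch(global_batch, gpu_count):
    """Images each GPU handles per iteration, assuming an even split."""
    if gpu_count < 1:
        raise ValueError("at least one GPU is required")
    return global_batch / gpu_count


def max_global_batch(per_gpu_limits):
    """Largest total batch size when every GPU receives an equal share.

    per_gpu_limits holds the largest batch each GPU could fit on its own,
    so the smallest entry (the weakest GPU) sets the share for all of them.
    """
    return min(per_gpu_limits) * len(per_gpu_limits)
```

Two identical cards that can each fit a batch of 16 allow a total batch of 32, whereas pairing a card that fits 16 with one that fits 8 only allows a total of 16, which is why identical GPUs work best.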
Options to schedule saving of model files:
Save Interval - How often the model should be saved out to disk. When a model is being saved, it is not being trained, so you can raise this value to get a slight speed boost on training (i.e. it is not waiting for the model to be written out to disk as often). You probably don't want to raise this too high, as it is basically your "failsafe". If a model crashes during training, then you will only be able to continue from the last save.
NB: If using the Ping Pong memory saving option, you should not increase this above 100 as it is likely to be detrimental to the final quality.
Snapshot Interval - Snapshots are a copy of the model at a point in time. This enables you to roll back to an earlier snapshot if you are not happy with how the model is progressing, or to roll back if your save file has corrupted and there are no backups available. Generally, this should be a high number (the default should be fine in most instances) as creating snapshots can take a while and your model will not be training whilst this process completes.
Options for displaying the training progress preview window:
If you are using the GUI then generally you won't want to use these options. The preview is a window which is popped open that shows the progress of training. The GUI embeds this information in the "Display" panel, so the popped-out window will just show exactly the same information and is redundant. The preview updates at each save iteration.
Preview Scale - The popped-out preview is sized to the size of the training images. If your training images are 256px, then the full preview window will be 3072x1792. This will be too big for most displays, so this option scales the preview down by the given percentage.
Preview - Enable to pop the preview window, disable to not pop the preview window. For GUI use generally leave this unchecked.
Write Image - This will write the preview image to the faceswap folder. Useful if training on a headless system.
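The 3072x1792 figure quoted for Preview Scale comes from simple multiplication. The sketch below assumes a preview grid of 12 columns by 7 rows of faces (the row count is inferred from the quoted figure, since 1792 / 256 = 7) and that the scale is applied as a straight percentage. The function name is hypothetical.

```python
def preview_size(face_px, scale_pct=100, columns=12, rows=7):
    """Return the (width, height) of the preview window in pixels."""
    factor = scale_pct / 100
    return int(face_px * columns * factor), int(face_px * rows * factor)
```

With 256px training images, `preview_size(256)` gives (3072, 1792) and `preview_size(256, 50)` gives (1536, 896), which fits comfortably on a 1080p display.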
Options for generating an optional set of time-lapse images:
The Time Lapse option is an optional feature that enables you to see the progress of training on a fixed set of faces over time. At each save iteration an image will be saved out showing the progress of training on your selected faces at that point in time. Be aware that the amount of space on disk that Time Lapse images take can stack up over time.
Timelapse Input A - A folder containing the faces you want to use for generating the time lapse for the A (original) side. Only the first 14 faces found will be used. You can just point this at your "Input A" folder if you wish to select the first 14 faces from your training set.
Timelapse Input B - A folder containing the faces you want to use for generating the time lapse for the B (swap) side. Only the first 14 faces found will be used. You can just point this at your "Input B" folder if you wish to select the first 14 faces from your training set.
Timelapse Output - The location where you want the generated time lapse images to be saved. If you have provided sources for A and B but leave this blank, then this will default to your selected model folder.
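To get a feel for how quickly time-lapse images stack up on disk, you can estimate it from the number of save iterations. This is only a rough sketch: the function name is made up, and the size of each image depends on your training resolution, so the per-image size has to be supplied by you.

```python
def timelapse_disk_usage(total_iterations, save_interval, bytes_per_image):
    """Estimate the number of time-lapse images and the disk space they use.

    One image is assumed to be written at every save iteration.
    """
    images = total_iterations // save_interval
    return images, images * bytes_per_image
```

As an illustration with assumed numbers, 500,000 iterations with a save interval of 250 and images of around 1.5MB each works out at 2,000 images and roughly 3GB of disk space.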
Data augmentation specific options.
Warp To Landmarks - As explained earlier the data is warped so that the NN can learn how to recreate faces. Warp to Landmarks is a different warping method, that attempts to randomly warp faces to similar faces from the other side (i.e. for the A set, it finds some similar faces from the B set, and applies a warp, with some randomization). The jury is out on whether this offers any benefit/difference over standard random warping.
No Flip - Images are randomly flipped to help increase the amount of data that the NN will see. In most instances this is fine, but faces are not symmetrical so for some targets, this may not be desirable (e.g. a mole on one side of the face). Generally this should be left unchecked, and certainly should be left unchecked when commencing training. Later during the session you may want to check this option (turning flipping off) for some swaps.
No Augment Color - Faceswap performs color augmentation (detailed earlier). This really helps matching color/lighting/contrast between A and B, but sometimes may not be desirable, so it can be disabled here. The impact of color augmentation can be seen below:
Global Faceswap Options:
These options are global to every part of Faceswap, not just training.
Exclude GPUs - If you have multiple GPUs, you can choose to hide them from Faceswap. Hover over the item to see which GPU each index applies to.
Configfile - You can specify a custom train.ini file rather than using the file stored in the faceswap/config folder. This can be useful if you have several different known good configurations that you like to switch between.
Loglevel - The level that Faceswap will log at. Generally, this should always be set to INFO. You should only set this to TRACE if a developer has asked you to, as it will significantly slow down training and will generate huge log files.
NB: The console will only ever log up to the level VERBOSE. Log levels of DEBUG and TRACE are only written to the log file.
Logfile - By default the log file is stored at faceswap/faceswap.log. You can specify a different location here if you wish.
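The split between what reaches the console and what reaches the log file can be pictured with Python's standard logging module. This is a sketch of the general idea rather than Faceswap's own logging setup: the numeric values given to VERBOSE and TRACE, and the `build_logger` helper, are assumptions for the example, and plain streams stand in for the real console and log file.

```python
import io
import logging

VERBOSE = 15  # assumed value, sitting between DEBUG (10) and INFO (20)
TRACE = 5     # assumed value, sitting below DEBUG (10)
logging.addLevelName(VERBOSE, "VERBOSE")
logging.addLevelName(TRACE, "TRACE")


def build_logger(name, file_level, console_stream, file_stream):
    """Create a logger whose console output never goes below VERBOSE."""
    logger = logging.getLogger(name)
    logger.setLevel(TRACE)   # pass everything on; the handlers do the filtering
    logger.propagate = False
    logger.handlers.clear()

    console = logging.StreamHandler(console_stream)
    console.setLevel(max(file_level, VERBOSE))
    logfile = logging.StreamHandler(file_stream)  # stands in for a file handler
    logfile.setLevel(file_level)

    for handler in (console, logfile):
        handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

With the file level set to TRACE, a TRACE or DEBUG message only lands in the file stream, while VERBOSE and INFO messages appear in both.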
Once you have all your settings locked in, review them to make sure you are happy and hit the Train button to start training.
Once you have started training the process will take a minute or two to build the model, pre-load the data and start training. Once it has started, the GUI will enter Training mode, placing a status bar at the bottom and opening up some tabs on the right hand side:
This appears at the bottom right and gives an overview of the current training session. It updates every iteration:
You do not need to pay too close attention to the loss numbers here. For faceswapping they are effectively meaningless. The numbers give an idea of how well the NN thinks it is recreating Face A and how well it is recreating Face B. However we are interested in how well the model is creating Face B from the encodings of Face A. It is impossible to get a loss value for this as there are no real-world examples of a swapped face for the NN to compare against.
Elapsed - The amount of time that has elapsed for this training session.
Session Iterations - The number of iterations that have been processed during this training session.
Total Iterations - The total number of iterations that have been processed for all sessions for this model.
Loss A/Loss B - The loss for the current iteration. NB: There may be multiple loss values (e.g. for face, mask, multi-outputs etc). This value is the sum of all the losses, so the figures here can vary quite widely.
Visualizes the current state of the model. This is a representation of the model's ability to recreate and swap faces. It updates every time the model saves:
The best way to know if a model has finished training is to watch the previews. Ultimately these show what the actual swap will look like. When you are happy with the previews then it is time to stop training. Fine details like eye-glare and teeth will be the last things to come through. Once these are defined, it is normally a good indication that training is nearing completion.
The preview will show 12 columns. The first 6 are the "A" (Original face) side, the second 6 are the "B" (Swap face) side. Each group of 6 columns is split into 2 groups of 3 columns. For each group of 3 columns:
Column 1 is the unchanged face that is fed into the model
Column 2 is the model attempting to recreate that face
Column 3 is the model attempting to swap the face
These will start out as a solid color, or very blurry, but will improve over time as the NN learns how to recreate and swap faces.
The opaque red area indicates the area of the face that is masked out (if training with a mask).
If training with coverage less than 100% you will see the edges of a red box. This indicates the "swap area" or the area that the NN is training on.
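If you find the 12 column layout hard to keep in your head, it can be expressed as a small lookup. The function below is just an illustration of the layout described above, using a zero-based column index counted from the left. The name and the wording of the labels are my own.

```python
ROLES = ("original", "recreated", "swapped")


def describe_column(index):
    """Describe a zero-based preview column index (0 to 11)."""
    if not 0 <= index < 12:
        raise ValueError("the preview has 12 columns")
    side = "A" if index < 6 else "B"   # first 6 columns are A, last 6 are B
    group = (index % 6) // 3 + 1       # group 1 or 2 within that side
    role = ROLES[index % 3]            # position within the group of 3
    return side, group, role
```

So the sixth column from the left, `describe_column(5)`, is ("A", 2, "swapped"): the swap output for the second group of A faces.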
You can save a copy of the current preview image with the save button at the bottom right.
The preview can be disabled by unchecking the "Enable Preview" box at the bottom right.
Graph Tab - This tab contains a graph that shows loss over time. It updates every time the model saves, but can be refreshed by hitting the "Refresh" button:
You do not need to pay too close attention to the numbers here. For faceswapping they are effectively meaningless. The numbers give an idea of how well the NN thinks it is recreating Face A and how well it is recreating Face B. However we are interested in how well the model is creating Face B from the encodings of Face A. It is impossible to get a loss value for this as there are no real-world examples of a swapped face for the NN to compare against.
The loss graph is still a useful tool. Ultimately as long as loss is dropping, then the model is still learning. The rate at which the model learns will decrease over time, so towards the end it may be hard to discern if it is still learning at all. See the Analysis tab in these instances.
Depending on the number of outputs, there may be several graphs available (e.g. total loss, mask loss, face loss etc). Each graph shows loss for that particular output.
You can save a copy of the current graph with the save button at the bottom right.
The graph can be disabled by unchecking the "Enable Preview" box at the bottom right.
Analysis Tab - This tab shows some statistics for the currently running and previous training sessions:
The columns are as follows:
Graphs - Click the blue graph icon to open up a graph for the selected session.
Start/End/Elapsed - The start time, end time and total training time for each session respectively.
Batch - The batch size for each session
Iterations - The total number of iterations processed for each session.
EGs/sec - The number of faces processed through the model per second.
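The EGs/sec figure can be roughly reproduced from the other columns. The sketch below is an assumption about how such a rate could be calculated, not a copy of Faceswap's calculation. In particular, whether both the A and B sides are counted is not stated here, so it is exposed as the `sides` parameter.

```python
def examples_per_second(batch_size, iterations, elapsed_seconds, sides=1):
    """Approximate number of faces processed per second for a session."""
    if elapsed_seconds <= 0:
        return 0.0
    return batch_size * sides * iterations / elapsed_seconds
```

A session with a batch size of 16 that ran 9,000 iterations in one hour works out at 40 faces per second per side.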
When a model is not training, you can open up stats for previously trained models by hitting the open button at the bottom right and selecting a model's state.json file inside the model folder.
You can save the contents of the analysis tab to a csv file with the save icon at the bottom right.
As stated above, the loss graph is useful for seeing if loss is dropping, but it can be hard to discern when the model has been training for a long time. The analysis tab can give you a more granular view.
Clicking the blue graph icon next to your latest training session will bring up the training graph for the selected session.
Select "Show Smoothed", raise the smoothing amount to 0.99, hit the refresh button and then zoom in on the last 5,000 - 10,000 iterations or so:
Now that the graph is zoomed in, you should be able to tell if the loss is still dropping or whether it has "converged". Convergence is when the model is not learning anything any more. In this example you can see that, whilst at first look it may seem the model has converged, on closer inspection, loss is still dropping:
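The same check can be done on raw loss numbers if you have them to hand. The sketch below assumes the smoothing is an exponential moving average (a common way of smoothing loss graphs, but an assumption as far as Faceswap's graph is concerned), and the `tolerance` threshold is an arbitrary value you would need to tune.

```python
def smooth(values, amount=0.99):
    """Exponential moving average; a higher amount means heavier smoothing."""
    if not values:
        return []
    smoothed = []
    last = values[0]
    for value in values:
        last = amount * last + (1 - amount) * value
        smoothed.append(last)
    return smoothed


def still_dropping(losses, window=5000, amount=0.99, tolerance=1e-4):
    """Compare smoothed loss at the start and end of the last `window` iterations."""
    recent = smooth(losses, amount)[-window:]
    if len(recent) < 2:
        return False
    return recent[0] - recent[-1] > tolerance
```

If `still_dropping` returns False over several different window sizes, that is a hint that the model may have converged, although the previews should always have the final say.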
It is possible to stop training at any point just by pressing the Terminate button at the bottom left of the GUI. The model will save its current state and exit.
Models can be resumed by selecting the same settings and pointing the "model dir" folder at the same location as the saved folder. This can be made much easier by saving your Faceswap config, either from the GUI File Menu or from the save icon below the options panel:
You can then just reload your config and continue training.
Faces can be added and removed from your Training folders, but make sure that you stop training before making any changes, and then resume again. If you are using Warp to Landmarks or training with a mask, you will need to make sure your alignments file is updated with the new and removed faces.
Occasionally models corrupt. This can be for any number of reasons, but is evidenced by all of the faces in the preview turning to a solid/garbled color and the loss values spiking to a high number and not recovering.
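The "spiking and not recovering" symptom can also be spotted in the loss numbers themselves. This is a rough heuristic of my own, not something Faceswap does for you: the function name, the size of the recent window and the spike factor are all arbitrary assumptions.

```python
import math
from statistics import median


def looks_corrupted(losses, recent=100, spike_factor=10.0):
    """Flag a loss history whose latest values are invalid or stay far above the earlier baseline."""
    if len(losses) <= recent:
        return False  # not enough history to compare against
    baseline, tail = losses[:-recent], losses[-recent:]
    if any(not math.isfinite(value) for value in tail):
        return True   # NaN or infinite loss
    # "Not recovering": even the lowest recent value is far above the baseline.
    return min(tail) > spike_factor * median(baseline)
```

A brief spike that settles back down is not flagged, because the lowest value in the recent window returns to the baseline.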
Faceswap offers a tool for easily recovering a model. A backup is saved at every save iteration in which the loss values have fallen overall. These backups can be recovered in the following way:
Go to Tools > Restore:
Model Dir - The folder containing your corrupted model should be entered here:
Hit the Restore button. Once restored you should be able to carry on training from your last backup.