Ok, these questions are not easy to answer quickly and concisely, but I will do my best. What I post here is heavily simplified, to try to get the basic premise across in an understandable way.
hs1985 wrote: ↑Wed Jun 07, 2023 10:15 pm
Does output size affect the swap quality? Will I get a better swap quality if I use a model that has 256 output size compared to a model with 128 output size?
I'll answer this question first, because it also partly answers some of your questions around input size. The simple answer is 'yes', but with major caveats. The biggest impact that outputting at a higher resolution has on swap quality is that less up-scaling is required to transform the final model output onto the final frame. The less up-sizing you have to do on the model output, the less blurry the final swap will be (blow up a 64px image to 1024px and it will look significantly worse than resizing a 256px image to 1024px).
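To put rough numbers on that, here is the upscale factor each output size would need to reach a 1024px final frame (the 1024px target is just an assumed example, not a fixed Faceswap value):

```python
# Illustrative arithmetic: smaller model output -> bigger blow-up -> blurrier swap.
FINAL_RES = 1024  # assumed size of the face in the final frame

for model_out in (64, 128, 256, 512):
    factor = FINAL_RES // model_out
    print(f"{model_out:>3}px output -> {factor:>2}x upscale to reach {FINAL_RES}px")
```

A 64px output has to be stretched 16x, while a 256px output only needs 4x, which is the difference described above.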
However, resolution is just one aspect. As a general rule of thumb, I'd say that for every doubling of resolution, you need 4x the resources, be that time, VRAM, etc. It is not linear, so generally you will need to make trade-offs in other parts of the model to get to this higher resolution. This will negatively impact the final swap, so it is possible (and likely) that you could create a model with a higher-resolution output that contains less detail, as you've had to scale back other parts of the neural network to accommodate the increased resolution. I could most likely create a model that could output at 1024px and run on 4GB VRAM. It would end up being hugely unsatisfactory, as I would have had to reduce the model complexity so much that you'd probably just get a pink blob out at the end. A high-resolution pink blob, but a pink blob nonetheless.
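The '4x resources per doubling' rule of thumb falls out of simple pixel arithmetic. This is an intuition sketch, not a measured benchmark:

```python
# Pixel count (a rough proxy for per-layer memory and compute) grows with
# the square of the resolution, so each doubling roughly quadruples cost.
BASE = 64  # reference face size in pixels

for res in (64, 128, 256, 512, 1024):
    pixels = res * res
    print(f"{res:>4}px -> {pixels:>9,} pixels ({pixels // (BASE * BASE):>3}x a {BASE}px face)")
```

In practice the real cost curve depends on the architecture, but the quadratic growth of the raw data is why the trade-offs above bite so hard.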
hs1985 wrote: ↑Wed Jun 07, 2023 10:15 pm
If the input size doesn't affect the swap quality, then why do we have different low and high resolution models that accept different input sizes (64 - 512)? What are the advantages of using a model with large input size over a model with small input size?
So, onto your first question. Input size will impact the final swap quality, but not necessarily in the way that you would think. The best way I can explain it is this: think of a deepfaking neural net not as 1 model but as 3.
- An encoder. This is responsible for looking at faces and distilling the information down to an abstract encoding. The simplest way I can describe this abstract encoding is as a coded message which conveys information such as "This face is smiling, the eyes are looking left, the face is looking down, it is lit from the top right".
- The input is a face at your desired input resolution
- The output is an abstract encoding. This is called the 'latent space' or 'latent vector'. It is very small (relatively)
- 2 Decoders. 1 for each identity. The responsibility of the decoders is to take this abstract encoding (latent vector) generated by the encoder and turn it into a reconstructed face image at your desired output size.
- The input is the latent vector generated by the encoder
- The output is the final reconstructed face image
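A shape-level sketch of that 3-part layout, with hypothetical sizes (128px faces, a 512-value latent vector) and the real convolutional layers elided:

```python
INPUT_RES = 128   # assumed input size
OUTPUT_RES = 128  # assumed output size
LATENT_DIM = 512  # assumed bottleneck size

def encoder(face):
    """Distil a face image down to an abstract encoding (the latent vector)."""
    # ...convolutional layers would go here...
    return [0.0] * LATENT_DIM

def make_decoder(identity):
    """One decoder per identity: latent vector -> reconstructed face image."""
    def decoder(latent):
        # ...upsampling layers would go here...
        return [[(0.0, 0.0, 0.0)] * OUTPUT_RES for _ in range(OUTPUT_RES)]
    return decoder

decoder_a = make_decoder("A")
decoder_b = make_decoder("B")

latent = encoder("face pixels")  # shared encoder for both identities
swapped = decoder_b(latent)      # decoding with the *other* identity is the swap
print(len(latent), len(swapped), len(swapped[0]))  # -> 512 128 128
```

The key structural point is that both decoders read from the same shared encoder, which is what makes the swap possible.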
The size of the output of the encoder is, ultimately, controlled by the size of the 'bottleneck'. The bottleneck size is usually fairly uniform regardless of the input size of the model (normally 512 or 1024; think of it as a 1x512px or 1x1024px grayscale image, or a barcode. That is definitely not what it is, it is more complicated than that, but it hopefully helps with internally visualizing it). The job of the encoder is to create the best possible abstract encoding (latent vector/barcode) for the decoders to be able to turn that information into a face.
Therefore it becomes an open question: "how much information is useful for this latent vector?". Given that:
- the decoder's job is to turn this encoded information into a face
- the encoded information is abstract in nature
- the encoded information is provided in a format that is orders of magnitude smaller (in terms of space required to store) than the original face
then you end up with diminishing returns at higher resolutions, in terms of the information that the neural network will deem important. It is not just about resolution; it is also about having an encoder that can distil the most important information down to the latent vector, whilst discarding information that is not important for the reconstruction. This is where the encoder's design and complexity come into it.
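To see why the latent vector is "orders of magnitude smaller", compare storage sizes. The figures below assume a 512-value float32 latent vector and 8-bit RGB input faces; both are illustrative assumptions, not Faceswap's exact internals:

```python
LATENT_BYTES = 512 * 4  # 512 float32 values, 4 bytes each (~2 KB)

for res in (64, 128, 256, 512):
    face_bytes = res * res * 3  # 8-bit RGB pixels
    ratio = face_bytes / LATENT_BYTES
    print(f"{res:>3}px face is {ratio:>5.0f}x the size of the latent vector")
```

Everything the decoder knows about a given frame has to squeeze through those ~2 KB, which is why encoder quality matters more than raw input resolution.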
So to answer your second question 'What are the advantages of using a model with large input size over a model with small input size?'
I can only give you the unsatisfactory answer of, 'it depends, but resolution is unlikely to be the most important thing'.