Ok, these questions are not easy to answer quickly and concisely, but I will do my best. What I post here is heavily simplified, to try to get the basic premise across in an understandable way.
hs1985 wrote: ↑Wed Jun 07, 2023 10:15 pm
Does output size affect the swap quality? Will I get a better swap quality if I use a model that has 256 output size compared to a model with 128 output size?
I'll answer this question first, because it also partly answers some of your questions around input size. The simple answer is 'yes', but with major caveats. The biggest impact that outputting at a higher resolution has on swap quality is that less up-scaling is required to transform the final model output onto the final frame. The less up-sizing you have to do on the model output, the less blurry the final swap will be (blow up a 64px image to 1024px and it will look significantly worse than resizing a 256px image to 1024px).
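To put rough numbers on that, here is the upscale factor each output size would need to reach a 1024px final frame (the 1024px target is just an assumed example, not a fixed Faceswap value):

```python
# Illustrative arithmetic: smaller model output -> bigger blow-up -> blurrier swap.
FINAL_RES = 1024  # assumed size of the face in the final frame

for model_out in (64, 128, 256, 512):
    factor = FINAL_RES // model_out
    print(f"{model_out:>3}px output -> {factor:>2}x upscale to reach {FINAL_RES}px")
```

A 64px output has to be stretched 16x, while a 256px output only needs 4x, which is the difference described above.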
However, resolution is just one aspect. As a general rule of thumb, I'd say that for every doubling of resolution, you need 4x the resources, be that time, VRAM, etc. It is not linear, so generally you will need to make trade-offs in other parts of the model to get to this higher resolution. This will negatively impact the final swap, so it is possible (and likely) that you could create a model with a higher-resolution output that contains less detail, as you've had to scale back other parts of the neural network to accommodate the increased resolution. I could most likely create a model that could output at 1024px and run on 4GB VRAM. It would end up being hugely unsatisfactory, as I would have had to reduce the model complexity so much that you'd probably just get a pink blob out at the end. A high-resolution pink blob, but a pink blob nonetheless.
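The '4x resources per doubling' rule of thumb falls out of simple pixel arithmetic. This is an intuition sketch, not a measured benchmark:

```python
# Pixel count (a rough proxy for per-layer memory and compute) grows with
# the square of the resolution, so each doubling roughly quadruples cost.
BASE = 64  # reference face size in pixels

for res in (64, 128, 256, 512, 1024):
    pixels = res * res
    print(f"{res:>4}px -> {pixels:>9,} pixels ({pixels // (BASE * BASE):>3}x a {BASE}px face)")
```

In practice the real cost curve depends on the architecture, but the quadratic growth of the raw data is why the trade-offs above bite so hard.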
hs1985 wrote: ↑Wed Jun 07, 2023 10:15 pm
If the input size doesn't affect the swap quality, then why do we have different low and high resolution models that accept different input sizes (64 - 512)? What are the advantages of using a model with large input size over a model with small input size?
So, onto your first question. Input size will impact the final swap quality, but not necessarily in the way that you would think. The best way I can explain it is this: think of a deepfaking neural net not as 1 model but as 3.
- An encoder. This is responsible for looking at faces and distilling the information down to an abstract encoding. The simplest way I can describe this abstract encoding is as a coded message which conveys information such as "This face is smiling, the eyes are looking left, the face is looking down, it is lit from the top right".
- The input is a face at your desired input resolution
- The output is an abstract encoding. This is called the 'latent space' or 'latent vector'. It is very small (relatively)
- 2 Decoders. 1 for each identity. The responsibility of the decoders is to take this abstract encoding (latent vector) generated by the encoder and turn it into a reconstructed face image at your desired output size.
- The input is the latent vector generated by the encoder
- The output is the final reconstructed face image
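A shape-level sketch of that 3-part layout, with hypothetical sizes (128px faces, a 512-value latent vector) and the real convolutional layers elided:

```python
INPUT_RES = 128   # assumed input size
OUTPUT_RES = 128  # assumed output size
LATENT_DIM = 512  # assumed bottleneck size

def encoder(face):
    """Distil a face image down to an abstract encoding (the latent vector)."""
    # ...convolutional layers would go here...
    return [0.0] * LATENT_DIM

def make_decoder(identity):
    """One decoder per identity: latent vector -> reconstructed face image."""
    def decoder(latent):
        # ...upsampling layers would go here...
        return [[(0.0, 0.0, 0.0)] * OUTPUT_RES for _ in range(OUTPUT_RES)]
    return decoder

decoder_a = make_decoder("A")
decoder_b = make_decoder("B")

latent = encoder("face pixels")  # shared encoder for both identities
swapped = decoder_b(latent)      # decoding with the *other* identity is the swap
print(len(latent), len(swapped), len(swapped[0]))  # -> 512 128 128
```

The key structural point is that both decoders read from the same shared encoder, which is what makes the swap possible.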
The size of the output of the encoder is, ultimately, controlled by the size of the 'bottleneck'. The bottleneck size is usually fairly uniform regardless of the input size of the model (normally 512 or 1024; think of it as a 1x512px or 1x1024px grayscale image, or a barcode. That is definitely not what it is, it is more complicated than that, but it hopefully helps with internally visualizing it). The job of the encoder is to create the best possible abstract encoding (latent vector/barcode) for the decoders to be able to turn that information into a face.
Therefore it becomes an open question: "how much information is useful for this latent vector?". Given that:
- the decoder's job is to turn this encoded information into a face
- the encoded information is abstract in nature
- the encoded information is provided in a format that is orders of magnitude smaller (in terms of space required to store) than the original face
then you end up with diminishing returns at higher resolutions, in terms of the information that the neural network will deem important. It is not just about resolution; it is also about having an encoder that can distil the most important information down to the latent vector, whilst discarding information that is not important for the reconstruction. This is where the encoder's design and complexity come into it.
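To see why the latent vector is "orders of magnitude smaller", compare storage sizes. The figures below assume a 512-value float32 latent vector and 8-bit RGB input faces; both are illustrative assumptions, not Faceswap's exact internals:

```python
LATENT_BYTES = 512 * 4  # 512 float32 values, 4 bytes each (~2 KB)

for res in (64, 128, 256, 512):
    face_bytes = res * res * 3  # 8-bit RGB pixels
    ratio = face_bytes / LATENT_BYTES
    print(f"{res:>3}px face is {ratio:>5.0f}x the size of the latent vector")
```

Everything the decoder knows about a given frame has to squeeze through those ~2 KB, which is why encoder quality matters more than raw input resolution.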
So to answer your second question 'What are the advantages of using a model with large input size over a model with small input size?'
I can only give you the unsatisfactory answer of, 'it depends, but resolution is unlikely to be the most important thing'.