I'm setting up for training using the unbalanced model with the hope of getting the highest possible resolution for my face. The face swap is being used for a visual effects sequence in a movie where an actor is working alongside herself and I am swapping her face onto her double. I've tested with great results using the Realface model, but now I'm hoping to improve resolution.
If I train on a photoset that is 512 x 512, what setting should I be using for the input size (training) and for the encoder and decoder "complexity"? What specifically do the encoder/decoder values control?
I'm training on 4 Tesla T4 GPUs on an AWS g4dn.12xlarge instance.
It's best to think of it in terms of compression/decompression. The Encoder "compresses" the face into an intermediate form and the Decoder re-creates the original. This is the basic idea of an autoencoder. In faceswap we use two decoders; by switching decoders we switch the output face.
Increasing the encoder dims allows more information about the face to be stored, and stored more intelligently.
Increasing the decoder dims allows a better re-creation of the face from that encoded representation.
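A toy numpy sketch of the shared-encoder / two-decoder idea (this is not faceswap's actual code; the latent size, face size, and use of plain matrices are made up for illustration, and the weights are untrained):

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 128       # size of the compressed intermediate form (illustrative)
FACE_PIXELS = 64 * 64  # a tiny flattened face, just for the sketch

# Shared encoder: compresses any face into a latent vector. Its width
# (the "encoder dims") governs how much information the latent can hold.
W_enc = rng.standard_normal((FACE_PIXELS, LATENT_DIM)) * 0.01

# Two decoders, one per identity. Their width (the "decoder dims") governs
# how faithfully each can re-create a face from the shared latent code.
W_dec_a = rng.standard_normal((LATENT_DIM, FACE_PIXELS)) * 0.01
W_dec_b = rng.standard_normal((LATENT_DIM, FACE_PIXELS)) * 0.01

def encode(face):
    """'Compress' a flattened face into the latent representation."""
    return face @ W_enc

def decode(latent, w_dec):
    """'Decompress' a latent vector back into a flattened face."""
    return latent @ w_dec

face = rng.random(FACE_PIXELS)
latent = encode(face)

# Same latent, different decoder -> different output identity (the swap).
out_a = decode(latent, W_dec_a)
out_b = decode(latent, W_dec_b)
```

The swap happens at that last step: both identities share one encoder, so at convert time you just feed face A's latent through identity B's decoder.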
You actually need both to get good results. I'd suggest tweaking the decoder up a bit while leaving the encoder at default. Remember that increasing the resolution increases training time non-linearly (doubling the resolution is roughly 4x the time). Having 4 T4 GPUs does help with that, though.
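The non-linear cost is just the pixel count: per-image work scales roughly with the number of pixels, so doubling the side length quadruples the pixels:

```python
def pixels(side):
    """Pixel count of a square training image with the given side length."""
    return side * side

# Going from 256 to 512 doubles the side length but quadruples the pixels,
# hence the roughly 4x increase in per-image training time.
ratio = pixels(512) / pixels(256)
print(ratio)  # 4.0
```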