After training with dfl-H128 for 1,000,000 iterations, the model performed well when the subject's face was distant from the camera, but performed terribly when the face was close to the camera. I am curious why this is the case, and I am currently trying to switch to the Dlight model to train a higher-resolution model. Can this improve the situation? Does training a higher-resolution model become necessary when the face is relatively close to the camera? Additionally, what is the impact of the output size on the transformation results?
Re: what is the impact of the output size on the transformation results?
Short answer is, yes. Higher resolution will perform better on close-up shots. If you have a 1080p image (1920 x 1080 pixels) and a model trained with an output size of 128px (for example), then when the face completely fills the frame, that 128px output needs to be enlarged more than 8x. That is not going to give you a great result on almost any image which needs to be blown up that much.
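To make the arithmetic concrete, here is a back-of-envelope sketch (the function name and numbers are illustrative, not part of Faceswap itself) of the upscale factor needed to paste a model's output back over a face in the frame:

```python
# Rough upscale factor when a swapped face is resized back into the frame.
# Assumes the face fills the frame vertically (1080px tall in a 1080p clip).
def upscale_factor(face_height_px: int, model_output_px: int) -> float:
    """Ratio by which the model's output must be enlarged to fit the face."""
    return face_height_px / model_output_px

# A face filling a 1080p frame vs. a 128px model output:
print(upscale_factor(1080, 128))  # 8.4375 -> the output is blown up ~8.4x
# The same face with a 256px model output:
print(upscale_factor(1080, 256))  # 4.21875 -> roughly half the enlargement
```

The further that ratio climbs above 1, the softer and blurrier the pasted-in face will look, which is why close-ups expose a low-resolution model so badly.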
There should be no impact on the transformation results beyond a better final image, so the question becomes: why don't we just always train at higher-res? The answer to that is VRAM and time, pure and simple.
All things being equal, every doubling of resolution roughly quadruples the VRAM required to train the model and quadruples the training time, which, more often than not, makes it unrealistic for the end user to train. However, if you can get to 256px (depending on the model), you should get more satisfactory (though still not perfect) results. The below example is trained at 256px on Phaze-A: