I saw the recent post comparing the different models, and the quality difference between them – which has been a great help! Since then, I’ve been trying to wrap my head around input and output sizes compared to the resolution of the final video – and I did the rough diagram below to help me decide which sizes I should use for the best results.
My go-to model at the moment is DFL-SAE (on 2x RTX 2070 Supers, 128GB system RAM and a Ryzen 3900X CPU). So far I have only gone up to 192 input/output, and got better results than previously, thanks to the info provided in the aforementioned post. But I am determined to get at least a decent-quality medium close-up shot (if not a perfect one), rather than the blurry result I seem to be stuck with so far...
I’m aware that the quality of the B face source is paramount here – the better the resolution, the more varied the sources we feed it, and the more examples we can provide from all the angles needed, the better. That, combined with which model we use and how long we are prepared to let it train, largely determines the final quality.
Let’s say that I have a good source for training, and I am prepared to let it train for a minimum of 70-100 hours – preferably a week, possibly more (depending on the Face A footage that Face B will be swapped onto: the angles seen in it, how many close-ups or medium shots it contains, and so on). I realise that close-ups are not something Faceswap is good at – unless it is left to run for a very long time, with great training sources and a high-res model.
But in my quest for better results, it suddenly occurred to me: if I were to extract from a 4K source at, say, an input/output of 256, and train that using great sources – but then do the final conversion onto 720p footage – surely those close-ups would look better than if I tried to convert them onto 4K footage?
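For what it's worth, the intuition can be checked with some back-of-the-envelope arithmetic: the model always emits a fixed-size face patch (256px here), which then gets scaled up to cover the face region in the target frame. A quick sketch (not Faceswap code – the 50% face-height figure for a medium close-up is just an assumption for illustration):

```python
# Rough estimate of how much a fixed-size model output must be upscaled
# when pasted back into frames of different resolutions.
# Assumption (hypothetical): a medium close-up face spans ~50% of frame height.

MODEL_OUTPUT = 256    # trained output size in pixels
FACE_FRACTION = 0.5   # assumed fraction of frame height the face occupies

for name, height in [("720p", 720), ("1080p", 1080), ("4K", 2160)]:
    face_px = height * FACE_FRACTION          # face height in the target frame
    upscale = face_px / MODEL_OUTPUT          # how far the 256px output is stretched
    print(f"{name}: face ~{face_px:.0f}px tall, output upscaled ~{upscale:.1f}x")
```

Under that assumption, the 256px output only needs a ~1.4x stretch on 720p, but a ~4.2x stretch on 4K – which would explain why the same model looks noticeably softer on close-ups in higher-resolution footage.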
(There are options for upscaling the footage afterwards: I’ve been dabbling with Topaz Labs’ Video Enhance AI, with mixed results…)
Or is it best to convert onto the 4K footage – keeping both the A and B sides at the same resolution – and downscale the final result to 1080p HD afterwards?
Obviously, if I had a GPU with more VRAM (oh, for an RTX 3090...!!), I’d be going for a model with higher input/output sizes and longer training times to get better results – but I’m trying to stick within my current setup’s limitations without killing my system, AND still get better results...!
So, would this actually work? (To my tiny brain, looking at the diagram below, logically, it seems like this might work - at least, it makes sense to me…!)
Hopefully, my diagram will make more sense of what I am trying to ask…!
Or does it make no difference, ultimately, whether I convert to a 720p shot, or a 4K one, and that the final result will just be purely down to which model I use and how long I let it train for?
(Apologies for the long post - just trying to make sure I had covered everything...!)