-To reiterate what bryanlyon said, but in more detail: your data is the most important thing. If you are swapping Trump's face onto Nicolas Cage, your Trump data and your Cage data should come from well-lit, clear 1080p videos (or 4K if you can get it), not from low-quality sources like what you might find in their Instagram or Facebook videos.
-In addition to that, remember that we use video dimensions as shorthand for data quality, but what we really mean is "face size". For example, a 720p video that is a super-zoomed close-up of Trump's or Cage's face, where you can see each hair follicle, is better quality data than a 1080p video where they are really far away and their face is tiny. Larger video dimensions just make larger face sizes possible. What actually matters is how many pixels are in the face.
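To make the "face size, not video size" point concrete, here's a tiny sketch. The bounding-box numbers are made up for illustration; they just show how a close-up in a smaller frame can contain far more face pixels than a wide shot in a bigger frame.

```python
# Rough illustration of "face size": the pixel dimensions of the detected
# face itself, independent of the overall video resolution.
# All bounding boxes below are hypothetical example values.

def face_size(bbox):
    """Return (width, height) in pixels of a face bounding box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = bbox
    return (x2 - x1, y2 - y1)

# A 720p close-up where the face fills most of the frame:
closeup_720p = face_size((200, 40, 840, 680))    # 640x640 face

# A 1080p wide shot where the face is tiny:
wide_1080p = face_size((900, 300, 1020, 420))    # 120x120 face

print(closeup_720p, wide_1080p)
```

The 720p close-up gives the model a 640x640 face to learn from; the "higher quality" 1080p wide shot gives it only 120x120.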
-When it comes to the animation aspect: if you want your model to be able to do something, it needs to learn how from your data. I once trained a model on a smaller dataset where the original person was smiling in 90% of the files and emotionless in the other 10%. When I converted it onto a video of someone talking, their head moved around nicely, but their mouth stayed essentially sealed until they showed a tiny bit of teeth, at which point the model slapped a massive grin on them. The problem was that my data never showed the model how to blink, how to move the eyes and eyebrows around, or how to open the mouth to talk without smiling (or with only a tiny smile). Because my data couldn't teach the model those things, it didn't know how to deal with them when converted onto a video that did all of them. The opposite end of the spectrum is possible too: if all your source material is of someone with their eyes closed and mouth open, the model won't know how to open its eyes or close its mouth. This is likely the problem you mentioned in your second post.
-Another thing I found to be super important is the size of the face you will be converting onto versus the resolution of the model you are using. The more you have to stretch the "modeled face" to cover the actual face, the worse it looks. Original, Lightweight, and IAE are all 64x64 models, and you'll get the best results converting onto a final product where the face is roughly 64x64 pixels. That doesn't mean you can't convert a 1080p video, just that it will look best if the face in that 1080p video is really small (like the person is far away) rather than a close-up. So basically: when building a model, use the biggest, highest quality face sizes you can find; then when you convert, convert onto a face that is closest to the model's resolution.
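A quick sketch of the "stretch factor" idea above, using the face sizes from my example picture (the 64x64 model resolution is real for Original/Lightweight/IAE; the helper function and face sizes are just illustrative):

```python
# How far a 64x64 model output must be enlarged to cover the target face.
# The bigger this factor, the softer/blurrier the swapped face looks.

MODEL_RES = 64  # output resolution of Original, Lightweight, and IAE models

def stretch_factor(target_face_px):
    """Enlargement needed for the model output to cover a face of the given size."""
    return target_face_px / MODEL_RES

print(stretch_factor(150))  # ~2.3x enlargement: mild softening
print(stretch_factor(380))  # ~5.9x enlargement: visibly blurry
print(stretch_factor(64))   # 1.0x: ideal, no stretching at all
```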
Here's a pic of the exact same low-resolution 64x64 model converting onto two different sized faces. Both are bigger than 64x64 (I think the left face is around 150x150 while the right is 380x380), but because the 64px model has to stretch less for the left one, it ends up looking better.
Oh, and if your next thought is "I'll just train a higher resolution model then", remember that doubling the resolution does not just double the training time. It's much more than that: the pixel count alone grows with the square of the resolution, so doubling the model size at least quadruples the work, and in practice it's worse because the network itself has to get bigger too.
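The back-of-the-envelope arithmetic for that, counting pixels only (a lower bound: actual training time grows faster still, since layer sizes and VRAM use also scale up):

```python
# Pixel count grows with the square of the resolution, so per-image work
# at least quadruples every time you double the model resolution.

base = 64 * 64
for res in (64, 128, 256):
    pixels = res * res
    print(f"{res}x{res}: {pixels} pixels, {pixels // base}x the work of 64x64")
```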