So this is what I have gathered on sources.
First, I don't see much use in going over 10K pics. You can do more, but the benefit is debatable, and someone will want to argue. I understand this is just your example number.
This isn't exactly scientific, just generally how I think about it.
The NN is trying to pick out arbitrary features to learn from. Individually, you may not recognize those features by eye. Once it has learned them, it can put them together and guess well for things it hasn't seen.
Let's say you're trying to replace Ford, actor "A", in just one 5-minute clip from one movie.
I would pull approx. 40% of my faces from that movie (or from the franchise).
I would also extract from that specific clip at a higher rate, adding another 20%.
Then I would round out the rest (the other 30 to 40%) from any other source(s) of him at about the same age.
Empire Strikes Back, or whatever. 5K - 9K should be enough.
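To put rough numbers on that split, here is a minimal sketch. The 7000 total and the source names are just my made-up example values (anything in the 5K - 9K range works the same way). The shares are treated as relative weights and normalized, so they don't have to add up to exactly 100. It also shows what "a higher rate" means for the 5-minute clip.

```python
def frame_budget(total_images, weights):
    """Split a total image count across sources in proportion to their weights."""
    weight_sum = sum(weights.values())
    return {name: round(total_images * w / weight_sum) for name, w in weights.items()}


def extraction_fps(images_wanted, clip_seconds):
    """Frames per second you would need to pull to get images_wanted from a clip."""
    return images_wanted / clip_seconds


# Hypothetical example values, not a recommendation.
weights = {"same_movie": 40, "target_clip": 20, "other_sources": 30}
budget = frame_budget(7000, weights)

# The target clip is only 5 minutes long, so its share needs a higher extraction rate.
clip_fps = extraction_fps(budget["target_clip"], 5 * 60)

print(budget)
print(f"target clip extraction rate: {clip_fps:.1f} fps")
```

If the rate comes out higher than the clip's actual frame rate, you simply can't get that many unique frames from it, so lower the share or the total.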
For the "B" actor, who is the replacement, this is where I really work on diversity.
If you are choosing another movie actor, find good quality clips. Get them from wherever. If you have 10 movies, then get 10 sets from 10 movies. Vary the lighting, expressions, etc.
When you have control of the "B" actor's data, like when it's yourself, then rotate that camera. I have people start talking while slowly rotating the view around the head: left, right, up, down. Then do it again in outdoor lighting, and in a couple of other lighting situations. Maybe add one of "B" making expressions to parody "A" (imagine Hugh Jackman making an angry Wolverine face). I end up with about 4x 5-minute videos. I will extract 8-12K good images from this.
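For a rough idea of the extraction rate that implies, here is a small sketch. The `keep_ratio` parameter, the file name, and the output folder are my own placeholders. The ffmpeg command uses the standard `fps` video filter, but it is only built as a list here, not run.

```python
def fps_for_target(target_images, video_seconds, keep_ratio=1.0):
    """Extraction rate needed so that target_images remain after culling.

    keep_ratio is a hypothetical fraction of extracted frames you expect to keep
    (1.0 means you keep every frame).
    """
    return target_images / (video_seconds * keep_ratio)


def ffmpeg_extract_cmd(video_path, out_dir, fps):
    """Build (but do not run) an ffmpeg command that dumps PNG frames at a given rate."""
    return ["ffmpeg", "-i", video_path, "-vf", f"fps={fps:.2f}",
            f"{out_dir}/frame_%06d.png"]


total_seconds = 4 * 5 * 60                   # four 5-minute videos
low = fps_for_target(8000, total_seconds)    # lower end of the 8-12K range
high = fps_for_target(12000, total_seconds)  # upper end of the 8-12K range

# Placeholder file and folder names.
cmd = ffmpeg_extract_cmd("b_indoor_01.mp4", "b_frames", high)

print(f"extract at roughly {low:.1f} to {high:.1f} fps")
print(" ".join(cmd))
```

If you expect to throw out blurry or badly detected frames, pass a `keep_ratio` below 1.0 and the required rate goes up accordingly.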
Give the AI a chance to see every unique curve of the nose, wrinkle by the lip, color of the eye, shape of teeth, whatever it latches on to.
I personally use lossless PNG for training. I make giant folders of several gigs from all sources, and pick through them before extract and sort. It's not fast, but it makes me happy.
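If you want to build that one giant folder without frames from different sources overwriting each other, something like this works. It is only a sketch of how I'd do it by hand: the function name and folder layout are made up, and it assumes each source folder has a unique name and the frames sit directly inside it.

```python
import shutil
from pathlib import Path


def gather_pngs(source_dirs, dest_dir):
    """Copy PNG frames from several source folders into one training folder.

    Each copy is prefixed with its source folder name, so identically named
    frames from different sources do not collide and you can still tell
    where an image came from while picking through them.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = 0
    for src in map(Path, source_dirs):
        # Non-recursive, and only matches a lowercase ".png" extension
        # on case-sensitive filesystems.
        for png in sorted(src.glob("*.png")):
            shutil.copy2(png, dest / f"{src.name}_{png.name}")
            copied += 1
    return copied
```

For example, `gather_pngs(["movie_a", "movie_b"], "train_a")` would copy `movie_a/frame_000001.png` to `train_a/movie_a_frame_000001.png`.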