Extracting faces for training and converting
Post by nnifj »

The guide states you should use several sources for your input data, specifically using several other videos of Harrison Ford for your Indiana Jones deepfake. Does anyone know if you get better results from using many, many sources rather than just a few? For example, 100 frames from each of 100 different Harrison Ford movies, totaling 10k frames, vs. 3,333 frames from each of 3 movies, totaling 9,999 frames?

And if more sources are better, would it hypothetically be even better to use 10,000 pictures taken in 10,000 different settings? (That seems unrealistic, but what might be realistic is making JPEG stills as high a percentage as possible of the overall input frames, compared to video.)

Re: Many Sources vs several sources?

Post by abigflea »

So this is what I have gathered on sources.

First, I don't see much use in going over 10K pics. You can use more, but the benefit is debatable, and someone will want to argue :-) I understand this is just your example number.

This isn't exactly scientific, just generally how I think about it.

The NN is trying to match up arbitrary features to learn from. Individually, you may not recognize those features by eye. Once learned, it can put them together and guess well for things it hasn't seen.

Let's say you're trying to replace Ford, actor "A", in just one 5-minute clip from one movie.
I would pull approximately 40% of my faces from that movie (or from the franchise).
I would also extract from that specific clip at a higher rate, adding another 20%.
Then I would round out the other 30% or so from any other source(s) of him at about the same age:
Empire Strikes Back, or whatever. 5K - 9K should be enough.
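
To put rough numbers on that split, here is a minimal Python sketch. The function name `split_budget`, the source labels, and the 8,000-face budget are made up for illustration; only the 40/20/30 shares come from the post. Since those shares add up to 90, the sketch normalizes them so the counts fill the whole budget.

```python
def split_budget(total, shares):
    """Divide a total face count across sources in proportion to their shares."""
    weight_sum = sum(shares.values())
    # Round each share down to a whole number of faces.
    counts = {name: int(total * w / weight_sum) for name, w in shares.items()}
    # Hand any rounding leftovers to the largest share so the counts sum to total.
    largest = max(shares, key=shares.get)
    counts[largest] += total - sum(counts.values())
    return counts


# Shares quoted in the post (they sum to 90, so they are normalized above).
shares = {"same_movie": 40, "target_clip": 20, "other_sources": 30}

# 8000 is an assumed budget inside the 5K - 9K range mentioned.
budget = split_budget(8000, shares)
for source, count in budget.items():
    print(f"{source}: {count} faces")
```

Treat the output as a starting point only; you will still want to pick through the faces by hand.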

For the "B" actor, who is the replacement, this is where I really work on diversity.
If you are choosing another movie actor, find good-quality clips. Get them from wherever. If you have 10 movies, then get 10 sets from 10 movies. Vary the lighting, expressions, etc.

When you have control of the "B" actor's data, like when it's yourself, then rotate that camera. I have people start talking while slowly rotating the view around the head: left, right, up, down. Then do it again in outside lighting, and in a couple of other lighting situations. Maybe add one of "B" making expressions to parody "A" (imagine Hugh Jackman making an angry Wolverine face). I end up with about four 5-minute videos. I will extract 8-12K good images from these.
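
As a rough check on those numbers, this Python sketch estimates how many frames you can skip and still reach a target face count. The helper name `frame_step` and the 30 fps frame rate are assumptions for illustration; check your own footage for the real frame rate.

```python
def frame_step(num_videos, minutes_each, fps, target_faces):
    """Return N such that keeping every Nth frame yields at least target_faces frames."""
    total_frames = int(num_videos * minutes_each * 60 * fps)
    # Floor division keeps the yield at or above the target; never go below 1.
    return max(1, total_frames // target_faces)


# Four 5-minute videos at an assumed 30 fps, aiming for about 10K images.
step = frame_step(num_videos=4, minutes_each=5, fps=30, target_faces=10_000)
print(f"Keep every {step} frame(s)")
```

This counts raw frames, not good faces, so leave yourself some headroom for the ones you throw away.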

Give the AI a chance to see every unique curve of the nose, wrinkle by the lip, color of the eye, shape of teeth, whatever it latches on to.

I personally use lossless PNG for training. I make giant folders of several gigs from all sources and pick through them before extracting and sorting. It's not fast, but it makes me happy.

Re: Many Sources vs several sources?

Post by BCBC »

So abigflea, when you're extracting for training do you pay much attention to fixing alignments manually?

Again, when extracting for convert, do you do much of this? If so, is there anything in particular you're looking for with the alignments, i.e., where specific landmarks should be?

