Typically you will hear that you need a minimum of 1,000 face images, although more is better, and this post shows you why.
Let's say you have a video you want to swap your friend into.
The original actor will be "A" ("A" for Actor).
You have plenty of Actor A: 30 min of video and 500 good-quality pics.
Your friend is "B", and you have very few images of them. As you'd expect, that isn't good.
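To get a feel for why video is such a good source, here is a rough back-of-the-envelope sketch of how many face crops 30 minutes of footage could yield. The frame-skip interval and the "usable after cleanup" ratio are my own illustrative assumptions, not values from any particular tool:

```python
# Rough estimate: how many usable face crops a source video can yield.
# extract_every_nth and usable_ratio are assumptions for illustration.

def potential_faces(minutes, fps=30, extract_every_nth=5, usable_ratio=0.6):
    """Estimate usable face crops from a video.

    extract_every_nth: sample every Nth frame to skip near-duplicates.
    usable_ratio: assumed fraction of crops that survive cleanup
                  (blurry, occluded, or extreme-profile faces discarded).
    """
    total_frames = minutes * 60 * fps
    extracted = total_frames // extract_every_nth
    return int(extracted * usable_ratio)

# 30 minutes of Actor A footage:
print(potential_faces(30))  # -> 6480
```

Even with conservative assumptions, a half-hour video dwarfs a few hundred still pics, which is why the A side of this experiment has thousands of faces to work with.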
In these example videos I used progressively fewer faces of Actor B for training.
You'll see the quality decline. This should give you an idea of why you need more data (pics & videos) for a better-quality swap.
Bear in mind, I didn't run these models to completion, only far enough to show the decline in quality.
All models were trained ~200K iterations each, and Actor A had ~6K faces extracted in every run:
Actor B with 5400 faces
Actor B with 600 faces
Actor B with 180 faces
Actor B with 78 faces
Actor B with 32 faces
Actor B with 8 faces
As you can see, it gets bad with less data: the model fails to reproduce expressions and head movements, and the result just gets ugly.
What you should take away from this: for reasonably good results, you need MORE DATA.