So I decided to faceswap my female friend onto a popular YouTuber with over 11 million subs (who I won't name).
My female friend was curious about deepfakes and volunteered to be in this swap. I'll post a bunch of info about the swap after the video, just in case you were curious. If the video doesn't autoplay at 720p, then change it.
After 61 hours of training, this is what I have.
Here is a direct comparison of the two clips
For just 61 hours, I'm pretty happy with how it came out. I probably could have squeezed out decently more quality/realism by letting it train for another 50-100 hours, but I was happy enough with this. The eyes don't have enough detail to show the pupils very well, and while more training would fix that, it's also partially my fault: my friend has dark brown eyes, which blend right into the "normal black blur eyes" faceswap makes before it learns iris detail, and my data video was shot in a relatively dark environment, which makes her iris/pupil even darker. If I had used a girl with bright blue eyes, or even a brightly lit environment with light shining onto the dark brown eyes, the model's irises/pupils would still be blurry, yes, but it would look like a lot more iris progress had been made than it currently shows. It also turns out I didn't have nearly as much teeth data as I thought. Overall, I'd say the swap looks pretty decent as long as you keep the video the same size as the small YouTube playbox on the forum. If you go to theater mode or full screen, you notice the mistakes more easily.
Why DFL-H128? I wanted a 128 input/128 output model, and DFL-H128 seems to be the fastest-training of all the 128/128 options, with an eg/s (examples per second) around 90 on my hardware. Unbalanced, RealFace, and Villain (when all set up for 128 input/output) would all give me around 40-60 eg/s. Dlight... *shivers*... the Dlight model will crash if I even look at it the wrong way. Dfaker will give me 140 eg/s, but it's only 64 input/128 output. I don't know if having the highest eg/s makes DFL-H128 the best 128/128 model to use, though. I'm sure RealFace and Villain are capable of more realistic results, but are those results better than what a lower-quality model like DFL-H128 can do when its speed gets it 50-100% more training in the same amount of time? Who knows. I certainly don't. So I went with more training over the higher-quality model.
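To put those rates in perspective, here's the rough arithmetic, using nothing but the eg/s numbers quoted above:

```python
# Rough arithmetic on the eg/s (examples per second) figures above: total
# face examples each model would chew through in the same 61-hour run.
hours = 61
seconds = hours * 3600

rates = {
    "dfl-h128 (128/128)": 90,
    "unbalanced/realface/villain (128/128)": 50,  # midpoint of the 40-60 range
    "dfaker (64 in / 128 out)": 140,
}

for model, eg_s in rates.items():
    print(f"{model}: {eg_s * seconds:,} examples in {hours}h")
```

That's roughly 19.8 million examples for DFL-H128 vs. about 11 million for the slower 128/128 models, which is the "50-100% more training" I'm talking about.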
For the majority of the training I left the batch size at 28 because of the "80% principle" for single-computer use that I developed. The principle is this: a batch size of 34 is the largest my computer can handle before a low-memory crash, so I set my batch size to about 80% of that max. The eg/s ends up only a tiny bit under what the max batch size gives, but it leaves me enough memory to run Photoshop plus 10 tabs of Chrome, or Adobe Premiere's memory-hogging fat ass. After getting the face mostly there, I switched to a batch size of 16, and eventually 12, because smaller batch sizes supposedly help train fine details better.

For the loss function, the guide and the app's help boxes both state SSIM potentially delivers the most realistic results, but I've personally found that MAE has always gotten me super close to the max training possible, so I normally use it. MSE never sounded appealing either.

I went with a 5.1e-5 learning rate because I'm still trying to figure it out. I understand in concept what the app's help box is saying about the learning rate being like "a rock trying to get down the hill", but personally, I've never seen results that correlated in any way with changing the learning rate, so I have no idea what it does in hard, concrete terms.
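If it helps make the "rock on a hill" picture concrete, here's a toy numpy-free sketch (nothing to do with faceswap's internals) of gradient descent on a single parabola. The learning rate scales each downhill step:

```python
# Toy illustration: gradient descent on f(x) = x^2. The learning rate scales
# each step downhill -- the "rock trying to get down the hill" from the help
# box. Too small and the rock barely moves; too big and it overshoots the
# bottom and flies off.
def descend(lr, steps=20, x=10.0):
    for _ in range(steps):
        grad = 2 * x      # derivative of x^2 at the current position
        x -= lr * grad    # one step downhill, scaled by the learning rate
    return x

for lr in (0.01, 0.1, 0.9, 1.1):
    print(f"lr={lr}: ends at x = {descend(lr):+.4f}")
# lr=0.01 barely moves from x=10; lr=0.1 settles near the bottom;
# lr=0.9 bounces side to side but still converges; lr=1.1 diverges.
```

In a real model there are millions of these parameters instead of one, which is presumably why the effect is so hard to eyeball from training previews.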
Batch Size: mostly 28, then 16 and 12
Loss Function: MAE
Mask Loss Function: MAE
Penalized Mask Loss: enabled
Eye and Mouth Multipliers: left on default until the end, when they were increased
Learning Rate: 5.1e-5
Allow Growth and Mixed Precision: enabled
Everything else not mentioned was left on default.
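For reference, this is roughly how that setup maps to the faceswap command line (from memory, so double-check the flag names with `python faceswap.py train -h`; the folder paths are placeholders). The loss function, learning rate, penalized mask loss, and eye/mouth multipliers live in the app's train settings, not on the command line:

```
python faceswap.py train -A youtuber_faces -B friend_faces -m model_dir -t dfl-h128 -bs 28
```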
GeForce RTX 2060
16GB RAM @ 3300MHz
Ryzen 5 3600 6-Core
Training speed by batch size (each test ran for 30 minutes):
Batch 12: 77.5 eg/s
Batch 18: 77.0 eg/s
Batch 24: 87.6 eg/s
Batch 30: 91.6 eg/s
Batch 36: Crashes
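Note that eg/s folds batch size into the throughput. Dividing it back out gives iterations per second, which shows why a bigger batch isn't free: each step covers more faces, but the steps come slower.

```python
# The eg/s figures measured above, divided by batch size to get
# iterations per second.
results = {12: 77.5, 18: 77.0, 24: 87.6, 30: 91.6}  # batch -> eg/s

for batch, eg_s in results.items():
    print(f"batch {batch:>2}: {eg_s:5.1f} eg/s = {eg_s / batch:.2f} it/s")
# batch 12: 6.46 it/s ... batch 30: 3.05 it/s
```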
As I mentioned in the title, I had a female friend volunteer to be in the faceswap. She let me record a video of her face from almost every angle while she made lots of facial expressions. This was an HD video recorded on a cellphone. I'll show you what the video looked like, but for a tiny bit of anonymity, I'm showing it as a very trimmed-down, very low-res gif. This data was pretty uniform, with the exact same lighting, sharpness, and color in almost every frame.
2,220 frames in the B data set
When choosing whose face to put hers onto, I have a pretty decent eye for good faces for swapping. Using Photoshop, I put my friend's face over this YouTuber's face, doing my best to get the angles right, and I made the layers somewhat translucent so I could see both at the same time when they were on top of one another. Doing this, I could tell that facial features like the eyes, nose, and mouth were in pretty much the same place. I could also check the things you can't control for in faceswap: their face shapes were similar, their jaws came to a similar point, and they had a similar amount of forehead. I only used one single YouTube video as the source for all the clips in this data set, because I really liked the way she looked in this one video. With her other videos, it felt like the final conversion wouldn't look like my friend no matter how well trained the model was. Like I mentioned earlier, I'm not naming the YouTuber, even though it's obvious who she is.
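If you don't have Photoshop, the same translucent-overlay check can be scripted. Here's a minimal PIL sketch; the filenames are placeholders, and it assumes you've already cropped both faces to roughly the same framing and angle:

```python
from PIL import Image

# Overlay two roughly aligned face crops at 50% opacity, like setting two
# Photoshop layers translucent, so you can compare eye/nose/mouth placement,
# jawline, and forehead. The filenames below are placeholders.
face_a = Image.open("youtuber_face.jpg").convert("RGBA").resize((512, 512))
face_b = Image.open("friend_face.jpg").convert("RGBA").resize((512, 512))

overlay = Image.blend(face_a, face_b, alpha=0.5)  # 0.5 = equal mix of both
overlay.save("overlay_check.png")
```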
2,590 frames in the A data set
Extraction used the S3FD detector, the FAN aligner, and hist normalization. Not sure what else I can say.
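That corresponds to something like this on the command line (again from memory, so verify the flags with `python faceswap.py extract -h`; the folder names are placeholders):

```
python faceswap.py extract -i source_frames -o extracted_faces --detector s3fd --aligner fan --normalization hist
```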
This is a compilation video of the first 6 seconds at 5 different stages during training. The video repeats 5 times. The final product I showed at the beginning is not included here.
Here's a gif showing the exact same 5 frames of the video, captured at various levels of training (the same 5 training times mentioned above).
Here's a graph in case you were curious.
-The title was a tiny bit clickbait, since I said my "girl friend", implying she is my "girlfriend", when she is quite literally just a female friend. I constantly refer to her in the article as my female friend. I am not dating her.
-All the videos and gifs showing the various stages of training are at 190 sharpening and 19 quality, while the final video is at 200 sharpening and CRF 10 quality.
-Also, for the final video, I edited it in Premiere after the conversion was done to improve it, adding a little blur and sharpening where applicable. This got me more "realistic" results than converting alone could, and it's something all swappers should consider doing at the end.
-I'm aware the 5-stage video spells "beginning" wrong. I think they put that extra "n" in the word just to trick people.
If any of this was helpful, leave a thanks! For my next project on this forum, I'll be seeing whether I can make a decent-looking swap using only 4 source images (JPEGs) for my B data set, instead of the recommended thousands, using a secret trick.