In this post I will explain why deepfaking VR videos is a lot of effort and most likely not worth your time, and in the process I will kind of unintentionally explain how to deepfake VR videos. Also, all the reasons why you shouldn't are based on the premise that you don't have a supercomputer or multiple computers. Like me.
These are all just my opinions. I've made a decent number of VR deepfakes before. Hopefully I can discourage you from wasting your time, or on the other hand, give you a fun project to test your patience.
TL;DR
VR headsets are needed to view VR, and most are low quality, limiting the experience no matter how high quality you can train your deepfake. 2k resolution is low quality for VR. This is probably the most important reason. Unless you can shell out $800+ for a 4k, 5k, or 7k headset, no matter how much time and effort you put into your VR swap, it won't be super high quality.
Because most VR headsets are low quality, to get the best possible quality you really need to let that sucker train all the way to completion. No shortcuts; you'll probably have to train that model for weeks.
-
Extraction and conversion with massive 2k, 3k, and 5k videos take way, way, way longer.
Before you can even extract, train, or convert, you have to cut that VR video into mini-sized clips, using a beefy video editor that can handle 10GB videos, just so your computer can handle faceswapping VR-sized footage. And by mini, I mean like 10 seconds long. So if you want a 10 minute VR video, prepare to extract and convert 60 clips.
-
After extracting, if you are training and want to test how far the training has come, you have to do a long conversion (for that 10 second mini clip). Then stop faceswap, load up WMR, then load up Steam (you have to do the Steam text message security code), then load up SteamVR, then load up the VR video player app in SteamVR, just to view it. That's a long process compared to simply clicking a file and having the Windows Movies & TV app pop up in seconds for non-VR deepfakes.
-
When you're completely finished with training, and you've done the crazy long conversion process for every single mini clip, you've got to use that same beefy video editor to put all the mini clips back together in order, then export them as one file. And that combined export can take many hours.
-
VR videos are essentially fisheye-lens videos, and as faces move even remotely away from the center of the lens, they begin to stretch out and form massive deformities. This causes a whole host of problems, for example: extraction not recognizing that a face is a face, and training learning stretched-out faces as normal.
-
One possible band-aid for the training problems caused by the stretched faces is running fit-training sessions every couple of minutes of the video, which can be very time consuming if you want to deepfake a long video. You're basically doing more and more training just for one simple video. But to be clear, this isn't a solution, it just helps.
-
TL;DR UPDATE: After talking with bryanlyon in the comments below, we discussed some potential improvements/solutions to some of the reasons above, and I go into what they do and don't fix in section 5.
.
.
Is it worth it? If you want a fun and creative challenge, yes. If you were hoping for something quick and easy, not even close.
.
So the main reason why you shouldn't is that it's an unbelievable amount of time and effort. Here are some examples of what I mean.
.
VR headsets are the real limiter on your experience.
No matter how good your model is, and how perfect the subject is, you are going to be limited by your vr headset.
As a poor VR gamer, I'm very aware of the specs and prices of all the VR headsets available. And unless you're dropping close to 1000 dollars, your average VR headset will only have a max resolution of 2k, maybe 3k if you have the newest one. This sounds like a lot of resolution... I mean, 1080p is more than good enough for your videos on YouTube. But 2k and 3k are bad for VR when you are trying to watch IRL (non-graphics) videos. How bad? Well, 2k resolution is like watching a YouTube video on a monitor at 480p, and 3k is like 720p.

To explain this briefly, look at that 1080p monitor in front of you with a very crisp picture on it. Now imagine you tried to cover all 180 degrees of your peripheral vision with 1080p monitors at the exact same distance: nothing but monitors, arranged like you are inside a sphere. This would probably look super crisp and high quality, but remember, each of those monitors is 1080p individually. All those monitors combined aren't 1080p, or even 2k or 3k; we're talking like 20 monitors, so 20 x 1080p if you add them all together. (This is not how VR actually works, I'm just using it to build intuition.) And this is why 2k and 3k are pretty low res and bad for watching IRL videos in VR. So if you are swapping 2k videos, your experience will be similar to swapping 480p video on a monitor, because your headset caps out at 2k.

Those 1000 dollar headsets I mentioned, with 4k, 5k, 6k, or 7k resolution, do exist, but since the premise of this post is that you can't afford a supercomputer, you probably can't afford those either. (Also, for those curious why I kept saying "IRL (non-graphics) videos": VR video games don't suffer from this same issue because you don't expect literal photorealism where you can see every single crack in the concrete or the fine dust floating in the sunlight.)
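The monitor-sphere analogy above can be put into rough numbers. This is a back-of-envelope sketch, not real optics; the field-of-view figures are illustrative assumptions.

```python
# Rough back-of-envelope math for why "2k" feels like 480p in VR.
# The FOV numbers are assumptions, not measured optics.

def pixels_per_degree(horizontal_pixels: int, horizontal_fov_degrees: float) -> float:
    """Average horizontal pixels available per degree of field of view."""
    return horizontal_pixels / horizontal_fov_degrees

# A 1080p monitor at desk distance fills very roughly 30 degrees of your view.
monitor_ppd = pixels_per_degree(1920, 30)   # 64 px per degree

# A "2k"-class (2880-wide) VR video gets stretched across ~180 degrees.
vr_ppd = pixels_per_degree(2880, 180)       # 16 px per degree

print(f"monitor: {monitor_ppd:.0f} px/deg, VR: {vr_ppd:.0f} px/deg")
```

The VR video ends up with roughly a quarter of the monitor's pixel density, which is about the same drop as going from 1080p down to 480p-ish on a flat screen.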
You need your model to reach maximum possible training as a result of reason 1.
Because your viewing experience in your 2k or 3k headset will be similar to watching 480p or 720p on your monitor, you need to train that model to the very max so that every last bit of quality can be shown, because you do not have a lot to work with here. Naturally, with any swap you do, you should try to train your model to the maximum, but usually you can get away with mostly finished training and get really good results instead of perfect results. Not in VR though; it has to be perfect. So expect to wait out an entire model training. If you are a perfectionist/completionist and always do this regardless, then you won't be affected by this. So basically, you'll wait a crazy long amount of time. I don't know how long your models take, but my RTX 2060 usually takes about 200+ hours to get a DFL-H128 model to almost lifelike levels of quality.
Extraction and conversion take dramatically longer.
As I'm sure you know, file size grows fast with video resolution (roughly with the pixel count, so doubling the width and height quadruples the data). Once you start trying to use faceswap at sizes above 1080p, this fact becomes really noticeable. Also, most VR videos tend to be above 30 fps, usually 60, which again doubles the size. And don't forget that if you are doing a side-by-side or over-under video, there are 2 faces in every frame, so that's twice the number of faces that need to be extracted.

With my Nvidia RTX 2060, I generally avoid extracting and converting 1080p clips longer than 3 minutes, but I have done 6 minute clips and have no doubt it could do 10 minute clips. When I'm doing 5k resolution VR videos, it takes 20 minutes just to extract a 5k, 60fps, 20 second clip, and then, with a convert batch size of 1 (anything higher will crash), about 30 minutes to convert. While it's converting that clip at a batch size of one, according to my task manager my CPU is at 100% and my memory likes to dance around 95%, so I literally can't do anything else with the computer or the whole thing will freeze. And I mean nothing, not even going through my file explorer or photos. I go on my phone when I do this.
So yeah, it will take a long time.
-
You have to cut your VR video into small clips.
5, 10, or 20 second VR clips are pretty hard to come by. Most videos you find, say on YouTube, are 5, 10, or 15 minutes. So you will need a way to download the video (at that max 2k, 3k, or 5k resolution), then you will need a video editor with enough power to edit and render a 5GB video file. Cut the video down into segments of 5, 10, or 20 seconds (or the largest your computer can handle) and export each little clip one at a time. For me, it took Adobe Premiere about 1 minute 30 seconds to export a 20 second 60fps clip, and I did that 9 times for a 3 minute video. So it will just be more time.
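If you'd rather not export each chunk by hand, ffmpeg's segment muxer can do the slicing in one shot. This is a sketch that just builds the command; "input.mp4", the chunk length, and the output pattern are placeholders for your own file and limits.

```python
# Sketch: split a big VR file into ~20 second chunks with ffmpeg's
# segment muxer instead of hand-exporting clips from an editor.

def segment_command(source: str, seconds: int, pattern: str = "clip_%03d.mp4"):
    """Build an ffmpeg command that splits `source` into equal chunks.

    -c copy avoids re-encoding (so it's fast), but cuts can only land
    on keyframes, so chunk lengths are approximate.
    """
    return [
        "ffmpeg", "-i", source,
        "-c", "copy",
        "-f", "segment",
        "-segment_time", str(seconds),
        "-reset_timestamps", "1",
        pattern,
    ]

cmd = segment_command("input.mp4", 20)
print(" ".join(cmd))
```

Run the printed command in a terminal (with ffmpeg installed) and you get clip_000.mp4, clip_001.mp4, and so on.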
-
Testing the progress of your training is annoyingly long.
This isn't a problem if you're the type of person who has no problem letting your model train for weeks completely uninterrupted. But if you are like me and want to see if the past day or two has made any noticeable difference, first you have to go through a long conversion (refer back to point 3: 30 minutes for a 10 second clip). Then, assuming you are using a Windows Mixed Reality based headset, you have to boot up WMR, then Steam, then SteamVR (unless you have a VR video player outside of Steam). Then you load the clip in your video player and can finally see if any progress was made. This is all resource intensive, so you can't run faceswap in the background while doing any of it. You may have a way of testing with slightly fewer steps, but as far as I know, there's nothing even close to normal faceswapping, where after you convert you start the training again, then simply click the video file and Windows Movies & TV starts playing it within seconds. So, in other words, an unnecessary amount of time to test.
After you've downloaded the video, cut it into pieces, extracted, trained, and finally converted, you then have to put all those pieces back together!
I don't have a recent measurement, but I remember that a 9 minute, 5k, 60fps video composed of like 50 individual clips took Adobe Premiere around 2 hours to export back into one big file. Basically a soul crushing amount of time, since your swap is so close to being done but now you have to wait a couple more hours.
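The re-stitching step is another place ffmpeg can save hours: its concat demuxer joins clips with a stream copy, no re-encoding. This sketch builds the file list and the command; the clip names and output file are placeholders.

```python
# Sketch: rejoin converted clips with ffmpeg's concat demuxer
# instead of a multi-hour editor export. Filenames are placeholders.

def write_concat_list(clips, list_path="clips.txt"):
    """Write the list file the concat demuxer expects: one `file '...'` per line."""
    with open(list_path, "w") as f:
        for clip in clips:
            f.write(f"file '{clip}'\n")
    return list_path

def concat_command(list_path, output="final.mp4"):
    # -c copy stitches without re-encoding, so this takes minutes, not hours.
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_path, "-c", "copy", output]

list_path = write_concat_list([f"clip_{i:03d}.mp4" for i in range(50)])
print(" ".join(concat_command(list_path)))
```

The catch: stream copy only works cleanly if every clip was encoded with the same codec and settings, which they will be if they all came out of the same faceswap convert.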
.
-
VR videos are recorded with fisheye lenses and as a result are highly distorted.
Ok, maybe they're not technically fisheye lenses, but the video you get is extremely similar, and this causes a whole bunch of problems with the faceswapping process. When the face is in the center of the video frame, it looks normal. But when it moves from the center towards the edges, it gets really stretched and distorted. In your VR headset, these stretches and distortions look completely normal, because the headset projects that fisheye video onto the inside of a sphere, canceling them out. But that's inside the VR headset. You know what's outside of VR? Faceswap. So faceswap sees it all distorted. Here are some examples of what a difference this can make.
.
And this causes a whole bunch of problems. First of all, if a face is too close to the edge and is too deformed by the fisheye effect, faceswap might not even recognize it as a face. Faceswap's struggle to recognize a face near the edges of a VR video is only multiplied when there are obstructions in front of the face, usually hair. When I'm extracting videos where a face goes that far off the edge, sometimes I have to set the detect confidence to 5 (which you can only do by typing it in instead of using the slider), and it usually results in about 10,000 detections on a video that's only supposed to have like 1,000. So you are really getting every single false positive possible just in the attempt to catch those few actual positives near the edge. And as I mentioned, if there's long hair obstructing the face at the edge, you've got to turn on vgg-obstruct, and it's got to run that super slow mask over all 10,000 of those faces. Yeah, progress gets pretty slow sometimes.
.
The next problem is that when faces are stretched out, the model thinks during training that this is what the face looks like. And obviously, it can have a hard time working out the mapping between a monstrously deformed face and a normal one. This usually results in a very, very slim face being placed on what's supposed to be a wide face. And that's an issue, because you want the swapped face to be wide at the edges, so that when it's projected in your VR headset it turns back to a normal shape. If you are curious what I mean by this improper face size on the stretched-out head, I whipped up an example in Photoshop.
.
Now, I mentioned this all happens in a certain region of the video frame. If you are curious how big the zone is where fisheye distortion will affect the video, I made an image for you right below. The face will deform to some degree at just about every point in the image, but in the dark green circle it has almost no effect on the final converted product; in other words, the converted face will still look 95-99% normal. Also, in the dark green zone, faceswap usually has no problem recognizing and extracting most front-facing faces. The wider light green circle is where things start to get kind of rough. The converted face will begin to look weird in this area, but I've personally never found it so bad that it ruins the immersion. When it comes to extraction, though, the light green area is where faceswap seriously starts to struggle; at this point you need to lower the detect confidence down to like 15, which will give lots of false positives. Beyond the light green, where there's no circle, extraction is almost impossible. You have to set the detect confidence down to like zero, and even then you might only get 50% of the possible frames if you are lucky, and any obstruction of the face will essentially make it unrecognizable. And when you do get faces in this area, the converted product definitely looks bad.
.
-
Sunk-cost fallacy with fit training
Also, because the fisheye faces are so wildly and drastically different from normal faces, it generally feels like every couple of minutes of VR video should be fit trained. My worry is that once I've done the first fit training and made the model learn this monstrosity of a weirdly squashed face that I assure faceswap is the same person, perhaps its ability to do normal faces has become less precise, and the same goes for faces that are stretched out vertically rather than squashed. If it just learned a weirdly squashed face, can it really do a vertically stretched one? I don't know. What I do know is this: I. Have. Already. Put. This. Much. Time. And. Effort. Into. This. VR. Faceswap. That. I'm. Not. Risking. It. So screw it. Let it re-fit train on the new 3 minute section of video for like 5 hours after it just finished the previous 3 minute section, over and over again. It's a long process.
-
UPDATE: After talking with BryanLyon in the comments below, we discussed some possible solutions to some of the reasons I mentioned above, and because I don't like reading other people's rambling conversations in forums, I'm including it in this original post.
Using FFmpeg does significantly improve the process, as it's way faster than Premiere, Vegas, Shotcut, etc., but it can only improve some things. If you have a VR video and you want to make one simple cut, say you only want the first half of the video, then FFmpeg will do it way faster. But in a situation where you want many specific short clips out of a very long video, using FFmpeg to get all those clips and combine them will take just as long as making a supercut in Premiere and exporting it.
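For that "one simple cut" case, this is roughly what the fast ffmpeg version looks like. The sketch just builds the command; the timestamps and filenames are placeholders for your own video.

```python
# Sketch: grab the first chunk of a video with a stream copy.
# Timestamps and filenames are placeholders.

def cut_command(source, start="00:00:00", duration="00:05:00",
                output="first_half.mp4"):
    # Putting -ss before -i makes the seek fast (input seeking),
    # and -c copy skips re-encoding entirely.
    return ["ffmpeg", "-ss", start, "-i", source,
            "-t", duration, "-c", "copy", output]

cmd = cut_command("vr_video.mp4")
print(" ".join(cmd))
```

Because nothing is re-encoded, a cut like this finishes in seconds even on a multi-gigabyte VR file; the cut point just snaps to the nearest keyframe.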
Using FFmpeg to slice the SBS video into two individual videos does solve reasons 3, 3.1, and 3.3, i.e. how long extracting and converting take with large video resolutions. By cutting that 5400x2700 video into two videos that are each 2700x2700, it becomes drastically easier for your computer to handle. Instead of a 20 second 5400x2700 clip, my computer can now handle a 120 second 2700x2700 clip. How helpful is this? It has the potential to decrease the time required by, I don't know, say 50%. But the important thing to remember is that while it decreases the time you wait by 50%, it increases the amount of work you have to do by 100% (unless you're a master at FFmpeg and can create one single batch command that does it all for you), because now you've got to slice each video in half, apply the dewarping process to each side, keep track of which video is left and which is right, etc., then suture them back together after faceswap.
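The slicing and re-stitching can be done with ffmpeg's crop and hstack filters. This sketch builds the commands; the 5400x2700 resolution and filenames are example values from my own footage.

```python
# Sketch: split a side-by-side VR video into per-eye halves and
# stitch them back after faceswap. Resolution/filenames are examples.

def split_sbs_commands(source, width=5400, height=2700):
    """Build two ffmpeg commands: crop=w:h:x:y takes the left and right eye."""
    half = width // 2
    left = ["ffmpeg", "-i", source,
            "-filter:v", f"crop={half}:{height}:0:0", "left.mp4"]
    right = ["ffmpeg", "-i", source,
             "-filter:v", f"crop={half}:{height}:{half}:0", "right.mp4"]
    return left, right

def rejoin_sbs_command(left="left.mp4", right="right.mp4", output="sbs.mp4"):
    # hstack places the two eyes back side by side in one frame.
    return ["ffmpeg", "-i", left, "-i", right,
            "-filter_complex", "hstack=inputs=2", output]

left_cmd, right_cmd = split_sbs_commands("vr_video.mp4")
print(" ".join(left_cmd))
print(" ".join(right_cmd))
```

One thing to keep straight: which output file is which eye. Mixing up left and right when you rejoin makes the 3D effect invert, which is instantly nauseating in a headset.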
Dewarping (removing the lens distortion) before giving the video to faceswap and adding it back afterwards does help, but it's not a miracle cure. This partially addresses reason 4, about stretched-out and distorted faces. Here's what it will do: it will help faceswap recognize more faces that got too close to the edge and are stretched out. Not by a large amount, though; with a moderate amount of dewarping, I estimate faceswap recognized about 10-15% more faces than it would have otherwise. That was a little disappointing, as I was hoping it would save every single frame faceswap misses because of the distortion. The more important thing is that it really helps correct the shape of all those faces in that light green zone from my picture above, so faceswap will do a better conversion on them. And once faceswap is done and you add the lens distortion back, your faces are now properly distorted again, so they will be properly projected onto the inside of your headset. This also has some effect in the dark green zone, but only a minimal one, since those faces aren't that distorted to begin with.

This does come with a downside, though, and it seems to be an area where I'm not properly understanding BryanLyon. When you dewarp the video before giving it to faceswap, you are squishing all those pixels together at the edges, and you lose pixel quality in the affected areas. When faceswap is done and you add the distortion back, you are basically stretching those edge pixels back out, and if you've ever stretched an image in MS Paint, you know how low quality that can look. BryanLyon has a solution for this that, if I'm honest, doesn't make any sense to me at all; but I realized something. This problem technically isn't a problem for you. Yes, the edges of my 5400x2700 video have had their quality damaged.
...But my headset can't even do 5400x2700. When I played the video with the damaged edge pixels, my headset was low enough resolution that it literally made no difference and looked 99% normal. So if you have an average VR headset like mine and you're doing a high res VR video, you don't have to worry about this.
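For reference, ffmpeg's v360 filter is one way to do the dewarp/rewarp round trip. Fair warning: the projection names and FOV numbers below are assumptions; they depend entirely on how your footage was shot (this sketch assumes VR180-style half-equirectangular per eye), so expect to experiment.

```python
# Sketch of the dewarp/rewarp round trip using ffmpeg's v360 filter.
# Projection names (hequirect/flat) and the 120-degree FOV are
# assumptions about the footage, not universal values.

def dewarp_command(source, output="flat.mp4", fov=120):
    # hequirect -> flat: project one eye onto a flat plane so
    # faceswap sees undistorted faces.
    vf = f"v360=hequirect:flat:h_fov={fov}:v_fov={fov}"
    return ["ffmpeg", "-i", source, "-filter:v", vf, output]

def rewarp_command(source, output="warped.mp4", fov=120):
    # flat -> hequirect: put the distortion back so the headset
    # projects the swapped video correctly again.
    vf = f"v360=flat:hequirect:ih_fov={fov}:iv_fov={fov}"
    return ["ffmpeg", "-i", source, "-filter:v", vf, output]

print(" ".join(dewarp_command("left.mp4")))
print(" ".join(rewarp_command("flat.mp4")))
```

You'd run the dewarp on each eye's video before extraction, and the rewarp on faceswap's converted output before rejoining the eyes.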
.
If you'd like a basic explanation of how to use FFmpeg for this purpose, from a person who knew nothing about FFmpeg a couple of days ago (and still doesn't know a lot) and can explain it like you're 5 years old, just reply in the comments and I'll make a guide.
.
Just thought I'd type this up and post it while waiting for fit-training to finish a brand new 3 minute section of my VR deepfake. Hopefully you liked hearing my opinions and maybe found them helpful.