Challenges faceswapping VR videos

cosmico
Posts: 94
Joined: Sat Jan 18, 2020 6:32 pm
Has thanked: 13 times
Been thanked: 28 times

Challenges faceswapping VR videos

Post by cosmico »

In this post I'll explain why deepfaking VR videos is a lot of effort and most likely not worth your time, and in the process I'll end up explaining, more or less unintentionally, how to deepfake VR videos. All the reasons not to do it assume you don't have a supercomputer or multiple computers. Like me.
These are all just my opinions. I've done a decent number of VR videos before. Hopefully I can discourage you from wasting your time or, on the other hand, give you a fun project that tests your patience.

TL;DR

  1. You need a VR headset to view VR, and most headsets are low resolution, which limits the experience no matter how high quality you train your deepfake. 2k resolution is low quality for VR. This is probably the most important reason. Unless you can shell out $800+ for a 4k, 5k, or 7k headset, your VR swap won't look super high quality no matter how much time and effort you put into it.

  2. Because most VR headsets are low resolution, getting the best possible quality means you really need to let that sucker train all the way to completion. No shortcuts. You'll probably have to train that model for weeks.

  3. Extraction and conversion of massive 2k, 3k, and 5k videos take way, way longer.

    • Before you can even extract, train, or convert, you have to cut the VR video into mini clips, using a video editor beefy enough to handle 10 GB files, just so your computer can cope with faceswapping VR-sized video. And by mini, I mean like 10 seconds long. So if you want a 10 minute VR video, prepare to extract and convert 60 clips.

    • After extracting, if you're training and want to check how far the training has come, you have to do a long conversion (for that 10 second mini clip), then stop faceswap, load up WMR, load up Steam (and do the Steam text message security code), load up SteamVR, and finally load up the VR video player app in SteamVR, just to view it. That's a long process compared to clicking a file and having the Windows Movies & TV app pop up in seconds for non-VR deepfakes.

    • When you're completely finished training and have done the crazy long conversion for every single mini clip, you have to use that same beefy video editor to put all the mini clips back together in order and export them as one file. That combined export can take many hours.

  4. VR videos are essentially fisheye lens videos, and as faces move even slightly away from the center of the lens, they stretch out and become massively deformed. This causes a whole host of problems, for example extraction not recognizing that a face is a face, and training learning stretched-out faces as normal.

    • One possible "band-aid" for the training problems caused by stretched faces is a serious fit-training session for every couple of minutes of video, which can be very time consuming if you want to deepfake a long video. You're basically doing more and more training just for one simple video. To be clear, this isn't a solution; it just helps.

  5. TL;DR UPDATE: after talking with bryanlyon in the comments below, we discussed some potential improvements/solutions to some of the reasons above. I go into what they fix and don't fix in the UPDATE section at the end of this post.

Is it worth it? If you want a fun and creative challenge, yes. If you were hoping for something quick and easy, not even close.
So the main reason you shouldn't is that it takes an unbelievable amount of time and effort. Here are some examples of what I mean.

  1. VR headsets are the real limiter on your experience.
    No matter how good your model is and how perfect the subject is, you're going to be limited by your VR headset.
    As a poor VR gamer, I'm very aware of the specs and prices of all the VR headsets available. Unless you're dropping close to 1000 dollars, your average VR headset maxes out at about 2k resolution, maybe 3k if you have the newest one. That sounds like a lot. I mean, 1080p is more than good enough for your videos on YouTube. But 2k and 3k are bad for VR when you're trying to watch IRL (non-graphics) videos. How bad? Well, 2k in a headset is like watching a YouTube video on a monitor at 480p, and 3k is like 720p.
    To explain this briefly, look at that 1080p monitor in front of you with a very crisp picture on it. Now imagine you covered all 180 degrees of your peripheral vision with 1080p monitors at the exact same distance, so you see nothing but monitors, arranged like you're inside a sphere. This would probably look super crisp, but remember, each of those monitors is 1080p individually. All of them combined aren't 1080p, or 2k, or 3k. We're talking something like 20 monitors' worth of 1080p pixels. A 2k or 3k headset has to spread far fewer pixels over that same field of view, and that's why 2k and 3k are pretty low res for watching IRL (non-graphics) videos in VR. (This is not how VR actually works, btw. It's just a way of understanding it.)
    So if you're swapping 2k videos, your experience will be similar to swapping 480p video for a monitor, because your headset caps out at 2k. Those 1000 dollar headsets I mentioned, with 4k, 5k, 6k, or 7k resolution, do exist, but since the premise of this post is that you can't afford a supercomputer, you probably can't afford those either. (For those curious why I keep saying "IRL (non-graphics) videos": VR video games don't suffer from the same issue because you don't expect literal photorealism where you can see every single crack in the concrete or the fine dust floating in the sunlight.)

  2. You need your model trained as far as it will go, as a result of reason 1.
    Because viewing in your 2k or 3k headset will be similar to watching 480p or 720p on your monitor, you need to train that model to the very max so that every last bit of quality shows, because you don't have a lot to work with here. Naturally, with any swap you should try to train your model all the way, but usually you can get away with mostly finished training and still get really good results instead of perfect ones. Not in VR, though. It has to be perfect. So expect to wait out an entire model training. If you're a perfectionist/completionist and always do this anyway, you won't be affected. Otherwise you'll be waiting a crazy long time. Idk how long your models take, but my RTX 2060 usually takes 200+ hours to get a dfl-h128 model to almost lifelike levels of quality.

  3. Extraction and conversion take far longer.
    As I'm sure you know, the amount of data grows quickly with video resolution: doubling both the width and the height means four times the pixels in every frame. Once you start using faceswap at sizes above 1080p, this becomes really noticeable. Most VR videos also tend to be above 30 fps, usually 60, which doubles the number of frames. And don't forget that in a side-by-side or over-under video there are 2 faces in every frame, so that's twice as many faces to extract. With my Nvidia RTX 2060, I generally avoid extracting and converting 1080p clips longer than 3 minutes, though I have done 6 minute clips and have no doubt it could even do 10 minute clips. When I'm doing 5k VR videos, it takes 20 minutes just to extract a 20 second 5k 60 fps clip, and then about 30 minutes to convert it with a convert batch size of 1 (anything higher crashes). While it's converting that clip, Task Manager shows my CPU at 100% and my memory dancing around 95%, so I literally can't do anything else with the computer or the whole thing will freeze. And I mean nothing, not even browsing File Explorer or photos. I go on my phone when I do this.
    So yeah, it will take a long time.

    • You have to cut your VR video into small clips.
      5, 10, or 20 second VR clips are pretty hard to come by. Most videos you find, say on YouTube, are 5, 10, or 15 minutes long. So you need a way to download the video (at that max 2k, 3k, or 5k resolution), and then a video editor with enough power to edit and render a 5 GB video file. Cut the video into segments of 5, 10, or 20 seconds (or the longest your computer can handle) and export each little clip one at a time. For me, Adobe Premiere took about 1 minute 30 seconds to export one 20 second 60 fps clip, and I did that 9 times for a 3 minute video. So it's just more time.

    • Testing the progress of your training is annoyingly long.
      This isn't a problem if you're the type of person who is happy to let your model train for weeks completely uninterrupted. But if you're like me, you want to see whether the past day or two has made any noticeable difference. First you have to go through a long conversion (refer back to point 3: 30 minutes for a 20 second clip). Then, assuming you're using a Windows Mixed Reality based headset, you have to boot up WMR, then Steam, then SteamVR (unless you have a VR video player outside of Steam). Then you load the clip in your video player and can finally see whether any progress was made. This is all resource intensive, so you can't run faceswap in the background while doing any of it. You may have a way of testing and viewing the converted video with slightly fewer steps, but as far as I know nothing comes close to normal faceswapping, where after converting you start the training again, click the video file, and the Windows Movies & TV app is playing it within seconds. In other words, testing takes an unnecessary amount of time.

    • After you've downloaded the video, cut it into pieces, extracted, trained, and finally converted, you then have to put all those pieces back together!
      I don't have a recent measurement, but I remember a 9 minute 5k 60 fps video, composed of about 50 individual clips, taking Adobe Premiere about 2 hours to export back into one big file. Basically a soul-crushing amount of time, since your swap is so close to being done but now you have to wait a couple more hours.

  4. VR videos are recorded with fisheye lenses and as a result are highly distorted.
    OK, maybe they're not technically fisheye lenses, but the video you get is extremely similar. And this causes a whole bunch of problems for the faceswapping process. When a face is in the center of the video frame, it looks normal. But as it moves from the center toward the edges, it gets really stretched and distorted. In your VR headset these stretches and distortions look completely normal, because the headset projects that fisheye video onto the inside of a sphere, which cancels them out. But that's inside the VR headset. You know what's outside of VR? Faceswap. So faceswap sees it all distorted. Here are some examples of what a difference this can make.
    And this causes a whole bunch of problems. First of all, if a face is too close to the edge and too deformed by the fisheye effect, faceswap might not even recognize it as a face. Faceswap's struggle to recognize faces at the edges of a VR video only gets worse when something obstructs the face, usually hair. When I'm extracting videos where a face goes this far toward the edge, I sometimes have to set the detect confidence to 5, which you can only do by typing it in instead of using the slider, and that usually results in about 10,000 extracted faces for a video that should only have about 1,000. So you're really getting every possible false positive just to catch those few actual positives near the edge. And as I mentioned, if long hair is obstructing the face at the edge, you have to turn on vgg-obstruct, and it has to run that super slow vgg-obstruct mask over all 10,000 of those faces. Yeah, progress gets pretty slow sometimes.
    The next problem is that when faces are stretched out, the model thinks during training that this is what the face really looks like. And obviously it can have a hard time learning the mapping between a monstrously deformed face and a normal face. This usually results in a very, very slim face being placed on what's supposed to be a wide face. That's an issue, because you want the swapped face to be wide at the edges: when it's projected in your VR headset, it gets turned back into a normal shape. If you're curious what I mean by this improper face size on the stretched-out head, I whipped up an example in Photoshop:
    Image
    Now, I mentioned that this all happens in a certain region of the video frame. If you're curious how big the zone is where fisheye distortion affects the video, I made an image right below. Faces deform at least slightly at just about every point in the image, but inside the dark green circle it has almost no effect on the final converted product. In other words, the converted face will still look 95-99% normal. Also in the dark green, faceswap usually has no problem recognizing and extracting most front-facing faces. The wider light green circle is where things start to get kind of rough. The converted face will begin to look weird in this area, but I've personally never found it bad enough to ruin the immersion. When it comes to extraction, however, the light green area is where faceswap seriously starts to struggle. At this point you need to lower the detect confidence to around 15, which will give lots of false positives. Beyond the light green, where there's no circle, extraction is almost impossible. You have to set the detect confidence down to around zero, and even then you might only get 50% of the possible frames if you're lucky, and any obstruction of the face will essentially stop it from being recognized. And when you do get faces in this area, the converted product definitely looks bad.
    Image

    • Sunk-cost fallacy with fit training
      Because the fisheye faces are so drastically different from normal faces, it generally feels like every couple of minutes of VR video should be fit trained. My thinking is that if the first fit training made the model learn this monstrosity of a weird squashed face that you assure faceswap is the same person, then perhaps its ability to do normal faces has become less precise, and the same goes for faces that aren't squashed but stretched out vertically. If it just learned a weirdly squashed face, can it really do a vertically stretched one? IDK. What I do know is this. I. Have. Already. Put. This. Much. Time. And. Effort. Into. This. VR. Faceswap. That. I'm. Not. Risking. It. So screw it. Let it re-fit train on the new 3 minute section of video for like 5 hours after it just finished the previous 3 minute section, over and over again. It's a long process.
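
To put rough numbers on reasons 3 and 3.1, here is a minimal Python sketch of the arithmetic. The function names are made up for illustration, and the only figures used are ones quoted in this thread (1080p as the baseline, the 5400x2700 60 fps video, and the 10 and 20 second clip lengths). It only counts pixels and clips; it says nothing about how faceswap's actual run time scales on a given machine.

```python
import math


def pixel_rate(width: int, height: int, fps: float) -> float:
    """Pixels that have to be decoded and re-encoded per second of video."""
    return width * height * fps


def clip_count(total_seconds: float, clip_seconds: float) -> int:
    """Number of mini clips needed to cover the whole video."""
    return math.ceil(total_seconds / clip_seconds)


baseline = pixel_rate(1920, 1080, 30)   # ordinary 1080p at 30 fps
vr_5k = pixel_rate(5400, 2700, 60)      # the 5400x2700 60 fps example from this thread
workload_ratio = vr_5k / baseline

print(f"5400x2700@60 is {workload_ratio:.1f}x the pixel rate of 1080p@30")
print(f"10 minutes in 10 second clips: {clip_count(600, 10)} clips")
print(f"3 minutes in 20 second clips: {clip_count(180, 20)} clips")
```

The ratio comes from multiplying width, height, and frame rate, so going from 1080p to a 5k VR frame multiplies the work by a fixed factor rather than growing without bound. On top of that, every side-by-side frame holds two faces to extract.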

UPDATE: After talking with bryanlyon in the comments below about some possible solutions to the reasons I mentioned above, and because I don't like reading other people's rambling forum conversations, I'm including the results in this original post.

  • Using FFmpeg does significantly improve the process, as it's way faster than Premiere, Vegas, Shotcut, etc., but it can only improve some things. If you have a VR video and want to make one simple cut, say you only want the first half of the video, then FFmpeg will do it way faster. But if you want many specific short clips from a very long video, using FFmpeg to get all those clips and then combine them will take just as long as making a supercut in Premiere and exporting it.

  • Using FFmpeg to slice the SBS video into two individual videos does address reasons 3, 3.1, and 3.3, basically how long extracting and converting take with large video resolutions. Cutting that 5400x2700 video into 2 videos that are 2700x2700 makes it much easier for your computer to handle. Instead of a 20 second 5400x2700 clip, my computer can now handle a 120 second 2700x2700 clip. How helpful is this? It has the potential to decrease the time required by... idk... say 50%. But the important thing to remember is that while it decreases the time you wait by 50%, it increases the effort you have to put in by 100% (unless you're a master at FFmpeg and can create one single batch command that does all the work for you). Now you have to slice each video in half, apply the dewarping process to each side, keep track of which video is left and which is right, etc., and then suture them back together after faceswap.

  • Dewarping (removing the lens distortion) before giving the video to faceswap, and adding the distortion back afterwards, does help, but it's not a miracle cure. It partially addresses reason 4 about stretched-out and distorted faces. Here's what it will do: it helps faceswap recognize more of the faces that got too close to the edge and are stretched out. Not by a large amount, though. With a moderate amount of dewarping, I estimate faceswap recognized about 10-15% more faces than it would have without it. That was a little disappointing, as I was hoping it would "save" every single frame faceswap misses because of the distortion. The more important thing is that it really helps correct the shape of all the faces in the light green zone from my picture above, so faceswap does a better conversion on them, and once faceswap is done and you add the lens distortion back, your faces are properly distorted! So now they'll be properly projected onto the inside of your headset. It also has some effect on the dark green zone in my picture, but only a minimal one, since those faces aren't that distorted to begin with.
    This comes with some downsides, though, and it seems to be an area where I'm not properly understanding bryanlyon. When you dewarp before giving the video to faceswap, you squish the pixels at the edges of the sphere together, and you lose quality in the pixels affected. When you add the distortion back after faceswap, you're basically stretching those edge pixels back out, and if you've ever stretched an image in MS Paint, I'm sure you know how low quality that can look. bryanlyon has a solution for this below that, if I'm honest, doesn't make any sense to me at all. But I realized something: this problem technically isn't a problem for you. Yes, the edges of my 5400x2700 video have had their quality damaged. But my headset can't even do 5400x2700. When I played the video with the damaged edge pixels, my headset was low enough resolution that it literally made no difference and looked 99% normal. So if you have an average VR headset like mine and you're doing a high-res VR video, you don't have to worry about this.

If you'd like a basic explanation of how to use FFmpeg for this purpose from a person who didn't know anything about FFmpeg a couple of days ago (and still doesn't know a lot) and can explain it to you like you're 5 years old, just reply in the comments and I'll make a guide.

Just thought I'd type this up and post it while I'm waiting for fit-training on a brand new 3 minute section of my VR deepfake ;) ;) ;) Hopefully you liked hearing my opinions and maybe found them helpful.

Last edited by bryanlyon on Sat Feb 13, 2021 12:58 am, edited 8 times in total.
Reason: Renamed title to make it easier for people to know what the discussion is about.

bryanlyon
Site Admin
Posts: 596
Joined: Fri Jul 12, 2019 12:49 am
Answers: 40
Location: San Francisco
Has thanked: 3 times
Been thanked: 153 times

Re: Why you shouldn't Faceswap VR videos

Post by bryanlyon »

For VR videos, I recommend first dewarping and then re-warping the video. Dewarping takes the warped image and returns a "flat" view, effectively removing the VR warp effect. You can rewarp the video after you're done converting.

Additionally, VR videos are typically larger in resolution than can be easily accelerated by ffmpeg. For this reason, even if you don't dewarp, you can get better speed by "splitting" the left and right eye into separate videos.

Also, best to do the majority of your training on non-VR video.

Note that while Non-vr 3d movies do not require dewarping, they still have reduced resolution and it works best to train with non 3d data.

Edit: Also, don't use Premiere to stitch videos together, learn how to use ffmpeg which can do that stitching in seconds, even on non-acceleratable video sizes. (though stitching the left and right back into a single frame does take some time)
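
As a rough illustration of the ffmpeg-based cutting and stitching mentioned above, the Python sketch below only assembles the command lines; it does not run ffmpeg. It assumes ffmpeg's segment muxer for cutting and its concat demuxer for joining, both with stream copy. The file names and helper function names are hypothetical, and nothing here is confirmed as the exact workflow used in this thread.

```python
from __future__ import annotations

import shlex


def segment_command(src: str, clip_seconds: int,
                    pattern: str = "clip_%03d.mp4") -> list[str]:
    """Cut src into clips of roughly clip_seconds each without re-encoding."""
    return [
        "ffmpeg", "-i", src,
        "-c", "copy",                        # stream copy, so no re-encode
        "-f", "segment",                     # ffmpeg's segment muxer
        "-segment_time", str(clip_seconds),
        "-reset_timestamps", "1",            # every clip starts at t=0
        pattern,
    ]


def concat_list(clips: list[str]) -> str:
    """Contents of the text file the concat demuxer reads, one clip per line.

    Assumes the file names contain no single quotes.
    """
    return "".join(f"file '{name}'\n" for name in clips)


def concat_command(list_file: str, dst: str) -> list[str]:
    """Join the clips listed in list_file back into one file, again by stream copy."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", dst]


print(shlex.join(segment_command("vr_video.mp4", 20)))
print(concat_list(["clip_000.mp4", "clip_001.mp4"]), end="")
print(shlex.join(concat_command("clips.txt", "vr_video_joined.mp4")))
```

Two caveats on this sketch: stream-copy segmenting can only cut at keyframes, so clip lengths are approximate, and a stream-copy concat only works cleanly when every converted clip was written with the same codec settings.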

Last edited by bryanlyon on Wed Feb 10, 2021 8:57 pm, edited 1 time in total.
Reason: Add info on ffmpeg for stitching.


Re: Why you shouldn't Faceswap VR videos

Post by cosmico »

Hey, I really appreciate the comment!
I remember you telling me about the dewarping technique in an old post of mine. I tried it out and it worked pretty decently. But in the process, all the pixel data that was supposed to be wide and stretched out was destroyed when it was squished into appropriate facial proportions for faceswap. Re-warping after training/conversion doesn't un-destroy that pixel data, so everything around the outer corners of the fisheye looks really low resolution. I'm curious whether you ever found a solution to that. (Also, I could never "re-warp" it back perfectly to the way it was before dewarping, even using the exact same settings in the opposite direction. There's always something a tiny bit off. I'm curious how you solved that.)

As for splitting the videos with effmpeg, I think I remember you mentioning that before as well, but I always assumed you meant literally cropping the video into two halves in a video editor, running them through faceswap, and putting them back together, which seemed like too much effort. Is this what the "slice" tool in Tools > effmpeg does?

And when it comes to training before VR videos, I wholeheartedly agree. It's definitely best to train on regular videos first and then fit-train on the VR video.
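
One possible reason the "same settings in the opposite direction" never line up exactly can be shown with a small Python sketch. It assumes the radial model that ffmpeg documents for its lenscorrection filter, r_src = r_tgt * (1 + k1*r^2 + k2*r^4) with r normalized to half the image diagonal, and the k1 = k2 = -0.25 values quoted later in this thread. The function names are made up, and the sketch models only the coordinate mapping, not ffmpeg's pixel interpolation or whichever dewarp tool was actually used.

```python
def radial_map(r: float, k1: float, k2: float) -> float:
    """Radial polynomial: where a point at normalized radius r is sampled from."""
    return r * (1.0 + k1 * r**2 + k2 * r**4)


def dewarp_then_rewarp(r: float, k1: float = -0.25, k2: float = -0.25) -> float:
    """Apply the mapping, then apply it again with the coefficient signs flipped."""
    return radial_map(radial_map(r, k1, k2), -k1, -k2)


for r in (0.0, 0.1, 0.4, 0.8):
    back = dewarp_then_rewarp(r)
    print(f"r={r:.1f} -> {back:.4f} (error {abs(back - r):.4f})")
```

Under this model, flipping the signs of k1 and k2 gives something close to the inverse near the center of the frame, but the error grows toward the edges, which is where the stretched faces are. A closer round trip would need coefficients fitted to the actual inverse mapping instead of just negated ones.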



Re: Why you shouldn't Faceswap VR videos

Post by bryanlyon »

The solution to the "lost detail" is actually pretty simple. You just set the output of the dewarp to a much higher resolution to account for the dewarped resolution. Normally, warped images have much higher pixel density at the center, since that's where the warp is the least. You just need to set the resolution to account for the lowest quality part instead of the highest quality part.

Think of it like a globe: when flattened, the globe will always have some degree of distortion. The key is to make sure that you match the quality so that the high quality areas stay high quality.

ffmpeg "slice" is by time. It cuts out a part. Instead you'll run 2 "crops" where you crop the left side, and then crop the right side. This creates 2 separate videos, one for each eye.

If you get used to FFMPEG, you can even create a single script that does both sides at once, dewarps, and puts the video in whatever format you want to work with. Then re-creating the VR video is simply a matter of reversing each step in the process.
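
A minimal sketch of what such a single command could look like, written as Python that builds the argument list without running anything. It assumes a side-by-side source, ffmpeg's split, crop, and lenscorrection filters, and the k1/k2 values quoted later in this thread as placeholders. The function and file names are hypothetical, and this is one way to arrange the steps, not a script from this thread.

```python
from __future__ import annotations

import shlex


def split_and_dewarp_command(src: str, left_out: str, right_out: str,
                             k1: float = -0.25, k2: float = -0.25) -> list[str]:
    """One ffmpeg call that writes a dewarped video for each eye of an SBS source."""
    # The lens center is given per eye, which is why the crop comes first.
    lens = f"lenscorrection=cx=0.5:cy=0.5:k1={k1}:k2={k2}"
    graph = (
        "[0:v]split=2[l][r];"                      # duplicate the video stream
        f"[l]crop=iw/2:ih:0:0,{lens}[left];"       # left half, then dewarp
        f"[r]crop=iw/2:ih:iw/2:0,{lens}[right]"    # right half, then dewarp
    )
    return [
        "ffmpeg", "-i", src,
        "-filter_complex", graph,
        # First output: left eye, keeping the audio track if one exists.
        "-map", "[left]", "-map", "0:a?", "-c:a", "copy", left_out,
        # Second output: right eye, video only.
        "-map", "[right]", "-an", right_out,
    ]


cmd = split_and_dewarp_command("vr_sbs.mp4", "left_flat.mp4", "right_flat.mp4")
print(shlex.join(cmd))
```

The k1 and k2 values need tuning for each source video. Over-under footage would crop by height instead of width.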



Re: Why you shouldn't Faceswap VR videos

Post by cosmico »

I'm definitely going to have to try my hand at dewarping again. Hopefully that will erase the majority of the problems I described in reason 4 of my post, relating to warped faces.

I'd also like to try your method of cropping the video in two with effmpeg, and presumably using effmpeg to put it back together, but I'm a little more lost on this one. When you say "run crops" in ffmpeg, do you mean typing commands into the GUI-less ffmpeg, or is there a button/setting in faceswap that I'm not seeing, so I can simply do this all in faceswap's GUI?

If it's the commands answer, could you link me to a place, a file, or even a YouTube video with all the commands I need?



Re: Why you shouldn't Faceswap VR videos

Post by bryanlyon »

The effmpeg frontend in FaceSwap is just a tiny slice of what ffmpeg can do natively. You'll have to learn to use the CLI to unlock its full power.

A basic use of ffmpeg to crop is:

Code: Select all

ffmpeg -i in.mp4 -filter:v "crop=out_w:out_h:x:y" out.mp4
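
In the crop filter, out_w and out_h are the size of the region to keep, and x and y are the position of its top-left corner. The Python sketch below, with a made-up helper name, works those four values out for each eye of a side-by-side or over-under frame. Which eye comes first in a given file is a convention of that file, so the results are labeled first and second and not left and right.

```python
from __future__ import annotations


def eye_crops(width: int, height: int, layout: str = "sbs") -> tuple[str, str]:
    """Return crop=out_w:out_h:x:y strings for the two eyes of a stereo frame.

    layout "sbs" is side by side (eyes next to each other),
    layout "ou" is over-under (eyes stacked vertically).
    """
    if layout == "sbs":
        out_w, out_h = width // 2, height
        second_x, second_y = out_w, 0      # second eye starts halfway across
    elif layout == "ou":
        out_w, out_h = width, height // 2
        second_x, second_y = 0, out_h      # second eye starts halfway down
    else:
        raise ValueError(f"unknown layout: {layout!r}")
    first = f"crop={out_w}:{out_h}:0:0"
    second = f"crop={out_w}:{out_h}:{second_x}:{second_y}"
    return first, second


print(eye_crops(5400, 2700, "sbs"))
print(eye_crops(3840, 3840, "ou"))
```

Each returned string drops straight into the quoted part of the command above.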


Re: Why you shouldn't Faceswap VR videos

Post by cosmico »

With my 5400x2700 video, I managed to get cropping the video in half working, in non-batch form, using

Code: Select all

ffmpeg -i in.mp4 -filter:v "crop=2700:2700:0" -c:a copy outvideo2.mp4

for the left side video and

Code: Select all

ffmpeg -i in.mp4 -filter:v "crop=2700:2700:2700" -c:a copy outvideo.mp4

for the right side video.

I'm struggling to get ffmpeg to suture them back together the way they started though. The closest I got was

Code: Select all

ffmpeg -i inputvideoright.mp4 -vf "[in] pad=4*iw:ih:iw/2:0 [right]; movie=inputvideoleft.mp4, pad=2*iw:ih:iw/2:0 [left]; [left][right] overlay=main_w/2:0 [out]" -c:v libx264 -preset medium -crf 23 -acodec copy outputvideocombined.mp4

But this isn't working properly. It gets both videos into one file, but they're trimmed to be extremely tall and narrow with large black bars at the sides, and the result is off center with both videos on the right side of the frame. Any idea where I went wrong with my command, or what I should use to properly join them again?

There are three instances of "2:0" or "x:x" values in that command, and while messing with those values moved the video around, I can't find any correlation between the numbers I'm entering and what the final product ends up looking like.
I'm beyond noob at all this. I had to watch a YouTube tutorial just to get ffmpeg running, and I'm searching Reddit posts for people who tried to do the same thing and what commands they used.

EDIT:
I found a new command

Code: Select all

ffmpeg -i input1.mp4 -i input2.mp4 -filter_complex hstack output.mp4

and it seems to work perfectly. However, I noticed there's nowhere in the command for me to set the preset or the crf quality, which many of the commands I couldn't get to work did have. I don't know if I need that. The combined SBS video does look pretty nice when viewed in the Windows Movies & TV app; it just seems like a nice thing to have in case I do need to change it. Idk if I can just add "-c:v libx264 -crf 23 -preset veryfast" or any of that stuff to the end and it will still work. Like I said, no idea what I'm doing, just bumbling along.

Do you have a preferred command for dewarping/lens distortion? I found this online and was curious if you had any thoughts on it:

Code: Select all

ffmpeg -i in.mp4 -vf "lenscorrection=cx=0.5:cy=0.5:k1=-.25:k2=-.25" out.mp4


Re: Why you shouldn't Faceswap VR videos

Post by bryanlyon »

Yes, you can just add that, not to the end, but before the output video gets declared. Best advice I can give is to experiment and learn how ffmpeg works. It's really quite powerful.
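
To make the "before the output" placement concrete, here is a small Python sketch that assembles the hstack command with the encoder options from the question above. ffmpeg applies output options to the next output file named after them, so they have to come before that file name. The helper name and file names are hypothetical, and the crf and preset defaults are just the values quoted in this thread.

```python
from __future__ import annotations

import shlex


def hstack_command(left: str, right: str, dst: str,
                   crf: int = 23, preset: str = "veryfast") -> list[str]:
    """Join two per-eye videos side by side and re-encode the result with libx264."""
    return [
        "ffmpeg", "-i", left, "-i", right,
        "-filter_complex", "[0:v][1:v]hstack=inputs=2[v]",
        "-map", "[v]",
        "-map", "0:a?", "-c:a", "copy",    # keep audio from the first input if present
        "-c:v", "libx264", "-crf", str(crf), "-preset", preset,
        dst,                                # the output file name always goes last
    ]


cmd = hstack_command("left_swapped.mp4", "right_swapped.mp4", "vr_sbs_swapped.mp4")
print(shlex.join(cmd))
```

Both inputs must have the same height for hstack to work. The order of the two inputs decides which eye ends up on the left.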


ianstephens
Posts: 12
Joined: Sun Feb 14, 2021 7:20 pm
Has thanked: 1 time

Re: Challenges faceswapping VR videos

Post by ianstephens »

A very interesting post and the next fun project I will be working on (180 degrees stereoscopic).

May I ask - as it hasn't been mentioned - what methods are you guys using for dewarping/unwarping the source files?

