At what point do faces start to 'pop' ?

Post by **JimmyBoy** » Thu Jun 11, 2020 6:08 pm

I realise this may be a difficult question to answer as many different variables are involved, but is there any guidance as to when you should see faces start to 'pop', i.e. when there is defined clarity: eyes are sharp, teeth are defined, etc.

I'm currently at 300,000 iterations and faces still look somewhat blurred. Eyes do not have sharpness and teeth are blocks of white rather than clear individual teeth. I'm referring to Original > Original and Swap > Swap images in the time-lapse.

This is using the Original model with default settings. I'm just unsure how to balance 'give it more time' vs 'my face data set is not of good enough quality'. My current losses are below.

Average since last save: face_loss_A: 0.01643, face_loss_B: 0.01709

I have 3100 A faces, and 3900 B faces.

Any guidance appreciated.

Post by **torzdf** » Fri Jun 12, 2020 8:24 am

It depends on a lot of factors: the model used, the quality of data etc. etc.

It has been a long time since I trained the Original model, but it is entirely possible that individual teeth may not ever appear on that model as it is fairly low detail.

Post by **Tekniklee** » Fri Jun 12, 2020 8:56 am

If you're at 300K and things are still fuzzy, you probably need to do some better curating. The model you're using could be the cause, but more likely it's just data quality. I'm very restricted in the models I can use because of my VRAM (4 GB), but DFL-SAE 64-bit works very well.

It is VERY important to get good quality sources. This took me months to understand fully, despite it being quite clear in the guides. You only have 256x256 pixels to work with. You can get a remarkably clear face at that resolution, but even if you are pulling faces from a 640x480 source, the face would need to fill over half the frame to get whatever maximum clarity that source offers. Anything larger gives the same resolution, but anything smaller gives you less. Many (most) faces come from 1/4 or less of the screen.
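The arithmetic behind that can be sketched roughly as follows. This is only an illustration: the function names are made up for this post, the 256 pixel training size is the figure quoted above, and "fraction of the frame" is assumed to be measured along the frame height.

```python
def min_face_fraction(frame_height: int, training_size: int = 256) -> float:
    """Fraction of the frame height a face must span to reach training_size pixels."""
    return training_size / frame_height


def fills_training_size(frame_height: int, face_fraction: float,
                        training_size: int = 256) -> bool:
    """True if a face spanning face_fraction of the frame height is at least training_size pixels tall."""
    return frame_height * face_fraction >= training_size


if __name__ == "__main__":
    # Compare a 480-line source with a 1080p source.
    for height in (480, 1080):
        needed = min_face_fraction(height)
        print(f"{height}p source: face must span {needed:.0%} of the frame height")
```

On a 480-line source the face has to cover a little over half the frame height, while on 1080p roughly a quarter is enough, which is why higher resolution sources are so much more forgiving.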

Looking at the teeth on extracted faces is a good guide, and I usually split my images into at least two training groups - Initial and Final. During initial training (the first 25-50K or so), it doesn't really seem to matter if things are a bit fuzzy, because the model is still fuzzy anyway. Variety is important. So if you can see the lines between teeth clearly, but NOT sharply, that is the minimum quality for even initial training. Delete anything less. If the lines between the teeth are sharp and clear and the areas around the eyes (iris, pupil, eye white/skin boundary, lashes) are sharp, then I put those into the Final set. Once I finish culling, I usually copy the Final set into the Initial set (creating a larger Initial set), and then later train with only the Final set. I have trained clear models with as few as several hundred images, but they need to be sharp and they need to cover all the angles and expressions. I would definitely stop your training and split each of your sets up. I ain't gonna lie, doing this is time consuming.

If you are trying to get asymmetric facial features like moles and freckles to "pop", you will need to take it one more step by creating a "BEST" training set that has at least a thousand ONLY beautiful images with your feature present. Do initial training on "okay" images for 25K, an additional 25-50K on the Final set (which isn't final anymore), and then another 50K minimum using the Best set to get the features to pop. Usually by 125K, I can get clear moles, and by 150K freckle bands should be visible. The reason for this extra step is that you need a stable model in order to perfectly register and accumulate asymmetric features. If the model is fuzzy, then by definition things are not locked in yet. When dealing with things like freckles that are only a few pixels wide at most, I usually get enough "clarity" by around 150K with the DFL-SAE 64-bit model to begin my restricted set. If you have several thousand beautiful images, of course, just throw them all into training. Most of the time, quality is an issue, and graduated training is how I deal with it. Keep in mind that you will need to turn flip off if going for asymmetric features, so your effective training set size is cut in half. But an asymmetric face is a beautiful face, and worth the effort.
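To see how the graduated stages add up, here is a small sketch that totals the iteration ranges quoted above. The stage list and function name are just for illustration, and since the Best stage is described as "50K minimum", its upper bound here is an assumption (the minimum is used for both ends).

```python
# (stage name, minimum iterations, maximum iterations) as quoted in the post.
# The Best stage is "50K minimum", so using 50K as its upper bound is an assumption.
STAGES = [
    ("Initial", 25_000, 25_000),
    ("Final", 25_000, 50_000),
    ("Best", 50_000, 50_000),
]


def cumulative_iterations(stages):
    """Return (name, running minimum total, running maximum total) for each stage."""
    low_total = 0
    high_total = 0
    totals = []
    for name, low, high in stages:
        low_total += low
        high_total += high
        totals.append((name, low_total, high_total))
    return totals


if __name__ == "__main__":
    for name, low, high in cumulative_iterations(STAGES):
        print(f"{name}: finished after {low:,} to {high:,} total iterations")
```

With those figures the Best stage finishes somewhere between 100K and 125K total iterations, which lines up with clear moles showing up by around 125K.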

Also, if you are pulling from a high quality source (Blu-ray, etc.), try setting the minimum size during extract. This is actually the diagonal, which for a 256x256 square is about 362 (you can figure it out - the square of the hypotenuse is equal to the sum of the squares of the other two sides). But that's too strict, because the times someone is taking up more than 1/4 of the screen are fairly rare. I usually set this to at least 100. What this does is ignore any faces smaller than roughly a quarter of that size (like 70x70ish). So on a 1080p source, something like 1/10th screen height. That's pretty small even then, but sometimes there are some really good expressions that are good for Initial training.
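If you want to check the diagonal numbers yourself, a quick sketch is below. The helper names are made up for illustration, and it assumes the minimum size value is the diagonal of a square face box, as described above.

```python
import math


def diagonal_of_square(side: float) -> float:
    """Diagonal of a square with the given side length (Pythagoras)."""
    return math.hypot(side, side)


def side_from_diagonal(diagonal: float) -> float:
    """Side length of a square with the given diagonal."""
    return diagonal / math.sqrt(2)


if __name__ == "__main__":
    print(f"256x256 face -> diagonal of about {diagonal_of_square(256):.0f}")
    print(f"minimum size 100 -> faces of about {side_from_diagonal(100):.0f} px per side")
```

A full 256x256 face has a diagonal of about 362, and a minimum size of 100 corresponds to faces of about 71 pixels per side.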

There is a school of thought that says just extract everything, sort by face, cut off the ends, sort by blur, cut off the ends again, and then train on everything else. IRL that usually sucks, especially with other-than-perfect source material. Don't get discouraged.

Faceswap Forum
