Yay, new stuff: a vision transformer, ClipV

Want to understand the training process better? Got tips for which model to use and when? This is the place for you.


Forum rules

Read the FAQs and search the forum before posting a new topic.

This forum is for discussing tips and understanding the process involved with Training a Faceswap model.

If you have found a bug or are having issues with the Training process not working, then you should post in the Training Support forum.

Please mark any answers that fixed your problems so others can find the solutions.


Post by Ryzen1988 »

Funny timing: just yesterday I was thinking about writing a long post about how nothing is happening in AI land for deepfakes, still the same old CNNs.
In that post I was going to make some hopeful predictions about vision transformers maybe becoming a cool thing, based on the Stable Diffusion stuff.
OK, I didn't post it, and today I updated Faceswap and, holy moly, there is a ClipV encoder.

Awesome, anyone already experimenting with it?

Will update, of course, once I have done some runs with it.

-Update 1
It really runs like a charm. I did some reading into it and clipv_farl-B-16-64 seemed like a really well-optimized choice to begin with.
It trains incredibly stably and gives quick results.
I have not tried the exact presets yet because I created an IAE-style model for the first test.
I have noticed that, for the first time, I can use mixed precision with AdaBelief with the epsilon exponent at -16. Maybe it's because of the transformer or some other update, but I love it.
New favorite encoder.

Attachment: face_out_0_a - face_out_0_b_20230805_182431.png

Post by torzdf »

Glad you are liking it. @bryanlyon did the original implementation. I then fully reworked it to work with Phaze-A and to train correctly.

I would like to be able to include the ResNet based version of ClipV as well as ViT, but I can't seem to find the correct way to port the weights for the final layer, so that is currently disabled (although the code remains within Faceswap for if/when I manage to resolve that issue).

The presets are not fully tested beyond confirming that they work, and are more there as a demonstration of structure (because ClipV outputs a 512-sized embedding, a bottleneck is not necessary... it already does that for you... I don't know if adding a bottleneck as well would help... maybe, but probably not).

The biggest advantage for me is that I can fairly easily train 448px models on an 11GB card with the full-sized encoder.

Oh, and like you, I have managed to train this a long way so far with mixed precision, without a NaN in sight. I just hope it stays that way.


My word is final


Post by Ryzen1988 »

I have found a couple of cool things.
The Dyn presets have a 512 bottleneck size, 0 FC depth (fully connected) and an FC dimension of 1 (reshape), so they basically take the bottleneck output, reshape it to 1 x 1 x 512 and carry on from there.
If you take any of the Dyn presets, replace the encoder with clipv_farl-B-16-64 at 100% encoder scaling.
Remove the bottleneck by selecting 'flatten', so it just passes the 512 output of the encoder straight to the reshape layer and then does what the Dyn normally does (a rough sketch of the resulting shape flow is below).
It is amazingly lightweight and super speedy to train! Really try it out.
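
Rough Keras-style sketch of that shape flow (not the actual Phaze-A code; the filter schedule here is just an assumption for illustration): a 512-wide embedding reshaped to 1 x 1 x 512 and then repeatedly upsampled.

Code:

import tensorflow as tf
from tensorflow.keras import layers

embedding = layers.Input(shape=(512,))               # ClipV-style 512 embedding
x = layers.Reshape((1, 1, 512))(embedding)           # 1 x 1 x 512, as the Dyn presets do
for filters in (512, 512, 256, 128, 64, 32, 16):     # illustrative filter schedule only
    x = layers.UpSampling2D()(x)                     # 1 -> 2 -> 4 -> ... -> 128
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
output = layers.Conv2D(3, 3, padding="same", activation="sigmoid")(x)   # 128 x 128 x 3 face
toy_decoder = tf.keras.Model(embedding, output)
toy_decoder.summary()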

With ClipV I always enable 'load weights' for the encoder, but I also freeze the encoder until the faces become clear, and only then continue training with it unfrozen (sketch below).
In fact you can take any preset and swap its encoder for ClipV; some work better than others, but the Dyn seems perfect for it.
Normally I always use various bottleneck options, layer normalization, dropout or decoder gaussian, but since toying around with CLIP models I have noticed that the fastest way to train is to leave all of that by the wayside. I do use AutoClip, but other than that I can train with AdaBelief + mixed precision, even at small batch sizes of around 4-8.
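
A minimal Keras-style sketch of the freeze-then-unfreeze idea (in Faceswap itself this is done via the load/freeze weights options; the model objects and weights file below are placeholders, not Faceswap internals).

Code:

import tensorflow as tf

def two_phase_training(encoder, autoencoder, train_data, weights_path="clipv_encoder.h5"):
    # Phase 1: load pretrained encoder weights and train only the decoder side
    encoder.load_weights(weights_path)
    encoder.trainable = False
    autoencoder.compile(optimizer=tf.keras.optimizers.Adam(1e-5), loss="mae")
    autoencoder.fit(train_data, epochs=5)

    # Phase 2: once faces look clear, unfreeze and keep training end to end
    encoder.trainable = True
    autoencoder.compile(optimizer=tf.keras.optimizers.Adam(1e-5), loss="mae")  # recompile after changing trainability
    autoencoder.fit(train_data, epochs=20)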


Post by rbanfield82 »

What Learning Rate and Loss Functions are you using? The training speed is crazy with this setup.


Post by Ryzen1988 »

torzdf wrote: Sat Aug 05, 2023 6:12 pm

I would like to be able to include the ResNet based version of ClipV as well as ViT, but I can't seem to find the correct way to port the weights for the final layer, so that is currently disabled (although the code remains within Faceswap for if/when I manage to resolve that issue).

I can't find many references about this. Is this a CNN/transformer hybrid, or just a version of CLIP and ViT trained on another dataset?
I see a lot of papers testing different hybrids of CNNs and various transformer elements.

Then again, papers always promise the world, but real life is often less than ideal :lol:

rbanfield82 wrote: Fri Aug 11, 2023 12:43 pm

What Learning Rate and Loss Functions are you using? The training speed is crazy with this setup.

I always use AdaBelief, with the default learning rate exponent of -5 and the epsilon exponent at -16.
AdaBelief is an optimizer that adapts its step size, so where Adam's learning rate has to be adjusted at different stages of training to perform optimally, AdaBelief has its own corrective mechanism for this (rough sketch of the update rule below).
I am really amazed how few people use this optimizer compared to Adam, because it is just more stable/better/faster.
The downside was that using it with mixed precision was troublesome, but that seems to be fixed nowadays, at least with ClipV.
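
For anyone curious, here is a rough NumPy sketch of the published AdaBelief update rule (not Faceswap's implementation). The difference from Adam is that the second moment tracks (grad - m)^2, the "belief" in the gradient, instead of grad^2: when the gradient matches its prediction, the step gets larger.

Code:

import numpy as np

def adabelief_step(param, grad, m, s, t, lr=1e-5, beta1=0.9, beta2=0.999, eps=1e-16):
    """One AdaBelief update; m and s are the running first moment and 'belief'."""
    m = beta1 * m + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * (grad - m) ** 2 + eps   # deviation of grad from its prediction
    m_hat = m / (1 - beta1 ** t)                          # bias correction, as in Adam
    s_hat = s / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(s_hat) + eps)   # small s -> trust the gradient -> bigger step
    return param, m, s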

For loss functions I always use MS-SSIM with log-cosh at 100%.
Log-cosh behaves like a combination of MAE and MSE: roughly MSE for small errors and MAE for large ones (see the sketch below).
Later in the training process I often add LapLoss at 50-100%. I have not done many direct A/B comparisons, but my gut feeling is that it helps in the later stages of training.
I keep reading about people also using FFL, so I have just started trying it, but of course it will take a couple of long sessions before I can form any opinion about it.
Eye & mouth priority I usually do with MAE and just a 2x multiplier.
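
Quick sketch of why log-cosh sits between MSE and MAE (generic NumPy, nothing Faceswap-specific): log(cosh(x)) is approximately x^2/2 for small errors and approximately |x| - log 2 for large ones.

Code:

import numpy as np

def logcosh_loss(y_true, y_pred):
    """Log-cosh loss: quadratic near zero, linear for large errors."""
    return np.mean(np.log(np.cosh(y_pred - y_true)))

print(logcosh_loss(0.0, 0.01), 0.01 ** 2 / 2)        # ~5e-05 vs 5e-05     -> MSE-like
print(logcosh_loss(0.0, 10.0), 10.0 - np.log(2))     # ~9.3069 vs 9.3069   -> MAE-like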


Post by torzdf »

Ryzen1988 wrote: Fri Aug 11, 2023 7:20 pm
torzdf wrote: Sat Aug 05, 2023 6:12 pm

I would like to be able to include the ResNet based version of ClipV as well as ViT, but I can't seem to find the correct way to port the weights for the final layer, so that is currently disabled (although the code remains within Faceswap for if/when I manage to resolve that issue).

I can't find many references about this. Is this a CNN/transformer hybrid, or just a version of CLIP and ViT trained on another dataset?
I see a lot of papers testing different hybrids of CNNs and various transformer elements.

Then again, papers always promise the world, but real life is often less than ideal :lol:

It's an alternative vision backbone for CLIP. Some weights use a modified ResNet, some use ViT. Both exist in the original implementation:
https://github.com/openai/CLIP/blob/a1d ... del.py#L94

My word is final


Post by Ryzen1988 »

torzdf wrote: Sat Aug 05, 2023 6:12 pm

Glad you are liking it. @bryanlyon did the original implementation. I then fully reworked it to work with Phaze-A and to train correctly.
The presets are not fully tested beyond confirming that they work, and are more there as a demonstration of structure.

I have come around to training with the default presets, and they seem to work like a charm as well. I really like the CLIP/ViT magic sauce.
I always cheap out and go for hybrid upsampling, but every time I just use subpixel it feels like it is a bit better than all the VRAM trade-off options. :geek:

I was wondering about something regarding the presets, purely out of curiosity.
With the two smaller presets of 128 and 256 px, the dense layer outputs 4 x 4 x 1024, an upsample layer takes it to 8 x 8 x 512, and the decoder continues down from there.
With the 448 px preset, the dense layer outputs 7 x 7 x 384, upsamples to 14 x 14 x 1024, and the decoder continues down from there (a shape sketch of both layouts is below).

What was the reasoning for having the two smaller presets go wide out of the dense layer and then decrease the filters as the resolution increases, while the large preset goes narrow out of the dense layer with relatively few filters and then upsamples to a higher resolution and a higher filter count?
It just caught my eye: apart from the scale, the presets are very similar, yet in this part they are so different.
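
For reference, a Keras shape sketch of the two layouts being compared (illustrative only; the layer choices are assumptions, not the actual preset code).

Code:

from tensorflow.keras import layers, Input, Model

def head(reshape_to, up_filters):
    """Dense 'bottleneck' output -> reshape -> one upsample block."""
    inp = Input(shape=(512,))                          # ClipV embedding
    x = layers.Dense(reshape_to[0] * reshape_to[1] * reshape_to[2])(inp)
    x = layers.Reshape(reshape_to)(x)
    x = layers.UpSampling2D()(x)
    x = layers.Conv2D(up_filters, 3, padding="same", activation="relu")(x)
    return Model(inp, x)

head((4, 4, 1024), 512).summary()    # 128/256 px presets: 4x4x1024 -> 8x8x512
head((7, 7, 384), 1024).summary()    # 448 px preset:      7x7x384  -> 14x14x1024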


Post by torzdf »

Honestly? I gave the presets nearly zero thought. I just wanted to get some presets in with the new encoder that would scale across differing amounts of VRAM and that helped demonstrate not needing the 'bottleneck' dense layer.

I ultimately just checked that they would start resolving within the required VRAM and left it at that. Porting the transformers was quite the undertaking, so by the time I was making presets I was fairly burned out on it.

My word is final


Post by Ryzen1988 »

Haha cool, thanks for the honesty. :D I will remain a little tweaker and tinkerer. :geek:
I am now training with a 7 x 7 x 512 reshape, upscaling to 14 x 14 x 512, with the decoder starting from 512 filters. It looked more elegant, but I was slightly in doubt whether there were any die-hard fundamentals I was missing.


Post by torzdf »

Sorry, there is a bit more logic to it than I first said....

Ultimately, the 128 and 256 versions are fairly easy to do the upsamples for (1-2-4-8-16-32-64-128-256).

I wanted the final preset to output at 2x the input resolution (so 448px output). This is trickier, as the base number is 7 (7-14-28-56-112-224-448), so you don't have a lot of flexibility in how you structure the output from the bottleneck.

Ultimately you need to have the reshaped dimensions be correct out of the bottleneck if you don't want to do some pretty nasty jiggery-pokery reshaping in the decoder.

When shaping the bottleneck output, you are trying to avoid blowing the parameter count out too much. With 4x4x1024 you are looking at 16,384 nodes. For the version that ends at 448px, you are forced to have 7x7 as the minimum dimension size. If I were to leave the depth at 1024, there would be a huge parameter blowout (7x7x1024 = 50,176, or just over 3 times as many nodes as the 4x4 version).

My VRAM is limited to 11GB, so I wanted to keep the number of nodes as close as possible to the smaller versions. Even 7x7x512 would be 25,088 nodes, about 1.5 times as many as the 4x4 version. 7x7x384 brought me to a happier medium (18,816 nodes; quick arithmetic check below).

Also (7 x 7 x 384) / 448 = 42. Does this matter? Probably not, but I liked it.
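
The node counts above check out (quick arithmetic only).

Code:

print(4 * 4 * 1024)          # 16384 -> the 128/256 px presets
print(7 * 7 * 1024)          # 50176 -> just over 3x the 4x4 version
print(7 * 7 * 512)           # 25088 -> about 1.5x the 4x4 version
print(7 * 7 * 384)           # 18816 -> the chosen 448 px layout
print(7 * 7 * 384 / 448)     # 42.0  -> "does this matter? probably not"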


My word is final


Post by Ryzen1988 »

torzdf wrote: Mon Aug 14, 2023 11:48 pm

Also (7 x 7 x 384) / 448 = 42. Does this matter? Probably not, but I liked it.

Haha very nice.

On another note, I have now been training my third version with the 448 preset, with some differences in the FC and decoder dimensions, but they all seem to reach a point in training where not much improves anymore. From around 100,000+ iterations the loss curve seems to flatten and not much progress is visible in the quality of the results any more. Has anyone else come across this problem? And if so, what was the solution?
It could be that I am too impatient, of course.


Post by torzdf »

I hit a similar issue (progressed fast, then seemed to struggle with drilling down on detail). However, I'm not sure, at this stage, whether this is a problem with ClipV, whether adding a 512 dense layer may help, or whether it is something else to do with my model structure.

My word is final


Post by Ryzen1988 »

Good to know. I will also experiment to find a way around it. I already tried a 7x7x768, with one upsample to 14x14x768, then going into the decoder at 768 filters and coming back down to 64 at the end.
I noticed that the result became somewhat blocky, if that is even a term; it still felt somewhat low-res.
Currently I am trying a new one with the Clip336 -> 8x8x768, one upsample to 16x16x768, and from there into the decoder at 768 filters ending at 96, with an output resolution of 512.
Maybe the 512-sized output of the small CLIP encoder is too little information for 448px with high detail?

On a side note, I see the trap I always fall into myself: you start to think the reason might be the bottleneck, so you start thinking bigger, more = better :lol: , and then after a while you start to go, 'this training time is not really practical'. :geek:


Post by torzdf »

Well, my logic is... all things being equal, the decoder doesn't change; it is only the encoder that is different. Now, we may need to rework the decoders to better handle the output from the transformer; however, that is outside the scope of what we can change within Phaze-A. The encoder is also fixed, so realistically all we can really play with here is the bottleneck/fully-connected layers.

My word is final


Post by Ryzen1988 »

After doing a shitload of experiments I have found that taking the Dyn preset, switching out the encoder for ClipV and removing the bottleneck is a really solid approach.
Especially if you freeze the encoder at the beginning, it is super lightweight and trains very rapidly.
This seems, at least for me, the more reliable route to success; I tried various things with scaling the latent layers on the normal preset but never seemed to reach a point I was satisfied with.

Attachment: faces 2.JPG

This is at around 100,000 iterations, but the fidelity seems to stall here.
So my next attempt is to take the dyn512 preset and combine it with the clipv336 encoder, to maybe get more detail out of it.
In doing that I am also increasing the filters of the preset to 768 to match the encoder output.


Post by rbanfield82 »

Ryzen1988 wrote: Thu Aug 24, 2023 8:41 am

After doing a shitload of experiments..

Keep going with the testing Ryzen!


Post by ianstephens »

I am also struggling to achieve more focus and clarity after a certain point. However, losses are still dropping fairly nicely, so I'm not too concerned yet. Hell, I guess I'm just used to EfficientNetV2!

I will also be playing about with things and will report back!


Post by MaxHunter »

So what is the difference between the cv-l (regular) version and the cv-l 448 version? In which instance would I use one over the other?

I ask because I'm about to start a new Max-512 model and would like to try it out but I'm unsure which I should use.


Post by Ryzen1988 »

The quickest way to do a 512 model while retaining the preset is to increase the output size to 512 and change the FC dimensions from 7 to 8.
The difference with the 448 preset is that the FC dimensions then output 8 x 8 x 384, keeping the filters low to prevent the parameter count getting excessive, and then enlarge to 1024 filters before going stepwise down.
The smaller presets are 4 x 4 x 1024, upsample to 8 x 8 x 512 and then go stepwise down.
But you can play with it; my idea for a 512 was to do 8 x 8 x 512 -> upsample to 16 x 16 x 512 and then go stepwise down in the decoder (quick arithmetic check below).
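
Quick check on how the 7 -> 8 change plays out (assumed arithmetic only): 8 doubles cleanly to 512 in six upsamples, and the FC output sizes compare like this.

Code:

res = 8
while res < 512:
    res *= 2                 # 8 -> 16 -> 32 -> 64 -> 128 -> 256 -> 512
print(res)                   # 512, reached in six upsamples from an 8x8 start
print(8 * 8 * 384)           # 24576 nodes for an 8x8x384 FC output
print(8 * 8 * 512)           # 32768 nodes for the 8x8x512 idea above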

As always there is no single optimal setting; everything influences everything. :geek: :D Life.
Up until now I have failed to get enough sharpness/detail out of the clipv_farl-B-16-64 encoder to be really satisfied with a 512 output, but please try it and share if you are successful.
The clipv_vit-l-14-336px is a much bigger encoder, but it really puts a drag on VRAM and processing usage.


Post by Ryzen1988 »

So, reading the FaRL GitHub repo, where the ClipV FaRL transformer models come from, they also seem to have a 'state-of-the-art face alignment and face parsing model'.

Could this possibly be a good next-gen solution to replace FAN, S3FD and BiSeNet-FP combined?

I have noticed that with long content, say 15 minutes, I still need to spend a lot of time manually correcting alignments: it just randomly misses the face in frames that seem pretty much the same as the frame before or after.
I could see why ClipV would adapt well to face alignment/parsing.
