[Guide] Introducing - Phaze-A

Want to understand the training process better? Got tips for which model to use and when? This is the place for you


Forum rules

Read the FAQs and search the forum before posting a new topic.

This forum is for discussing tips and understanding the process involved with Training a Faceswap model.

If you have found a bug or are having issues with the Training process not working, then you should post in the Training Support forum.

Please mark any answers that fixed your problems so others can find the solutions.


Re: [Guide] Introducing - Phaze-A

Post by torzdf »

No, you cannot change these options mid-train. Once these options are set they are "fixed".




Re: [Guide] Introducing - Phaze-A

Post by zany6669 »

torzdf wrote: Tue Aug 30, 2022 10:01 am

No, you cannot change these options mid-train. Once these options are set they are "fixed".

Thanks, I suspected as much but good to have the confirmation


Re: [Guide] Introducing - Phaze-A

Post by tochan »

Hi folks,

after a long time I've come back to Faceswap ;).

I tested the Phaze-A "Disney 512" model with the default settings from the config file.

Here are some experiences:

System: Intel 12700KS, 64GB RAM, Nvidia 3090 FE.

Batch size 16 works fine (36-40 EGs/sec).

After 575,000 iterations, the model ran into a NaN error (Loss A/B around 0.03).

Same after restarting training from the 550,000 and 525,000 snapshots: NaN in the region of 575,000 iterations.

I didn't want to lose the training time, so I started some experiments with the settings that are changeable:
At this moment the model has run 630,000 iterations. I trained from the last snapshot (550,000) with "no flip" and "no warp"; training time was one night (25,000 iterations). Then I changed back to "normal" with flip and warp. The model losses went higher after the change back to normal (0.07); now it is at 0.045 and 630,000 iterations.

I'm excited to see what happens in the region of 0.03.

Does anyone else have experience with fixing a NaN model?

Sorry for my bad English skills. ;)


Re: [Guide] Introducing - Phaze-A

Post by torzdf »

NaNs are annoying, but they happen with ML.

You can read about some mitigation steps here:
viewtopic.php?t=2177



Re: [Guide] Introducing - Phaze-A

Post by martinf »

I tried the setup that had the fully shared FC, and the identity bleed was ridiculous. The progression from A to B after 60K iterations looked like I had simply applied a Gaussian blur to the original image. There was no indication that a swap was even taking place.

It's frustrating to have to poke around inside what is, for me, essentially a black box just to see what things do.


Re: [Guide] Introducing - Phaze-A

Post by MaxHunter »

tochan wrote: Wed Aug 31, 2022 6:47 am

I'm excited to see what happens in the region of 0.03.

I'm trying to get around a NaN as well. How did this experiment work out for you?


Re: [Guide] Introducing - Phaze-A

Post by HoloByteus »

I've been working on learning how to read the model summaries and doing a bit of poking-in-the-dark reverse engineering. I noticed that the default Phaze-A model seems to have been designed around the fs_original encoder and its parameters, while the StoJo preset looks to be designed for EfficientNet. I thought I'd see what I could do to lighten it up a bit on the VRAM side, since I have trouble training with it in the background.

I played with a number of parameters, but I didn't like that I was losing 20K of parameters with some changes. Another guess I've made is that, ideally, you'd want the encoder and decoder parameter counts to be as close as possible.

I elected to give the following a try with some mods to StoJo: increased scaling to 67% to get the full 256px input; disabled the G-Block to buy back some VRAM and possibly some speed; and moved to Dyn as the decoder upscale method, which achieved the encoder/decoder parameter alignment and would just seem to be more efficient. With that we have StoDyn 256.

Model: "phaze_StoDyn 256"

Code: Select all

============================================================
Layer (type)               Output Shape           Param #
============================================================
face_in_a (InputLayer)     (None, 256, 256, 3)           0
face_in_b (InputLayer)     (None, 256, 256, 3)           0

encoder (Functional)       (None, 512)            62274912

fc_a (Functional)          (None, 16, 16, 1280)   42024960
fc_b (Functional)          (None, 16, 16, 1280)   42024960

decoder_both (Functional)  (None, 256, 256, 3)    41925539
============================================================
Total params: 188,250,371
Trainable params: 188,096,499
Non-trainable params: 153,872

StoDyn 256 is clearly going to be a two-week model on a 3060... 😩 I wanted something a bit faster and good for mid-ground swaps. Here's a poke at a 128px efficient model with pretty tight parameters: StoDyn 128. Tested up to BS 24 with 12GB, although I usually stick with BS 16 to start.

Model: "phaze_StoDyn 128"

Code: Select all

============================================================
Layer (type)               Output Shape           Param #
============================================================
face_in_a (InputLayer)     (None, 128, 128, 3)           0
face_in_b (InputLayer)     (None, 128, 128, 3)           0

encoder (Functional)       (None, 1024)           41303904

fc_a (Functional)          (None, 8, 8, 640)      35738880
fc_b (Functional)          (None, 8, 8, 640)      35738880

decoder_both (Functional)  (None, 128, 128, 3)    35379619
============================================================
Total params: 148,161,283
Trainable params: 148,007,411
Non-trainable params: 153,872
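
(Aside: the fc_a/fc_b counts in the 256 summary are consistent with a single Dense layer from the 512-wide bottleneck to an 8x8x1280 tensor, followed by a parameter-free upsample to 16x16x1280 -- an assumed topology for illustration, not a documented breakdown. A quick back-of-envelope check:)

Code: Select all

# Assumed topology: Dense from the 512-wide bottleneck to 8x8x1280,
# then a parameter-free upsample (e.g. UpSampling2D) to 16x16x1280.
bottleneck = 512
fc_out = 8 * 8 * 1280                  # 81,920 units
params = bottleneck * fc_out + fc_out  # weights + biases
print(f"{params:,}")                   # 42,024,960 -- matches fc_a/fc_b above
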
Attachments
PhazeA_StoDyn.7z
Phaze-A StoDyn 128x128px & 256x256px models
(805 Bytes) Downloaded 557 times

Re: [Guide] Introducing - Phaze-A

Post by martinf »

The StoDyn 256 model is crashing here on a batch size of 3; Windows 10, 3080 Ti.

And after it crashes, the preview function of Faceswap stops working altogether until I reinstall. Not sure if this is an edge case or what.


Re: [Guide] Introducing - Phaze-A

Post by HoloByteus »

I don't think you mean FS itself is crashing, but rather that training is exiting with an OOM error, correct? I can't imagine running a model would corrupt the FS install, and BS 3 should be doable; I think I used BS 6 on a 3060/Win11 with a 1080p desktop. I did have to disable both initializers, which I usually use (they are off by default), since they also tax VRAM, and run the standard loss functions SSIM/MSE, but that's also while running a browser in the background. I can imagine previews might stop working if VRAM is fragmented or just not available, although a reboot should fix that.



Re: [Guide] Introducing - Phaze-A

Post by martinf »

Not just OOM, but a laundry list of errors involving the Keras decoder. What I did learn is that starting training back up and letting it run for a few iterations clears whatever was screwing up the preview function. Prior to doing that, the preview would hang no matter what project/model I fed it.

I've had to abandon this model for now (StoDyn 256). It would even crash with a batch size of 2. Not sure what the exact issue is. This is Win 10, a 3080 Ti, 80GB of RAM and a 5950X. I can get the errors to you if you would like; I think I still have the log for it.

Gonna send them via private message. Maybe you can take a look and sort it out.


Re: [Guide] Introducing - Phaze-A

Post by HoloByteus »

It's likely available VRAM is going to be your issue. Can you run StoJo? You could also try dropping scaling to 60 to lighten the load, making the only differences between this model and StoJo the use of Dyn as the decoder upscale method and the dropping of the G-Block.



Re: [Guide] Introducing - Phaze-A

Post by martinf »

It got to the point where anything I tried to run from Phaze-A was corrupted. StoJo gave me nothing but bright squares, then a NaN after about 1,500 iterations at a batch size of 3. Then... other models also started to crap out. I powered down, turned the power supply off for 30 seconds and restarted; other models were working again. I am not overclocking, but it appears that something got into my GPU that was persistent until I rebooted.

Steering clear of Phaze-A for the time being. There are too many mistakes to be made, it appears, and I have no idea which of the settings I change are the culprits.


Re: [Guide] Introducing - Phaze-A

Post by HoloByteus »

My first thought, if you were having issues with StoJo, was whether you were overclocking, but that appears not to be the case. If you stick with the Phaze-A defaults or presets you shouldn't have issues. If you can't run those successfully, I'd run some health checks. The funny thing with GPUs is that they can seem perfectly functional, with issues only showing when put to a heavy task, and with 12GB only the more demanding apps will use that much. It's possible there could be an issue with the FS install, the card or the drivers. It sounds like you've already tried reinstalling FS, so I'd hunt down some GPU test tools and run the gamut. It could just need a driver clean/reinstall.

Edit: There appears to be a bug with EfficientNet v1 & TF, which is set on the StoJo preset, that can cause a model to crash. You just need to switch to EfficientNet v2-S to resolve it for now. I'd already been using v2-S since it became available, so I dodged that bullet. :D



Re: [Guide] Introducing - Phaze-A

Post by Yaboyscotty »

I'm brand new to Faceswap, and I was thinking of running a Phaze-A pretrain, or whatever it would be called. I'm wondering: is it a stable model to use at 320 res? I'm coming from LIAE-UDT in DFL, so I'm not sure if training times would be comparable either.


Re: [Guide] Introducing - Phaze-A

Post by torzdf »

It's stable, but only insofar as you set up the correct settings to work at that resolution.

Output resolution is just one very small part of the pie.



Re: [Guide] Introducing - Phaze-A

Post by Yaboyscotty »

torzdf wrote: Wed Nov 16, 2022 10:20 am

It's stable, but only insofar as you set up the correct settings to work at that resolution.

Output resolution is just one very small part of the pie.

Is there a preset type of thing, equivalent to a model in DFL, that's available for download? Basically something with a set of params already enabled?


Re: Notes on Phaze A model architecture and settings

Post by Ryzen1988 »

Icarus wrote: Mon Aug 15, 2022 10:13 pm

I've been experimenting with Phaze-A for a year now using Nvidia A100 cloud GPUs, have tried a few common and one not-so-common setups, and wanted to share some of my notes on how different model architectures affect results.

split FC layer, G-Block enabled (not split), shared decoders:
This is probably the most popular setup, and it is the best choice if your A data has a lot of poses/angles that your B data lacks. The shared decoder is really good at filling in the blanks; however, there tends to be a fair amount of identity bleed.

shared FC layer, G-Block enabled (not split), split decoder:
This has produced the worst results and, in my experience, has been the only setup to cause discoloration in the forehead when the hairlines differ.

split FC layer, G-Block enabled (not split), split decoder:
This is the least common setup, but it is my personal favorite when you have a good amount of B data. It results in strikingly accurate detail and is the closest thing to actually swapping the face, with zero identity bleed. The downside of this setup appears when you don't have enough B data to fill in the blanks: the model does some frightening things when it only has the G-Block (a GAN) to fill in the blanks.

I did a few experiments with a split G-Block, but I didn't really notice any significant improvement or degradation either way...

A few notes on what I've found to be ideal settings:
Encoder: EfficientNetV2-L has been amazing, and I've noticed a huge improvement over v1. I usually try to match the scale with the output.

Bottleneck: Always go with Dense. I've tried both poolings and they result in streaks of color and poor detail. Using the 512 size has never let me down.

FC layer: Overcranking this can do more harm than good. With autoencoder models, you generally don't want this to be more detailed than the encoder feeding it. I've noticed better results with a dimension of 8 and 1280 nodes than with a dimension of 12 or 16. On that note, making this deeper (increasing the depth over 1) is unnecessary and a waste of VRAM; at least in my experience, it did nothing to improve the results and may have made them worse. A rough sense of the scaling is sketched below.
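
(For scale, a hedged back-of-envelope comparison, assuming the FC cost is dominated by a single Dense layer from a 512-wide bottleneck -- an assumption for illustration, not Phaze-A's exact layout:)

Code: Select all

# Parameters of a Dense layer from a 512-wide bottleneck to (dim x dim x 1280):
bottleneck, nodes = 512, 1280
for dim in (8, 12, 16):
    out = dim * dim * nodes
    print(dim, f"{bottleneck * out + out:,}")
# 8   42,024,960
# 12  94,556,160
# 16 168,099,840  (~4x the cost of dimension 8)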

FC dropout: I hardly ever use this, but the one time I did, it surprisingly sped up training massively (which seems counterintuitive).

Upsamplers: Since I was doing most of the training on a powerful GPU, I used subpixel for both upsamplers. I would say 512 is probably a decent number of filters; I had this at 1280 but eventually dropped it to 512 and didn't notice any degradation in results.

Decoders: Allocating more VRAM to these parameters will give you the most bang for your buck in terms of detail in the results. I noticed a huge increase in detail and quality by increasing both the first and final number of filters. If you run into VRAM issues, adjusting the slope of the filter curve (making it steeper) can save you VRAM. Adding an additional residual block (or two) also made a huge difference. I go with a kernel size of 3, but have also used the default of 5 a few times; it's hard to say if it made much of a positive difference, because other parameters were also changed.

Loss functions:
As it says in the Training Guide, the choice you make here will have an outsized impact on your entire model. I've tried them all, and a combination of MS-SSIM and MAE (L1), both at 100%, has produced the best results. The weird quirk with MS-SSIM is that whenever I've tried to start a model using it, my model crashes (which I honestly can't explain), so I usually start with SSIM then swap it out for MS-SSIM after 1k iterations. I also add a third loss function, FFL, at either 25% or 50%, and I think it has made a positive impact. I've tried LPIPS as a tertiary loss and it completely ruined everything with the moiré pattern described in the settings. I get that in theory using one of those as a supplementary loss is supposed to help, but I have no idea how much weight to give it. A sketch of the weighting idea follows.
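
(In raw Keras/TF terms, the core MS-SSIM + MAE recipe might look like this minimal sketch; Faceswap wires its losses internally and weights them via the GUI, so this illustrates the idea only, not its actual code:)

Code: Select all

import tensorflow as tf

def combined_loss(y_true, y_pred):
    # MS-SSIM at 100%: tf.image.ssim_multiscale returns similarity in [0, 1],
    # so subtract from 1 to turn it into a loss.
    ms_ssim = 1.0 - tf.reduce_mean(
        tf.image.ssim_multiscale(y_true, y_pred, max_val=1.0))
    # MAE (L1) at 100%.
    mae = tf.reduce_mean(tf.abs(y_true - y_pred))
    return ms_ssim + mae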

Mixed precision:
Last but not least, mixed precision. You love it and you hate it. It makes a huge difference in training speed and VRAM, but it is the frequent culprit behind NaNs. I did some research on Nvidia's website regarding this, and I found the holy grail of hidden information that has cured me of the downside of using it. It all comes down to the epsilon: Nvidia recommends increasing your epsilon by three orders of magnitude when training with mixed precision. So instead of the default 1e-07, I use 1e-04, and this has made a world of difference, with zero downside in terms of the model's ability to learn and, most importantly, no more NaNs.
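
(For illustration, the equivalent tweak in plain Keras -- a sketch of the idea, not how Faceswap wires it internally:)

Code: Select all

import tensorflow as tf

# Compute in float16, keep variables in float32; Keras wraps the optimizer
# in a LossScaleOptimizer when a model is compiled under this policy.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Raise Adam's epsilon from the Keras default of 1e-07 to 1e-04, per the
# Nvidia guidance mentioned above (learning rate as in the setup described).
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-4)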

These are just a few things I've noticed after experimenting a bit through trial and error. These findings are by no means scientific and would never pass peer review :P

I usually train until loss convergence, to around 600k-800k iterations, with a batch size of 8 and a learning rate of 3e-05.

I have not been on the forum for a while, but I was following my own route of testing. When I read your post I was very intrigued, because there are things like MS-SSIM that I also always prefer.
Often I don't even use an L2; when I do, it's mostly log-cosh, because it's mathematically a better option than MSE or MAE, although whether it really makes much difference in practice I don't know for certain.
But some of the conclusions I had drawn about the models were completely contradictory to yours.
So it was very fascinating to read your detailed report, and I would like to add some of my insights and some still-open questions.
My GPU is an RTX A6000, so like you I'm not really VRAM-limited, although the A100 is of course somewhat of the end boss of GPUs ;)

For me, the preferred go-to layout was the shared FC layer, G-Block enabled (not split), split decoder model.
After I read your post I re-evaluated the other setups, and I can definitely see the multiple roads to Rome, so my findings are by no means a critique.

After testing out your personal favorite, I immediately understood why that model is also very powerful, for reasons other than my choice of the shared FC:

  1. I prefer some identity bleed; it more quickly makes for a believable result. It gives less detail than your approach, although I wonder, since having only the shared FC saves a lot of parameters, whether that could partially be compensated for by using part of the savings for a "bigger" model.
    I also prefer this form of identity bleed over the identity bleed of a shared decoder, which for me always worked out less well in the final result.

  2. The G-Block is really the thing that makes this option work. Without the G-Block I get lots of luminosity inconsistencies from frame to frame, but the G-Block really seems to stabilize the temporal aspect of the lighting.
    I'm still fully in doubt, and still testing, about the effect of split vs. not split, and about what effect the number of layers in the G-Block has on results. That is really something that requires two parallel training sessions with changes only in the G-Block settings, since the effects are less pronounced.
    About the results and forehead issues: I often export a base extended mask, a somewhat soft BiSeNet-FP one, and one that is almost without any softening or erosion, sometimes one with histogram color matching, and blend the results to make the mask edges blend in. Sometimes I put the original clip on top of that as a colour layer at 50% to get the result very close to the original.

Your method with mixed precision works well, except... I found that the same thing that stops some of the other encoder options from functioning well also applies here: using AdaBelief instead of Adam. I prefer AdaBelief over Adam enough to be willing to make the trade-off that mixed precision is not great to use. (If someone has found a way to run mixed precision with AdaBelief for the whole training, I would love to know.)

Your findings about FC dimensions are the one thing I really feel contradictory about. My own testing and usage has drifted more and more towards an FC dimension of 16. The only other thing I feel strongly about is that, in combination with this, I always use the normal FS encoder and tune its depth so that the main network is:
Encoder output of 16x16x(scalable amount) - Dense layer of 512-1024 - back to 16x16x(scalable amount). Depending on how much identity bleed I want, I do 1x upsampling in the FC, or 3x upsampling if I want more identity bleed (useful with very different faces, or if you just want an acceptable result more quickly).
The 16x16xfilters output also allows for a short encoder (partially depending on input res) of 5 steps, and likewise fewer upscales in the FC upsampling or the decoder when the output is already 16x16xfilters. I really believe that this short, shallow depth produces a good result much more quickly than a deep pipeline. A rough sketch of this shape flow follows below.
I used dropout in the beginning, but it's better to use normalization, from what I read in papers; I use layer norm on the bottleneck and have started to use group norm instead of layer norm for the decoder norm.
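
(A generic Keras sketch of the shape flow described above, with assumed filter counts -- purely illustrative, not Phaze-A's actual code. Four stride-2 steps reach 16x16 from a 256px input; the five steps mentioned above correspond to a larger input:)

Code: Select all

import tensorflow as tf
from tensorflow.keras import layers

inp = tf.keras.Input((256, 256, 3))
x = inp
# Shallow encoder: 256 -> 128 -> 64 -> 32 -> 16 (filter counts assumed).
for filters in (64, 128, 256, 512):
    x = layers.Conv2D(filters, 5, strides=2, padding="same",
                      activation="relu")(x)
x = layers.Flatten()(x)
latent = layers.Dense(1024)(x)                # 512-1024 wide bottleneck
latent = layers.LayerNormalization()(latent)  # layer norm on the bottleneck
y = layers.Dense(16 * 16 * 512)(latent)
y = layers.Reshape((16, 16, 512))(y)          # back to 16x16 x filters
y = layers.UpSampling2D()(y)                  # the "1x upsampling" in the FC
model = tf.keras.Model(inp, y)
model.summary()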

I hope this contribution is useful; it's certainly cool to see the multiple gradient roads to Rome.
The possibilities with Phaze-A keep on surprising.


Re: [Guide] Introducing - Phaze-A

Post by torzdf »

Yaboyscotty wrote: Thu Nov 17, 2022 6:33 am
torzdf wrote: Wed Nov 16, 2022 10:20 am

It's stable, but only insofar as you set up the correct settings to work at that resolution.

Output resolution is just one very small part of the pie.

Is there a preset type of thing, equivalent to a model in DFL, that's available for download? Basically something with a set of params already enabled?

Yes. Presets (and how to access them) are shown in the 3rd post in this thread:
viewtopic.php?p=5367#p5367



Re: Notes on Phaze A model architecture and settings

Post by torzdf »

Ryzen1988 wrote: Thu Nov 17, 2022 9:29 am

I hope this contribution is useful; it's certainly cool to see the multiple gradient roads to Rome.
The possibilities with Phaze-A keep on surprising.

It very much is, thank you for taking the time :)

Have added this to the quick links on discussions.



Re: [Guide] Introducing - Phaze-A

Post by Ryzen1988 »

:lol: :lol: Okay, after some more thinking about my previous post and others in this thread, I realized more and more that, at least for my part, the choices I make when setting up Phaze-A are part theoretical knowledge, part experience, but also partly... as the Germans say: Fingerspitzengefühl.
Now, that last part can guide you in a direction, and towards choices, that can really work counterproductively when not checked against reality often enough.

Especially the detailed descriptions from Mr. Icarus, and some contradictions with mine, really made me wonder whether some of my preconceived notions were actually correct, especially around the unknowns regarding the G-Block that I often just have enabled by default.
Besides, Mr. Icarus uses an A100 and I only use an A6000. That's like me driving a Toyota and him driving up next to me in a monster truck :geek: It makes you question things in life :ugeek:

So I had two RTX A4000s lying around, and I wanted to do some real A/B testing without having to sacrifice my workstation.
I did some small tests with those A4000s over the last couple of days, just to check if and how things affect the result.
I picked a fairly simple clip to face-swap and a fairly limited face set with a, for me, "default" setup, trained and exported. Then I changed a single setting, trained a new network and exported.....
Training was done at a fairly low resolution, 256px in and 256px out, with fairly constrained specs for the network itself, and I did 60,000 iterations with a batch size of 8 -- of course, to avoid spending weeks on this. Yes, the face mask is ugly and not well blended; I just did an average colour and used the same settings on every export to keep consistency.
It might not represent the best that's possible, but it does make for an interesting comparison.

Don't take these as absolute results; they're just observations early in training with a very constrained network, but there are some interesting observations nonetheless.
All reference pictures are added to this post.

So my default was a simple network with just split decoders, AdaBelief, layer norm and subpixel.
Then the same as above but with the G-Block enabled; after that, the same with a split G-Block.
Also a version with separate fully connected layers, and a shared FC with a single decoder.

I also did a run of the default network with PixelShuffler instead of subpixel, and a run with Adam instead of AdaBelief.
To prevent me from taking up an entire page with lots of pics, here is a link to a Google Photos share:
couple of comparison pictures

Like I said, there were some interesting results regarding the eyes, and especially with the G-Block and facial expressions :o :? :lol: But do look at the pictures.
It at least got me to the point that I am going to re-run and do more extensive testing, because things might be happening that I would not have expected.
The A4000s are limited with their 16GB VRAM and amount of horsepower, but I think I'm going to up the res to 384 in/out and use a beefier network.
I'll let it run for 150,000 iterations, testing the previous setups, but I will include a G-Block with the shared FC + single decoder setup, and I'm also thinking of including a 2D-upsample comparison.
Lastly, I really want to do a comparison of small FC dimensions vs. large FC dimensions for the hidden layers.

Sorry for the long ramble. It will probably take some time for me to run the more extensive testing, but I hope you enjoyed looking at the funny pictures.
