Potential VRAM Saving techniques

Want to understand the training process better? Got tips for which model to use and when? This is the place for you


Forum rules

Read the FAQs and search the forum before posting a new topic.

This forum is for discussing tips and understanding the process involved with Training a Faceswap model.

If you have found a bug or are having issues with the Training process not working, then you should post in the Training Support forum.

Please mark any answers that fixed your problems so others can find the solutions.

torzdf
Posts: 1855
Joined: Fri Jul 12, 2019 12:53 am
Answers: 136
Has thanked: 80 times
Been thanked: 376 times

Potential VRAM Saving techniques

Post by torzdf »

[Topic split from: viewtopic.php?t=2058]

I have done a not inconsiderable amount of testing around NaNs recently, really hoping I would find a solution, but I have drawn a blank. Where they get introduced in a model is inconsistent, although they nearly always seem to appear at some point in the forward pass, which clipping would not be able to help with.

The sad fact is that Mixed Precision does increase the chance of NaNs; that is just the nature of using a more limited numerical range. My focus, more recently, has been on looking at other ways to reduce VRAM usage, to make Full Precision training easier to enable, but Tensorflow is making this particularly hard for me. Methods I could easily introduce in TF1.x have been totally disposed of since TF2.x, and implementing any of them is somewhere between very difficult and impossible. If I could start this project again I would, undoubtedly, use PyTorch, where implementing these kinds of features is a LOT easier. TF devs just don't appear to care (about this and a myriad of other issues).

The other factor is that bigger and more complex models need lower learning rates. The latest models I'm developing need to start at between 1e-5 and 3e-5. That is also just a matter of fact.

Yet another factor feeding into this is that learning rate should scale with batch size. If you lower the batch size, you should lower the learning rate. This is kind of logical, as smaller batch sizes mean that outliers will have a larger effect on gradients. As models get larger and more complex, lower batch sizes are unavoidable. There has been some recent research around this, which I have not yet had an opportunity to fully digest: https://arxiv.org/abs/2006.09092
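As a rough illustration of the scaling described above - a minimal sketch, with illustrative numbers rather than Faceswap defaults (the square-root variant is an alternative some of the literature prefers for small batches):

```python
def scaled_lr(base_lr, base_batch, new_batch, rule="linear"):
    """Scale a learning rate when the batch size changes.

    "linear" follows the common linear-scaling heuristic (learning
    rate proportional to batch size); "sqrt" is a gentler variant.
    """
    ratio = new_batch / base_batch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * ratio ** 0.5
    raise ValueError(f"unknown rule: {rule}")

# A model tuned at 5e-5 for batch size 16, dropped to batch size 4:
print(scaled_lr(5e-5, 16, 4))          # 1.25e-05
print(scaled_lr(5e-5, 16, 4, "sqrt"))  # 2.5e-05
```

Either way, the direction is the same as described above: a smaller batch means a smaller learning rate.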

My word is final


ianstephens
Posts: 97
Joined: Sun Feb 14, 2021 7:20 pm
Has thanked: 11 times
Been thanked: 7 times

Re: AutoClip: Any Feedback From Users?

Post by ianstephens »

torzdf wrote: Mon Jul 18, 2022 1:47 am

I have done a not inconsiderable amount of testing around NaNs recently, really hoping I would find a solution, but I have drawn a blank. Where they get introduced in a model is inconsistent, although they nearly always seem to appear at some point in the forward pass, which clipping would not be able to help with.

Yes, I also believed the NaN appeared before the clipping processor had a chance to deal with it.

torzdf wrote: Mon Jul 18, 2022 1:47 am

The sad fact is that Mixed Precision does increase the chance of NaNs; that is just the nature of using a more limited numerical range. My focus, more recently, has been on looking at other ways to reduce VRAM usage, to make Full Precision training easier to enable, but Tensorflow is making this particularly hard for me. Methods I could easily introduce in TF1.x have been totally disposed of since TF2.x, and implementing any of them is somewhere between very difficult and impossible. If I could start this project again I would, undoubtedly, use PyTorch, where implementing these kinds of features is a LOT easier. TF devs just don't appear to care (about this and a myriad of other issues).

I think this idea is the solution, though. Being able to use the full 32-bit range would seem to be the answer. As you say, reducing VRAM usage is the key stumbling block. There is no way we could run the models we do, in their current format, without mixed precision. Even with the behemoth 3090 and 24GB of VRAM, we have to run at batch sizes of 4-5 on these large models. Would it be a mammoth exercise to port over to PyTorch?

torzdf wrote: Mon Jul 18, 2022 1:47 am

The other factor is that bigger and more complex models need lower learning rates. The latest models I'm developing need to start at between 1e-5 and 3e-5. That is also just a matter of fact.

That's interesting. What do you set for your EE on these tests? Also, do you modify/increase the learning rate once the model becomes somewhat "stable"?

torzdf wrote: Mon Jul 18, 2022 1:47 am

Yet another factor feeding into this is that learning rate should scale with batch size. If you lower the batch size, you should lower the learning rate. This is kind of logical, as smaller batch sizes mean that outliers will have a larger effect on gradients. As models get larger and more complex, lower batch sizes are unavoidable. There has been some recent research around this, which I have not yet had an opportunity to fully digest: https://arxiv.org/abs/2006.09092

Yes, as mentioned above - even with a 3090 and 24GB VRAM we have to use stupidly low batch sizes (4-5) on these large models. I will take a look at the link you provided.

Happy to donate some extra $ in order to port over to PyTorch :lol:

Aside, the project is fab, and thank you for all you do.

P.s. still loving EfficientNetV2 :D


torzdf
Posts: 1855
Joined: Fri Jul 12, 2019 12:53 am
Answers: 136
Has thanked: 80 times
Been thanked: 376 times

Re: AutoClip: Any Feedback From Users?

Post by torzdf »

ianstephens wrote: Mon Jul 18, 2022 9:15 pm

Would it be a mammoth exercise to port-over to PyTorch?

Yes, it would be huge. So much so that I would rather be buried in the quagmire of TF code than even attempt it at this stage.

ianstephens wrote: Mon Jul 18, 2022 9:15 pm

That's interesting. What do you set for your EE on these tests? Also, do you modify/increase the learning rate once the model becomes somewhat "stable"?

-5 for EE, although given what you have said -4 may be better. I just hate the idea of losing accuracy.

I never increase learning rate, only ever decrease it. Increasing it again doesn't really make sense in the realms of ML.

ianstephens wrote: Mon Jul 18, 2022 9:15 pm

Yes, like mentioned above - even with a 3090 and 24GB VRAM we have to use stupidly low batch sizes (4-5) on these large models. I will take a look at the link you provided.

Well, I have now added something which should save some VRAM, although not as much as Mixed Precision does, so it may not solve your issue.

I mostly added this for single GPU use, but it is a valid strategy for Multi-GPU use too. The current Multi-GPU (Mirrored) strategy copies the variables to all connected GPUs. This is faster as each GPU has its own copy and just updates locally.

Central Storage Strategy (https://www.tensorflow.org/api_docs/pyt ... geStrategy) is a technique which keeps the variables on a single device and updates them centrally. I have forced this to store variables on the CPU. This will free up some VRAM. Not a huge amount, but enough for me to raise the batch size a little in local testing.

The downside is that it is not compatible with Mixed Precision training. For the aims that I am trying to achieve that is not necessarily a bad thing, but it's worth bearing in mind.

For single-GPU use, I have actually hacked it so that it will support Mixed Precision training, but for Multi-GPU it's definitely a no-no for now. You may want to test this out (if you can get your model loaded at full precision), because it definitely does save some VRAM, and the increased overhead of host <-> GPU copies does not seem too bad in my limited testing. It also doesn't seem to trigger the strange loss-printing errors (forcing a crash) that Mirrored strategy introduces.
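For reference, the strategy being described maps roughly onto the following TF2 configuration - a minimal sketch, not Faceswap's actual code, with build_model() standing in as a placeholder for any Keras model:

```python
import tensorflow as tf

# CentralStorageStrategy keeps variables on one parameter device and
# applies updates centrally; pinning that device to the CPU frees the
# VRAM the variables (and optimizer state) would otherwise occupy,
# at the cost of extra host <-> GPU copies each step.
strategy = tf.distribute.experimental.CentralStorageStrategy(
    compute_devices=["/gpu:0"],   # where the forward/backward passes run
    parameter_device="/cpu:0",    # where the variables live
)

with strategy.scope():
    model = build_model()         # placeholder: any Keras model
    model.compile(optimizer="adam", loss="mae")
```
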

I have another technique for limiting VRAM usage in development at the moment, but this is a lot more involved and is likely to take me much longer to implement (if I manage to do so at all). However, it should lead to vastly improved VRAM usage at the cost of speed.

My word is final


ianstephens
Posts: 97
Joined: Sun Feb 14, 2021 7:20 pm
Has thanked: 11 times
Been thanked: 7 times

Re: Potential VRAM Saving techniques

Post by ianstephens »

So I've been playing with some models, different settings, etc.

I've been using your new DNY models, as you are familiar with these, which gives a better frame of reference when reporting back.

Testing on a single 3090 (24GB VRAM), I am able to load your 1024 px DNY model @ a batch size of 4 max at FP. I haven't tried MP but of course, it would be higher. I am (also) trying to stay away from MP due to the NaNs. I am running an EE of 7 with a learning rate of 2e-05 for this particular model - would you agree?

I have also been playing with "central storage". While it does enable me to increase the batch size ever so slightly, I am not a fan of it, as I've noticed at least a 40-50% performance (speed) hit when it comes to iteration times.

The transfers between CPU <--> GPU are far too slow compared with the GPU having the data right there in memory. I don't think it's worth the performance hit for the small increase in batch size. However, if the increase in batch size were quite a bit more substantial, I think the performance hit would be worth it.

On a side note, when running your DNY model @ 1024px we get the following error in the log window:

Code:

DecompressionBombWarning: Image size (118638000 pixels) exceeds limit of 89478485 pixels, could be decompression bomb DOS attack.

torzdf wrote:

I have another technique for limiting VRAM usage in development at the moment, but this is a lot more involved and is likely to take me much longer to implement (if I manage to do so at all). However, it should lead to vastly improved VRAM usage at the cost of speed.

I will look forward to this and it sounds fab :D

If you need me to test anything at my end then feel free to ask. I always pop in every few days to catch up :)

As a side note - the new preview window (I think matplotlib) is problematic. It's fine initially, but after 60,000 iterations it's so laggy it's unusable. We have to constantly switch focus to/from it to get a preview. It's just so laggy after many iterations. We have resorted back to using the preview windows directly in the FS GUI. Perhaps some garbage collection function is missing - I don't know - you are the expert :)


torzdf
Posts: 1855
Joined: Fri Jul 12, 2019 12:53 am
Answers: 136
Has thanked: 80 times
Been thanked: 376 times

Re: Potential VRAM Saving techniques

Post by torzdf »

ianstephens wrote: Fri Jul 22, 2022 8:13 pm

Testing on a single 3090 (24GB VRAM), I am able to load your 1024 px DNY model @ a batch size of 4 max at FP. I haven't tried MP but of course, it would be higher. I am (also) trying to stay away from MP due to the NaNs. I am running an EE of 7 with a learning rate of 2e-05 for this particular model - would you agree?

That should be fine. If you do train the model to any distance and have anything you can share, I would certainly be interested to see. Sadly I can't give up the compute time whilst I work on other things.

ianstephens wrote: Fri Jul 22, 2022 8:13 pm

I have also been playing with "central storage". While it does enable me to increase the batch size ever so slightly, I am not a fan of it, as I've noticed at least a 40-50% performance (speed) hit when it comes to iteration times.

That's not so great. I had hoped the performance hit would not be that high. I guess it's more useful for edge cases, to load models which would otherwise not be loadable (a measure of last resort).

ianstephens wrote: Fri Jul 22, 2022 8:13 pm

However, if the increase in batch size were quite a bit more substantial, I think the performance hit would be worth it.

The current solution I'm looking at will be more in this area... if it is even implementable in TF2.0. Rest assured it is something I am looking at, but I am still not convinced I will be able to implement it. Time will tell.

ianstephens wrote: Fri Jul 22, 2022 8:13 pm

On a side note, when running your DNY model @ 1024px we get the following error in the log window:

Code:

DecompressionBombWarning: Image size (118638000 pixels) exceeds limit of 89478485 pixels, could be decompression bomb DOS attack.

Thanks. Will look into this when I get a second. It is just a warning, so can probably be ignored. I may look to intercept it and output a more useful warning along with the actions a user can take. Basically, at the default preview display you are getting a 12,288x7,168px image, which is, frankly, insane. I would advise reducing the number of images in the preview display (this can be changed in Train Settings > Trainer > Number of preview images).
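For context, that warning comes from Pillow, whose documented default pixel ceiling works out as below. The arithmetic is purely illustrative (Pillow's `Image.MAX_IMAGE_PIXELS` can also be raised in code, though reducing the preview count is the saner fix):

```python
# Pillow emits DecompressionBombWarning when an image's pixel count
# exceeds Image.MAX_IMAGE_PIXELS.  The documented default is:
PIL_DEFAULT_LIMIT = int(1024 * 1024 * 1024 // 4 // 3)
print(PIL_DEFAULT_LIMIT)                  # 89478485 -- matches the log line

logged_pixels = 118_638_000               # from the warning above
print(logged_pixels > PIL_DEFAULT_LIMIT)  # True: the preview canvas trips it

# Halving the number of preview images roughly halves the canvas area,
# which would bring it back under the limit:
print(logged_pixels // 2 < PIL_DEFAULT_LIMIT)  # True
```
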

ianstephens wrote: Fri Jul 22, 2022 8:13 pm

As a side note - the new preview window (I think matplotlib) is problematic. It's fine initially, but after 60,000 iterations it's so laggy it's unusable. We have to constantly switch focus to/from it to get a preview. It's just so laggy after many iterations. We have resorted back to using the preview windows directly in the FS GUI.

Thanks, will look at this too. I don't generally use the pop-up preview (I use the GUI Preview, and save out images on occasion when I want to examine more closely). These kind of issues which don't come up until a period of time has passed tend to get missed.

My word is final


ianstephens
Posts: 97
Joined: Sun Feb 14, 2021 7:20 pm
Has thanked: 11 times
Been thanked: 7 times

Re: Potential VRAM Saving techniques

Post by ianstephens »

A quick one... and I will reply to this post above very soon... however...

What input image sizes do you use for your DNY 1024px model? We have been extracting at 1024px - agree?


torzdf
Posts: 1855
Joined: Fri Jul 12, 2019 12:53 am
Answers: 136
Has thanked: 80 times
Been thanked: 376 times

Re: Potential VRAM Saving techniques

Post by torzdf »

No, I would go higher. I have the maths somewhere....

Basically, if you are training on Face Centering at 100% coverage, you'd want to multiply the model input size by roughly 1.33 (this is how much extra space is padded on to extracted images for head centering).

So 1024 (model input size) * 4/3 ≈ 1366px (the size of a full extract image which holds a 1024px face-centered image). If you are using coverage below 100%, you would need to increase the ratio accordingly.
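The maths above can be wrapped up as follows - a minimal sketch which assumes the head-centering padding is exactly 4/3 (the ~1.33 factor quoted above); `extract_size` is a hypothetical helper, not a Faceswap function:

```python
import math

def extract_size(model_size, coverage=1.0, head_padding=4 / 3):
    """Minimum extract image size for a given model input size.

    coverage is the training coverage as a fraction (1.0 = 100%);
    lower coverage needs a proportionally larger extract image.
    """
    return math.ceil(model_size * head_padding / coverage)

print(extract_size(1024))         # 1366 -- matches the figure above
print(extract_size(1024, 0.875))  # 1561 -- 87.5% coverage needs more
```
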

My word is final


ianstephens
Posts: 97
Joined: Sun Feb 14, 2021 7:20 pm
Has thanked: 11 times
Been thanked: 7 times

Re: Potential VRAM Saving techniques

Post by ianstephens »

Yes, you're right - what was I thinking? I was already aware of this padding from previous (smaller) models. Having a slow day (or week) :)

I'm going to run extraction @ 1536px.

I should be able to provide some examples and images from this test run @ 1024px. I'll let you know once I've got somewhere with the training - it's going to be a long one for sure :)

The harder thing now is to find high enough quality datasets to make use of the new model size. It may mean I have to remaster some of the sources.


ianstephens
Posts: 97
Joined: Sun Feb 14, 2021 7:20 pm
Has thanked: 11 times
Been thanked: 7 times

Re: Potential VRAM Saving techniques

Post by ianstephens »

Just an update.

I've managed to compile two decent data sources for both A and B, extracted at 1536px to run with the new DNY 1024 model.

I've noticed a few things.

We are running Full Precision @ batch size 4 (the highest we could get even after the initial model build).

I'm noticing that there is a much longer delay between iterations compared with, for example, the 256 StoJo model. The GPU is not used to its full potential. For example, this is a graph for GPU usage:

[Attachment: Screen Shot 2022-08-04 at 19.45.59.png - GPU usage graph]

Whereas with the 256 StoJo model we would see a nice 90-100% usage at all times.

I believe the slowdown is caused by the larger amount of data (the larger images are three times the size they were before) needing to be transferred from the machine to the GPU for each iteration. Of course, once it has iterated through the entire image set and the data has been cached in RAM, it's ever so slightly faster.

Because of this, I thought why not turn on "central storage". This allowed me to increase the batch size to 5 (instead of 4), and from early testing I don't believe there is any performance hit.

Just an update from this end. Once I'm a few days in I'll try to post some images.

Aside, have any of you guys played with one of these?:
https://www.nvidia.com/en-us/design-vis ... rtx-a6000/

...The 48GB VRAM is tempting.


bryanlyon
Site Admin
Posts: 695
Joined: Fri Jul 12, 2019 12:49 am
Answers: 42
Location: San Francisco
Has thanked: 3 times
Been thanked: 169 times
Contact:

Re: Potential VRAM Saving techniques

Post by bryanlyon »

ianstephens wrote: Fri Aug 05, 2022 7:21 pm

Aside, have any of you guys played with one of these?:
https://www.nvidia.com/en-us/design-vis ... rtx-a6000/

...The 48GB VRAM is tempting.

I would LOVE to play with 48GB cards; sadly, Open Source doesn't pay well enough to justify such large purchases (and nobody has volunteered to let us use them). That said, some of our users HAVE used them, and they assure us that they run FS fine. One thing to note, though, is that while the cards have a lot of VRAM, they are effectively still just a memory-boosted 3090 and do not go (much) faster than a 3090, aside from the doubled RAM. In general, I don't suggest upgrading your GPU if you can run the model you're interested in at a lower batch size. You could always add a second GPU to double your batch size while also increasing (though not quite doubling) your training speed.

Obviously there are times when an A6000 is the better solution, but for 90% of users, it's just not.


ianstephens
Posts: 97
Joined: Sun Feb 14, 2021 7:20 pm
Has thanked: 11 times
Been thanked: 7 times

Re: Potential VRAM Saving techniques

Post by ianstephens »

Yeah, I hear what you're saying. We couldn't get another 3090 (Founders) in our machine, as space is limited and they are behemoths of cards. Good to hear you have feedback from users of the A6000. When you say "double" batch size/speed, do you mean adding a second GPU identical to the first? We also have a 2080 Ti installed in the same machine, but of course distributed training would limit both - am I correct in thinking that?

I believe the A6000 has a slimmer profile so could possibly fit with a 2x configuration in our setup. 2x A6000 would be a naughty configuration albeit at a huge cost.

What are your thoughts re. batch sizes? Is it much of an issue, in terms of getting a decent swap, to use smaller batch sizes? I know larger batch sizes improve generalization - but really, is it that big a deal? Your opinions are greatly appreciated :)


torzdf
Posts: 1855
Joined: Fri Jul 12, 2019 12:53 am
Answers: 136
Has thanked: 80 times
Been thanked: 376 times

Re: Potential VRAM Saving techniques

Post by torzdf »

ianstephens wrote: Fri Aug 05, 2022 7:21 pm

Just an update.
I'm noticing that there is a much longer delay between iterations compared with, for example, the 256 StoJo model. The GPU is not used to its full potential.

I believe the slowdown is caused by the larger amount of data (the larger images are three times the size they were before) needing to be transferred from the machine to the GPU for each iteration. Of course, once it has iterated through the entire image set and the data has been cached in RAM, it's ever so slightly faster.

Thanks. I will need to revisit the data loaders. When I last optimized, I optimized for 1024px, thinking that would be plenty, and I'm sure some regressions will have gone in since then.

My word is final

