Potential VRAM Saving techniques


Potential VRAM Saving techniques

Post by torzdf »

[Topic split from: viewtopic.php?t=2058]

I have done a not inconsiderable amount of testing around NaNs recently, really hoping I would find a solution, but I have drawn a blank. Where they get introduced into a model is inconsistent, although they nearly always seem to appear at some point in the forward pass, which clipping would not be able to help with.

The sad fact is that Mixed Precision does increase the chance of NaNs; that is just the nature of using a more limited numerical range. My focus, more recently, has been on looking at other ways to reduce VRAM usage, to make Full Precision training easier to enable, but Tensorflow is making this particularly hard for me. Methods I could easily introduce in TF 1.x have been totally disposed of since TF 2.x, and implementing any of them is somewhere between very difficult and impossible. If I could start this project again I would, undoubtedly, use PyTorch, where implementing these kinds of features is a LOT easier. TF devs just don't appear to care (about this and a myriad of other issues).
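
For reference, this is roughly what enabling mixed precision looks like through the stock Keras API (generic TF 2.x usage with a toy model, not Faceswap's actual training code). Dynamic loss scaling protects the backward pass against float16 underflow/overflow, but it cannot prevent NaNs that arise in the forward pass, which is the failure mode described above:

Code: Select all

import tensorflow as tf
from tensorflow.keras import layers, mixed_precision

# Compute in float16 where possible; variables stay in float32.
mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    layers.Dense(256, activation="relu", input_shape=(128,)),
    # Keep the final layer in float32 so the loss is calculated at full precision.
    layers.Dense(128, dtype="float32"),
])

# LossScaleOptimizer applies dynamic loss scaling to guard the backward pass.
optimizer = mixed_precision.LossScaleOptimizer(tf.keras.optimizers.Adam(learning_rate=1e-5))
model.compile(optimizer=optimizer, loss="mse")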

The other factor is that bigger and more complex models need lower learning rates. The latest models I'm developing need to start at between 1e-5 and 3e-5. That is also just a matter of fact.

Yet another factor feeding this is that learning rate scales with batch size. If you lower the batch size, you should lower the learning rate. This is fairly logical, as smaller batch sizes mean that outliers will have a larger effect on gradients. As models get larger and more complex, lower batch sizes are unavoidable. There has been some recent research around this, which I have not yet had an opportunity to fully digest: https://arxiv.org/abs/2006.09092
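
To make the relationship concrete, here is a small sketch of the common linear-scaling heuristic (the reference batch size and learning rate below are illustrative values, not Faceswap defaults):

Code: Select all

def scaled_learning_rate(batch_size, reference_batch_size=16, reference_lr=5e-5):
    """Scale the learning rate linearly with batch size (linear-scaling heuristic)."""
    return reference_lr * (batch_size / reference_batch_size)

print(scaled_learning_rate(16))  # 5e-05
print(scaled_learning_rate(8))   # 2.5e-05  (half the batch -> half the learning rate)
print(scaled_learning_rate(4))   # 1.25e-05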



Re: AutoClip: Any Feedback From Users?

Post by ianstephens »

torzdf wrote: Mon Jul 18, 2022 1:47 am

I have done a not inconsiderable amount of testing around NaNs recently, really hoping I would find a solution, but I have drawn a blank. Where they get introduced into a model is inconsistent, although they nearly always seem to appear at some point in the forward pass, which clipping would not be able to help with.

Yes, I also believed the NaNs appeared before the clipping had a chance to deal with them.

torzdf wrote: Mon Jul 18, 2022 1:47 am

The sad fact is that Mixed Precision does increase the chance of NaNs; that is just the nature of using a more limited numerical range. My focus, more recently, has been on looking at other ways to reduce VRAM usage, to make Full Precision training easier to enable, but Tensorflow is making this particularly hard for me. Methods I could easily introduce in TF 1.x have been totally disposed of since TF 2.x, and implementing any of them is somewhere between very difficult and impossible. If I could start this project again I would, undoubtedly, use PyTorch, where implementing these kinds of features is a LOT easier. TF devs just don't appear to care (about this and a myriad of other issues).

I think this idea is the solution, though. Being able to use the full 32-bit range seems to be the answer. As you say, reducing VRAM usage is the key stumbling block. There is no way we could run the models we do, in their current form, without mixed precision. Even with the behemoth 3090 and its 24GB of VRAM, we have to run at batch sizes of 4-5 on these large models. Would it be a mammoth exercise to port over to PyTorch?

torzdf wrote: Mon Jul 18, 2022 1:47 am

The other factor is that bigger and more complex models need lower learning rates. The latest models I'm developing need to start at between 1e-5 and 3e-5. That is also just a matter of fact.

That's interesting. What do you set for your EE on these tests? Also, do you modify/increase the learning rate once the model becomes somewhat "stable"?

torzdf wrote: Mon Jul 18, 2022 1:47 am

Yet another factor feeding this is that learning rate scales with batch size. If you lower the batch size, you should lower the learning rate. This is fairly logical, as smaller batch sizes mean that outliers will have a larger effect on gradients. As models get larger and more complex, lower batch sizes are unavoidable. There has been some recent research around this, which I have not yet had an opportunity to fully digest: https://arxiv.org/abs/2006.09092

Yes, as mentioned above - even with a 3090 and 24GB VRAM we have to use stupidly low batch sizes (4-5) on these large models. I will take a look at the link you provided.

Happy to donate some extra $ in order to port over to PyTorch :lol:

Aside, the project is fab, and thank you for all you do.

P.S. Still loving EfficientNetV2 :D


Re: AutoClip: Any Feedback From Users?

Post by torzdf »

ianstephens wrote: Mon Jul 18, 2022 9:15 pm

Would it be a mammoth exercise to port over to PyTorch?

Yes, it would be huge. So much so that I would rather be buried in the quagmire of TF code than even attempt it at this stage.

ianstephens wrote: Mon Jul 18, 2022 9:15 pm

That's interesting. What do you set for your EE on these tests? Also, do you modify/increase the learning rate once the model becomes somewhat "stable"?

-5 for EE, although given what you have said -4 may be better. I just hate the idea of losing accuracy.

I never increase learning rate, only ever decrease it. Increasing it again doesn't really make sense in the realms of ML.

ianstephens wrote: Mon Jul 18, 2022 9:15 pm

Yes, as mentioned above - even with a 3090 and 24GB VRAM we have to use stupidly low batch sizes (4-5) on these large models. I will take a look at the link you provided.

Well, I have now added something which should save some VRAM, although not as much as Mixed Precision does, so it may not solve your issue.

I mostly added this for single GPU use, but it is a valid strategy for Multi-GPU use too. The current Multi-GPU (Mirrored) strategy copies the variables to all connected GPUs. This is faster as each GPU has its own copy and just updates locally.

Central Storage Strategy (https://www.tensorflow.org/api_docs/pyt ... geStrategy) is a technique which keeps the variables on a single device and updates them centrally. I have forced this to store variables on the CPU. This will free up some VRAM. Not a huge amount, but enough for me to raise the batch size a little in local testing.
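
For anyone wanting to experiment with this outside of Faceswap, the underlying TF 2.x API looks roughly like the sketch below (a toy model for illustration; Faceswap wires this up internally):

Code: Select all

import tensorflow as tf

# Keep the variables on the CPU; computation still happens on the GPU.
strategy = tf.distribute.experimental.CentralStorageStrategy(
    compute_devices=["/GPU:0"],
    parameter_device="/CPU:0",
)

with strategy.scope():
    # Variables created inside this scope live on the CPU, freeing some VRAM
    # at the cost of host <-> GPU copies on every training step.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5), loss="mse")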

The downside is that it is not compatible with Mixed Precision training. For the aims I am trying to achieve that is not necessarily a bad thing, but it's worth bearing in mind.

For single-GPU use, I have actually hacked it so that it will support Mixed Precision training, but for Multi-GPU it's definitely a no-no for now. You may want to test this out (if you can get your model loaded at full precision), because it definitely does save some VRAM, and the increased overhead of host <-> GPU copies does not seem too bad in my limited testing. It also doesn't seem to trigger the strange loss-printing errors (which force a crash) that the mirrored strategy introduces.

I have another technique for limiting VRAM usage in development at the moment, but this is a lot more involved and is likely to take me much longer to implement (if I manage to do so at all). However, it should lead to vastly improved VRAM usage at the cost of speed.
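
The post does not say what that technique is, but one approach matching the trade-off described (much lower VRAM at the cost of speed) is activation/gradient checkpointing, which TensorFlow exposes as tf.recompute_grad. Purely as an illustration of the idea, with no claim that this is what will land in Faceswap:

Code: Select all

import tensorflow as tf

dense_a = tf.keras.layers.Dense(256, activation="relu")
dense_b = tf.keras.layers.Dense(256, activation="relu")
# Build the layers up-front so no variables are created inside the checkpointed function.
dense_a.build((None, 128))
dense_b.build((None, 256))

@tf.recompute_grad
def checkpointed_block(x):
    # Intermediate activations in this block are not kept for the backward pass;
    # they are recomputed from `x` when gradients are needed, trading extra
    # compute time for lower VRAM usage.
    return dense_b(dense_a(x))

x = tf.random.normal((4, 128))
with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(checkpointed_block(x)))

grads = tape.gradient(loss, dense_a.trainable_variables + dense_b.trainable_variables)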



Re: Potential VRAM Saving techniques

Post by ianstephens »

So I've been playing with some models, different settings, etc.

I've been using your new DNY models, as you are familiar with them and they give a better frame of reference when reporting back.

Testing on a single 3090 (24GB VRAM), I am able to load your 1024 px DNY model @ a batch size of 4 max at FP. I haven't tried MP but of course, it would be higher. I am (also) trying to stay away from MP due to the NaNs. I am running an EE of 7 with a learning rate of 2e-05 for this particular model - would you agree?

I have also been playing with "central storage". While it does enable me to increase the batch size so very slightly, I am not a fan of it as I've noticed at least a 40-50% performance (speed) hit when it comes to iteration times.

The transfers between CPU <--> GPU are far too slow compared with the GPU having everything right there in its own memory. I don't think it's worth the performance hit for the small increase in batch size. However, if the increase in batch size were quite a bit more substantial, I think the performance hit would be worth it.

On a side note, when running your DNY model @ 1024px we get the following error in the log window:

Code: Select all

DecompressionBombWarning: Image size (118638000 pixels) exceeds limit of 89478485 pixels, could be decompression bomb DOS attack.

torzdf wrote:

I have another technique for limiting VRAM usage in development at the moment, but this is a lot more involved and is likely to take me much longer to implement (if I manage to do so at all). However, it should lead to vastly improved VRAM usage at the cost of speed.

I will look forward to this and it sounds fab :D

If you need me to test anything at my end then feel free to ask. I always pop in every few days to catch up :)

As a side note - the new preview window (I think it uses matplotlib) is problematic. It's fine initially, but after 60,000 iterations it becomes so laggy it's unusable; we have to constantly switch focus to and from it just to get the preview to refresh. We have gone back to using the preview panel directly in the FS GUI. Perhaps there is some garbage collection function missing - I don't know - you are the expert :)


Re: Potential VRAM Saving techniques

Post by torzdf »

ianstephens wrote: Fri Jul 22, 2022 8:13 pm

Testing on a single 3090 (24GB VRAM), I am able to load your 1024 px DNY model @ a batch size of 4 max at FP. I haven't tried MP but of course, it would be higher. I am (also) trying to stay away from MP due to the NaNs. I am running an EE of 7 with a learning rate of 2e-05 for this particular model - would you agree?

That should be fine. If you do train the model to any distance and have anything you can share, I would certainly be interested to see. Sadly I can't give up the compute time whilst I work on other things.

ianstephens wrote: Fri Jul 22, 2022 8:13 pm

I have also been playing with "central storage". While it does enable me to increase the batch size so very slightly, I am not a fan of it as I've noticed at least a 40-50% performance (speed) hit when it comes to iteration times.

That's not so great. I had hoped the performance hit would not be that high. I guess it's more useful for edge cases: loading models which would otherwise not be loadable (a measure of last resort).

ianstephens wrote: Fri Jul 22, 2022 8:13 pm

However, if the increase in batch size were quite a bit more substantial, I think the performance hit would be worth it.

The current solution I'm looking at will be more in this area... if it is even implementable in TF2.0. Rest assured it is something I am looking at, but I am still not convinced I will be able to implement it. Time will tell.

ianstephens wrote: Fri Jul 22, 2022 8:13 pm

On a side note, when running your DNY model @ 1024px we get the following error in the log window:

Code: Select all

DecompressionBombWarning: Image size (118638000 pixels) exceeds limit of 89478485 pixels, could be decompression bomb DOS attack.

Thanks. Will look into this when I get a second. It is just a warning, so it can probably be ignored. I may look to intercept it and output a more useful warning with actions a user can take. Basically, at the default preview display you are getting a 12,288x7,168px image, which is, frankly, insane. I would advise reducing the number of images in the preview display (this can be changed in Train Settings > Trainer > Number of preview images).
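
For what it's worth, that warning appears to come from Pillow's decompression-bomb check, so (assuming that is indeed the source) it can also be silenced, or the limit raised, in a user script. Reducing the preview image count as suggested above is still the better fix:

Code: Select all

import warnings
from PIL import Image

# Raise Pillow's pixel limit (set to None to disable the check entirely)...
Image.MAX_IMAGE_PIXELS = 200_000_000

# ...or just suppress this specific warning class.
warnings.simplefilter("ignore", Image.DecompressionBombWarning)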

ianstephens wrote: Fri Jul 22, 2022 8:13 pm

As a side note - the new preview window (I think it uses matplotlib) is problematic. It's fine initially, but after 60,000 iterations it becomes so laggy it's unusable; we have to constantly switch focus to and from it just to get the preview to refresh. We have gone back to using the preview panel directly in the FS GUI.

Thanks, will look at this too. I don't generally use the pop-up preview (I use the GUI preview, and save out images on occasion when I want to examine more closely). These kinds of issues, which don't come up until a period of time has passed, tend to get missed.



Re: Potential VRAM Saving techniques

Post by ianstephens »

A quick one... and I will reply to this post above very soon... however...

What input image sizes do you use for your DNY 1024px model? We have been extracting at 1024px - agree?


Re: Potential VRAM Saving techniques

Post by torzdf »

No, I would go higher. I have the maths somewhere....

Basically, if you are training on Face centering at 100% coverage, you'd want to multiply the model input size by a factor of roughly 1.33 (this is how much extra space is padded onto extracted images for head centering).

So 1024 (model output size) * 1.333 ≈ 1365px (the size of a full extract image which holds a 1024px face-centered image). If you are using coverage below 100%, you would need to increase the ratio accordingly.
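
A tiny sketch of that calculation, treating "up the ratio accordingly" as dividing by the coverage fraction (my reading of the rule above, so double-check against your own settings):

Code: Select all

def recommended_extract_size(model_size, coverage=1.0, head_padding=4 / 3):
    """Extract image size needed so `model_size` pixels remain for the model."""
    return round(model_size * head_padding / coverage)

print(recommended_extract_size(1024))                  # ~1365px at 100% coverage
print(recommended_extract_size(1024, coverage=0.875))  # a larger extract is needed below 100%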



Re: Potential VRAM Saving techniques

Post by ianstephens »

Yes, you're right - what was I thinking? I was already aware of this padding from previous (smaller) models. Having a slow day (or week) :)

I'm going to run extraction @ 1536px.

I should be able to provide some examples and images from this test run @ 1024px. I'll let you know once I've got somewhere with the training - it's going to be a long one for sure :)

The harder thing now is to find high enough quality datasets to make use of the new model size. It may mean I have to remaster some of the sources.


Re: Potential VRAM Saving techniques

Post by ianstephens »

Just an update.

I've managed to compile two decent data sources for sides A and B, extracted at 1536px, to run with the new DNY 1024 model.

I've noticed a few things.

We are running Full Precision @ batch size 4 (the highest we could get even after the initial model build).

I'm noticing that there is a much longer delay between iterations compared with, for example, the 256 StoJo model. The GPU is not used to its full potential. For example, this is a graph for GPU usage:

[Attachment: Screen Shot 2022-08-04 at 19.45.59.png - GPU usage graph]

Whereas with the 256 StoJo model we would see a nice 90-100% usage at all times.

I believe the slowdown is caused by the larger amount of data (the images are three times the size they were before) needing to be transferred from the machine to the GPU for each iteration. Of course, once it has iterated through the entire image set and the images are cached in RAM, it's ever so slightly faster.

Because of this, I thought why not turn on "central storage". This allowed the batch size to be increased to 5 (instead of 4), and from early testing I don't believe there is any performance hit.

Just an update from this end. Once I'm a few days in I'll try to post some images.

Aside, have any of you guys played with one of these?:
https://www.nvidia.com/en-us/design-vis ... rtx-a6000/

...The 48GB VRAM is tempting.


Re: Potential VRAM Saving techniques

Post by bryanlyon »

ianstephens wrote: Fri Aug 05, 2022 7:21 pm

Aside, have any of you guys played with one of these?:
https://www.nvidia.com/en-us/design-vis ... rtx-a6000/

...The 48GB VRAM is tempting.

I would LOVE to play with 48GB cards, but sadly Open Source doesn't pay well enough to justify such large purchases (and nobody has volunteered to let us use them). That said, some of our users HAVE used them, and they assure us that FS runs fine on them. One thing to note, though, is that while the cards have a lot of VRAM, they are effectively still just a memory-boosted 3090 and do not go (much) faster than a 3090 without the doubled RAM. In general, I don't suggest upgrading your GPU if you can run the model you're interested in at a lower batch size. You could always add a second GPU to double your batch size while also increasing (though not quite doubling) your training speed.

Obviously there are times when an A6000 is the better solution, but for 90% of users, it's just not.


Re: Potential VRAM Saving techniques

Post by ianstephens »

Yeah, I hear what you're saying. We couldn't get another 3090 (Founders) in our machine as space is limited and they are behemoths of cards. Good to hear you have feedback from users of the A6000. When you say "double" the batch size/speed, do you mean adding an identical GPU as the second card? We have a 2080 Ti also installed in the same machine but, of course, distributed training would limit both - am I correct in thinking that?

I believe the A6000 has a slimmer profile so could possibly fit with a 2x configuration in our setup. 2x A6000 would be a naughty configuration albeit at a huge cost.

What are your thoughts re. batch sizes? Is it much of an issue in terms of a decent swap with smaller batch sizes? I know larger batch sizes improve generalization - but really, is it much of a big deal? Your opinions are greatly appreciated :)


Re: Potential VRAM Saving techniques

Post by torzdf »

ianstephens wrote: Fri Aug 05, 2022 7:21 pm

Just an update.

I'm noticing that there is a much longer delay between iterations compared with, for example, the 256 StoJo model. The GPU is not used to its full potential. [...]

I believe the slowdown is caused by the larger amount of data (the images are three times the size they were before) needing to be transferred from the machine to the GPU for each iteration. Of course, once it has iterated through the entire image set and the images are cached in RAM, it's ever so slightly faster.

Thanks. I will need to revisit dataloaders. When I last optimized, I optimized to 1024px thinking that would be plenty, and I'm sure some regressions will have gone in since then.
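
As a general illustration of what a dataloader revisit tends to target (this is stock tf.data with placeholder paths and sizes, not Faceswap's own loader): the aim is to keep the GPU fed by preparing and prefetching the next batches on the CPU while the current batch trains.

Code: Select all

import tensorflow as tf

def decode_and_resize(path):
    image = tf.io.decode_png(tf.io.read_file(path), channels=3)
    return tf.image.resize(image, (1024, 1024)) / 255.0

dataset = (
    tf.data.Dataset.list_files("faces/*.png")                # placeholder glob
    .map(decode_and_resize, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(4)
    .prefetch(tf.data.AUTOTUNE)  # overlap host-side preparation with GPU compute
)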



Re: Potential VRAM Saving techniques

Post by torzdf »

ianstephens wrote: Fri Aug 05, 2022 7:21 pm

We are running Full Precision @ batch size 4 (the highest we could get even after the initial model build).

I'm noticing that there is a much longer delay between iterations compared with, for example, the 256 StoJo model. The GPU is not used to its full potential. [...]

Following your feedback on this, I have spent a not inconsiderable amount of time revisiting the dataloaders. I would be interested to know if it makes any difference for you. Also squashed some bugs I did not know existed.

Details:

Bugs fixed:

  • Issue with the preview needing to be refreshed twice to update the images
  • Issue where multiple threads would read/write to the same memory pointer, leading to image corruption
  • Issue where warp/warp-to-landmarks would behave differently depending on training image size
  • Issue where masks would sometimes be slightly misaligned

Optimizations:

  • Minor optimizations across the whole augmentation pipeline
  • Move more processing into the caching stage (slows down caching, but speeds up subsequent iterations)
  • General refactoring to make the code more maintainable
  • Process data at model input/output size rather than saved image size

Some numbers:

Test data (1043 A images, 1229 B images). 1000 iterations

  • 512px training image size, 64px model output size, 64 batch size: ~65% faster (including caching)
  • 684px training image size, 512px model output size, 8 batch size: ~15% faster (including caching)



Re: Potential VRAM Saving techniques

Post by ianstephens »

Let me run some tests and grab some data. Leave it with me :D


Re: Potential VRAM Saving techniques

Post by ianstephens »

I thought an easy way to gauge it would be to run the exact same settings, with the only change being an update to the current codebase with the optimizations applied. Then we can compare EGs/s.

Running DNY 1024 model with 1536px images. Side A has 8000 images and side B has 5000. We are running batch size 4 with central storage enabled.

Before update:

[Attachment: Screen Shot 2022-08-22 at 13.54.37.png - EGs/s stats before the update]

After update stats with exact same settings as noted above:

[Attachment: Screen Shot 2022-08-23 at 08.38.16.png - EGs/s stats after the update]

P.S. On a side note, I had a little play with batch sizes since the update. We can now go up to BS 5 with the standard (on-GPU) storage option. Before, we could only get BS 3-4, and had to enable central storage to be stable at BS 4.


Re: Potential VRAM Saving techniques

Post by ianstephens »

Update: I want to run at least 20-30k iterations as I have with the previous data above.

The reason being - from what I can tell - it appears that once the new caching/processing you mentioned has cycled through the entire image set, the time between iterations becomes noticeably faster.

I will update the above post with stats in the morning. From what I can tell it does indeed seem faster. The data will tell in the morning :)


Re: Potential VRAM Saving techniques

Post by torzdf »

Well, not necessarily. The batch processing time is faster. However, if the batch processing is not the bottleneck, but the actual model and/or copying to/from the GPU is, then faster preparation of images will make no difference.



Re: Potential VRAM Saving techniques

Post by torzdf »

ianstephens wrote: Mon Aug 22, 2022 2:04 pm

P.S. On a side note, I had a little play with batch sizes since the update. We can now go up to BS 5 with the standard (on-GPU) storage option. Before, we could only get BS 3-4, and had to enable central storage to be stable at BS 4.

Thanks for testing this, btw. If nothing else, it has helped me squash some bugs I did not know existed.

This update wouldn't impact BS at all, but good to know you got a bit more out of it.

How does the utilization graph look? Much the same?



Re: Potential VRAM Saving techniques

Post by ianstephens »

That's interesting that the updates shouldn't affect VRAM usage and batch size - I don't know what happened there, then.

GPU utilization (on the same settings as the last screenshot) has certainly improved, albeit not massively, but every little helps:

[Attachment: Screen Shot 2022-08-22 at 22.38.47.png - GPU utilization graph]

The troughs and peaks in the graph are slightly less drastic, and the GPU seems to run hotter over the session, so it has certainly helped. I'm unsure of the refresh rate/monitoring accuracy, but it's a reasonable baseline to use for monitoring changes in setup.

I would be interested to run this model with the GPU-stored variables (not central storage). Once I have data from the central storage test tomorrow, I will then move on to running it all on GPU and report back findings.


Re: Potential VRAM Saving techniques

Post by torzdf »

Perfect, many thanks.

Central Storage will definitely cause troughs and peaks. How much it impacts speed (and how bearable that is for you), I am not so sure about, though.

