A follow-up, if I may: why does it take so long to start? It stays at this point for quite a long time...
09/30/2020 22:31:50 INFO batch_all_reduce: 102 all-reduces with algorithm = nccl, num_packs = 1
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:32:06 INFO batch_all_reduce: 102 all-reduces with algorithm = nccl, num_packs = 1
Then sometimes it fails, or just stays here and never starts training. Is there really that much overhead in getting two cards to work together compared to just one?
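For context, here is a minimal sketch of the kind of setup that produces a log like the one above. The model is a stand-in (my real one evidently has enough trainable variables to account for the 102 all-reduces); the `NcclAllReduce(num_packs=1)` argument just mirrors the "algorithm = nccl, num_packs = 1" line, and is the default cross-device op on a multi-GPU Linux box anyway:

```python
import tensorflow as tf

# Two local GPUs, synchronous data parallelism. NcclAllReduce with
# num_packs=1 matches the "algorithm = nccl, num_packs = 1" log line.
strategy = tf.distribute.MirroredStrategy(
    devices=["/gpu:0", "/gpu:1"],
    cross_device_ops=tf.distribute.NcclAllReduce(num_packs=1),
)

with strategy.scope():
    # Stand-in model: each trainable variable produces one all-reduce
    # at graph-build time, which is where the count in the log comes from.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
```

If the NCCL initialization/handshake is what is slow here, would switching the cross-device op, e.g. to `tf.distribute.HierarchicalCopyAllReduce()`, be expected to start faster, even at the cost of some per-step throughput?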