Distributed with Dual 2060 supers
Posted: Thu Oct 01, 2020 3:34 am
by dheinz70
I just upgraded from my old GTX 780 to two RTX 2060 Supers. I have a couple of questions about training.
What is the difference between clicking the "Distributed" button and just leaving it to use both cards? From what I've been able to find in the forums, Distributed is desirable, but I'm not clear on why.

[Attachment: Screenshot from 2020-09-30 22-32-10.png]
Re: Distributed with Dual 2060 supers
Posted: Thu Oct 01, 2020 3:49 am
by bryanlyon
From the training guide: viewtopic.php?f=6&t=146&p=456#settings
torzdf wrote (Sun Sep 29, 2019 10:38 pm):
Distributed - [NVIDIA ONLY] - Enable Tensorflow's Mirrored Distribution strategy for training on multiple GPUs. If you have multiple GPUs in your system, then you can utilize them to speed up training. Do note that this speed-up is not linear, and the more GPUs you add, the more diminishing returns will kick in. Ultimately it allows you to train bigger batch sizes by splitting them across multiple GPUs. You will always be bottlenecked by the speed and VRAM of your weakest GPU, so this works best when training on identical GPUs. You can read more about Tensorflow Distribution Strategies here.
Without this, it will only use one GPU even if you have multiple.
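To give a rough idea of what that option enables, here is a minimal sketch of TensorFlow's MirroredStrategy (illustrative only - the model and data below are placeholders, not Faceswap's own code):
Code:
import tensorflow as tf

# MirroredStrategy copies the model to every visible GPU and splits each
# batch across them, averaging the gradients with an all-reduce step.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables (model weights, optimizer slots) created inside the scope
    # are mirrored onto each GPU.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(128,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# The global batch is divided between the replicas, which is why two cards
# let you train a larger batch size than one card alone.
x = tf.random.normal((256, 128))
y = tf.random.normal((256, 1))
model.fit(x, y, batch_size=64, epochs=1)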
Re: Distributed with Dual 2060 supers
Posted: Thu Oct 01, 2020 3:54 am
by dheinz70
A follow-up, if I may: why does it take so long to start? It stays at this point for quite a long time...
09/30/2020 22:31:50 INFO batch_all_reduce: 102 all-reduces with algorithm = nccl, num_packs = 1
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:32:06 INFO batch_all_reduce: 102 all-reduces with algorithm = nccl, num_packs = 1
Then sometimes it fails, or just stays here and never starts training. Is there really that much overhead in getting two cards to work together compared to just one?
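For context, those "batch_all_reduce ... algorithm = nccl, num_packs = 1" lines come from the distribution strategy wiring up gradient reduction across the cards, which happens once before the first iteration. A rough sketch of the setting they correspond to, assuming TensorFlow 2.x (this mirrors what the option does in general, not Faceswap's exact code):
Code:
import tensorflow as tf

# The log shows NCCL being used to sum gradients across GPUs; the same
# cross-device ops can also be requested explicitly:
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.NcclAllReduce(num_packs=1)
)

# Building this reduction graph for every trainable variable is done up
# front, before training starts, which is where the start-up pause comes from.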
Re: Distributed with Dual 2060 supers
Posted: Thu Oct 01, 2020 8:12 am
by torzdf
Which model?
I don't have multiple GPUs to test this with, but I have had no feedback that enabling distributed training slows the startup down.
deephomage or abigflea may be able to give some feedback, though.
Re: Distributed with Dual 2060 supers
Posted: Thu Oct 01, 2020 9:03 am
by abigflea
Sometimes it will take 2-3 minutes for me. I never considered it an issue when it's going to be training for 10 days anyway.
Re: Distributed with Dual 2060 supers
Posted: Thu Oct 01, 2020 10:40 pm
by djandg
I can confirm it does take longer to initialise with 2 GPUs (when distributed) than with a single GPU. I found that out the hard way, and now I treat training starting quickly as a sort of warning that I've forgotten to tick the Distributed box to use both GPUs when starting a new training session.
It's still a relatively short initialisation time in reality.
If initialisation fails, try changing to the Studio driver instead of the Game Ready driver - the latest version, of course.
Re: Distributed with Dual 2060 supers
Posted: Fri Oct 02, 2020 3:27 am
by dheinz70
I've been doing some playing with this. It seems that any attempt to use GPU0 fails; it keeps spitting out OOM errors. When I select just GPU1, it works as it should. I'm using Ubuntu 20.04, and I think it might be related to how my second monitor (connected to GPU1) won't work properly.

[Attachment: Screenshot from 2020-10-01 22-26-34.png]
Both GPUs are identical. Any ideas?
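If it helps anyone hitting the same thing, TensorFlow can also be restricted to one card outside the GUI. A minimal sketch, assuming the CUDA device ordering matches what the GUI shows (the indices here are illustrative):
Code:
import tensorflow as tf

# List the cards TensorFlow can see, then hide GPU0 so only GPU1 is used.
# This must run before the GPUs are initialised by any other TF call.
gpus = tf.config.list_physical_devices("GPU")
print(gpus)
if len(gpus) > 1:
    tf.config.set_visible_devices(gpus[1], "GPU")

# Alternatively, hide the card before the process starts at all:
#   CUDA_VISIBLE_DEVICES=1 python faceswap.py train ...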
Re: Distributed with Dual 2060 supers
Posted: Fri Oct 02, 2020 4:01 am
by dheinz70
The verbose log is too big to attach. It keeps giving:
2020-10-01 22:41:27.315577: I tensorflow/stream_executor/cuda/cuda_driver.cc:763] failed to allocate 5.06G (5437426176 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-10-
Re: Distributed with Dual 2060 supers
Posted: Fri Oct 02, 2020 5:47 am
by dheinz70
It's definitely starting to look like a problem with X on Linux and probably not a Faceswap issue. It doesn't like anything to do with GPU0 no matter how I have the cards installed (one or two). I've found some others having similar problems; any help appreciated.
Re: Distributed with Dual 2060 supers
Posted: Fri Oct 02, 2020 7:59 am
by dheinz70
For any Linux users: I opened a ticket with Nvidia. This is not a Faceswap issue - it is clearly an Nvidia driver problem.
https://forums.developer.nvidia.com/t/2 ... -04/156103
Re: Distributed with Dual 2060 supers
Posted: Fri Oct 02, 2020 4:27 pm
by dheinz70
My DisplayPort-to-DVI adapter came this morning. I plugged monitor 1 into GPU0, and GNOME came up using both screens as it did with my old single card. My guess is the Nvidia driver just doesn't want display 0 plugged into GPU0 and display 1 plugged into GPU1; it wants them both plugged into GPU0. That is working now.
I still can't get GPU0 to do CUDA tasks. It always says it can't allocate the memory on the card:
2020-10-01 22:41:27.315577: I tensorflow/stream_executor/cuda/cuda_driver.cc:763] failed to allocate 5.06G (5437426176 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
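A quick way to see how much of each card the desktop itself is holding is to query the driver directly. A sketch using nvidia-smi from Python (the query fields are standard nvidia-smi options; this is just for inspection, not part of Faceswap):
Code:
import subprocess

# Ask the driver how much memory each GPU has used and free. On a card
# driving the desktop, X/GNOME already holds a chunk of VRAM, which leaves
# less available for CUDA allocations.
out = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,name,memory.used,memory.free",
     "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)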
Re: Distributed with Dual 2060 supers
Posted: Fri Oct 02, 2020 5:38 pm
by abigflea
Have you tried with a single GPU?
Have you tried activating "allow growth"?
Have you reduced your batch size to something like 2-4 to see if that helps?
That is a frustrating snag.
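For context, "allow growth" maps onto TensorFlow's per-GPU memory growth setting, which stops TF grabbing nearly all the VRAM up front. A sketch of the equivalent call (illustrative, not Faceswap's exact code):
Code:
import tensorflow as tf

# By default TensorFlow reserves most of each card's VRAM at start-up.
# With memory growth enabled it allocates only what it needs, growing as
# required, which can avoid spurious out-of-memory failures on a card
# that is also driving a display.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)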
Re: Distributed with Dual 2060 supers
Posted: Fri Oct 02, 2020 6:17 pm
by dheinz70
Allow growth is selected. I tried with batch sizes of 2 and 1; it still fails.
Attached is the system info.
Re: Distributed with Dual 2060 supers
Posted: Fri Oct 02, 2020 6:36 pm
by abigflea
OK. While I'm no expert, let's try a couple of ideas if you haven't already.
First, try a different model - try Original.
If that doesn't work, regenerate your alignments.
I swear I had this
Code:
MainProcess _training_0 multithreading run DEBUG Error in thread (_training_0): OOM when allocating tensor with shape[7,126,128,128] and type float on /jo
before, but only once. I fixed it after re-doing my alignments and mask; sadly I cannot remember the circumstances that caused it.
In any case, I can't imagine why you're not up and running.
Re: Distributed with Dual 2060 supers
Posted: Fri Oct 02, 2020 6:44 pm
by dheinz70
Selecting ONLY GPU0, I can get Lightweight (batch 32), DF128 (batch 10) and Original working.
SAE with a batch size of 1 fails.
Distributed, DFL-128 with a batch size of 8 is running. This is weird.
I'll try the alignments suggestion next.
Re: Distributed with Dual 2060 supers
Posted: Fri Oct 02, 2020 7:16 pm
by abigflea
Re: Distributed with Dual 2060 supers
Posted: Fri Oct 02, 2020 8:09 pm
by dheinz70
Redid alignments.
Distributed, Original, Batch 128 - worked!
Distributed, DFL-H128, batch 32 - worked!
I'll test more, but you might have figured it out. THANKS
Re: Distributed with Dual 2060 supers
Posted: Fri Oct 02, 2020 9:21 pm
by dheinz70
Distributed, DFL-SAE batch 2, FAILS.
Distributed, Villain batch 4, works.
Could it be an issue with SAE and distributed?
Re: Distributed with Dual 2060 supers
Posted: Fri Oct 02, 2020 11:42 pm
by dheinz70
Distributed Dlight works too.
Question: is it normal for it to take 15-25 minutes to start training on distributed? I've got a pretty beefy system; just wondering what's normal.
Re: Distributed with Dual 2060 supers
Posted: Sat Oct 03, 2020 4:21 am
by dheinz70
I notice this A LOT in the verbose logs as it starts to train:
failed to allocate 3.24G (3477464576 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
Without mixed precision I see it a lot.
With it checked, just one or two instances. I'm beginning to think this is where the issue lies.
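For anyone reading along: mixed precision keeps most activations and gradients in float16, roughly halving the memory they take, which is consistent with seeing far fewer of those failed-allocation messages. A sketch of how it is enabled in Keras (the exact call depends on the TensorFlow version; recent releases use the non-experimental name shown here):
Code:
import tensorflow as tf

# Store activations and gradients in float16 while keeping variables in
# float32. This roughly halves activation memory, so the allocator has to
# retry large allocations far less often.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# On older TensorFlow 2.x releases the equivalent call was:
# tf.keras.mixed_precision.experimental.set_policy("mixed_float16")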