Distributed with Dual 2060 supers

If training is failing to start, and you are not receiving an error message telling you what to do, tell us about it here.


Forum rules

Read the FAQs and search the forum before posting a new topic.

This forum is for reporting errors with the Training process. If you want to get tips, or better understand the Training process, then you should look in the Training Discussion forum.

Please mark any answers that fixed your problems so others can find the solutions.

Distributed with Dual 2060 supers

Post by dheinz70 »

I just upgraded from my old GTX 780 to two RTX 2060 Supers. I have a couple of questions about training.

What is the difference between ticking the "Distributed" option and just leaving it to use both cards? From what I've been able to find in the forums, Distributed is desirable, but I'm not clear on why.

[Attachment: Screenshot from 2020-09-30 22-32-10.png]

Re: Distributed with Dual 2060 supers

Post by bryanlyon »

From the training guide: viewtopic.php?f=6&t=146&p=456#settings

torzdf wrote: Sun Sep 29, 2019 10:38 pm

Distributed - [NVIDIA ONLY] - Enable Tensorflow's Mirrored Distribution strategy for training on Multiple GPUs. If you have multiple GPUs in your system, then you can utilize them to speed up training. Do note that this speed up is not linear, and the more GPUs you add, the more diminishing returns will kick in. Ultimately it allows you to train bigger batch sizes by splitting them across multiple GPUs. You will always be bottlenecked by the speed and VRAM of your weakest GPU, so this works best when training on identical GPUs. You can read more about Tensorflow Distribution Strategies here.

Without this, it will only use one GPU even if you have multiple.
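
For illustration, here is a minimal TensorFlow sketch of what the Distributed option corresponds to under the hood: a plain MirroredStrategy that replicates the model onto every visible GPU and splits each batch across them. This is the generic API, not faceswap's actual training code.

Code: Select all

# Minimal sketch (generic TensorFlow, not faceswap's code): MirroredStrategy
# replicates the model on every visible GPU and splits each batch across them.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()            # uses all visible GPUs
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():                                  # variables are mirrored on each GPU
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")

# With two GPUs, a global batch of 64 becomes 32 per card, and the gradients
# are combined each step with an NCCL all-reduce.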


Re: Distributed with Dual 2060 supers

Post by dheinz70 »

A follow-up, if I may: why does it take so long to start? It stays at this point for quite a long time...

09/30/2020 22:31:50 INFO batch_all_reduce: 102 all-reduces with algorithm = nccl, num_packs = 1
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:32:06 INFO batch_all_reduce: 102 all-reduces with algorithm = nccl, num_packs = 1

Then sometimes it fails, or it just stays here and never starts training. Is there really that much overhead in getting two cards to work together compared to just one?


Re: Distributed with Dual 2060 supers

Post by torzdf »

Which model?

I don't have multiple GPUs to test this, but I have had no feedback that enabling distributed training slows down startup.

@deephomage or @abigflea may be able to give some feedback, though.

My word is final


Re: Distributed with Dual 2060 supers

Post by abigflea »

Sometimes it will take 2-3 minutes for me. I've never considered it an issue when it's going to be training for 10 days anyway.

:o I dunno what I'm doing :shock:
2X RTX 3090 : RTX 3080 : RTX: 2060 : 2x RTX 2080 Super : Ghetto 1060


Re: Distributed with Dual 2060 supers

Post by djandg »

I can confirm that it does take longer to initialise with two GPUs (when distributed) than with a single GPU. I found that out the hard way, and I now treat training starting quickly as a kind of warning to remind me to tick the Distributed box so both GPUs are used when starting a new training session. In reality, the initialisation time is still relatively short.

If initialisation fails, try switching to the Studio driver instead of the Game Ready driver - the latest version, of course.


Re: Distributed with Dual 2060 supers

Post by dheinz70 »

I've been doing some experimenting with this. It seems that any attempt to use GPU 0 fails; it keeps spitting out OOM errors. When I select just GPU 1, it works as it's supposed to. I'm using Ubuntu 20.04, and I think it might be related to how my second monitor (connected to GPU 1) won't work properly.

[Attachment: Screenshot from 2020-10-01 22-26-34.png]

Both GPUs are identical. Any ideas?


Re: Distributed with Dual 2060 supers

Post by dheinz70 »

The verbose log is too big to attach. It keeps giving:

2020-10-01 22:41:27.315577: I tensorflow/stream_executor/cuda/cuda_driver.cc:763] failed to allocate 5.06G (5437426176 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-10-

[Attachment: crash_report.2020.10.01.130450962950.log]

Re: Distributed with Dual 2060 supers

Post by dheinz70 »

It's definitely starting to look like a problem with X on Linux, and probably not a faceswap issue. It doesn't like anything to do with GPU 0, no matter how I have the cards installed (one or two). I've found some others having similar problems; any help is appreciated.


Re: Distributed with Dual 2060 supers

Post by dheinz70 »

For any Linux users: I opened a ticket with NVIDIA. This is not a faceswap issue; it is clearly an NVIDIA driver problem.

https://forums.developer.nvidia.com/t/2 ... -04/156103


Re: Distributed with Dual 2060 supers

Post by dheinz70 »

My DisplayPort-to-DVI adapter came this morning. I plugged monitor 1 into GPU 0, and GNOME came up using both screens as it did with my old single card. My guess is that the NVIDIA driver just doesn't want display 0 plugged into GPU 0 and display 1 plugged into GPU 1; it wants them both plugged into GPU 0. That is working now.

I still can't get GPU 0 to do CUDA tasks. It always says it can't allocate the memory on the card.

2020-10-01 22:41:27.315577: I tensorflow/stream_executor/cuda/cuda_driver.cc:763] failed to allocate 5.06G (5437426176 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
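
As a general workaround sketch (this is not a faceswap feature, and treating GPU 0 as the display card is only an assumption for the example): hiding the display GPU from CUDA before TensorFlow is loaded keeps the desktop's memory reservation out of the training process.

Code: Select all

# Sketch of a generic workaround: hide one GPU (index 0, assumed here to be
# the display card) from CUDA so TensorFlow only ever sees the second card.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"   # must be set before TensorFlow is imported

import tensorflow as tf
print(tf.config.list_physical_devices("GPU"))   # should now list a single GPU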


Re: Distributed with Dual 2060 supers

Post by abigflea »

Have you tried with a single GPU?

Have you tried activating "allow growth"? (See the sketch below.)

Have you tried reducing your batch size to something like 2-4 to see if that helps?

That is a frustrating snag.
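
For reference, "allow growth" maps to TensorFlow's per-GPU memory-growth flag. A minimal sketch of the equivalent setting outside the faceswap GUI:

Code: Select all

# Minimal sketch of what "allow growth" corresponds to in TensorFlow:
# allocate GPU memory on demand instead of reserving nearly all of it up front.
import tensorflow as tf

for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)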



Re: Distributed with Dual 2060 supers

Post by dheinz70 »

"Allow growth" is selected. I tried with batch sizes of 2 and 1; it still fails.

Attached is the system info.

[Attachment: sysspecs.txt]

Re: Distributed with Dual 2060 supers

Post by abigflea »

OK. While I'm no expert, let's try a couple of ideas if you haven't already.

First, try a different model - try Original.

If that doesn't work, regenerate your alignments.
I swear I had that

Code: Select all

MainProcess     _training_0     multithreading  run                       DEBUG    Error in thread (_training_0):  OOM when allocating tensor with shape[7,126,128,128] and type float on /jo

before, but only once. I fixed it after re-doing my alignments and mask; sadly, I cannot remember the circumstances that caused it.
In any case, I can't imagine why you're not up and running.



Re: Distributed with Dual 2060 supers

Post by dheinz70 »

Selecting ONLY GPU 0, I can get Lightweight (batch size 32), DFL-H128 (batch size 10), and Original working.

SAE with a batch size of 1 fails.

Distributed, DFL-H128 with a batch size of 8 is running. This is weird.

I'll try the alignment thing next.


Re: Distributed with Dual 2060 supers

Post by abigflea »

Try ONLY GPU 1 as well.



Re: Distributed with Dual 2060 supers

Post by dheinz70 »

I redid the alignments.

Distributed, Original, batch size 128 - worked!

Distributed, DFL-H128, batch size 32 - worked!

I'll test more, but you might have figured it out. THANKS!


Re: Distributed with Dual 2060 supers

Post by dheinz70 »

Distributed, DFL-SAE, batch size 2: fails.

Distributed, Villain, batch size 4: works.

Could it be an issue with DFL-SAE and Distributed?


Re: Distributed with Dual 2060 supers

Post by dheinz70 »

Distributed Dlight works too.

Question: is it normal for it to take 15-25 minutes to start training with Distributed enabled? I've got a pretty beefy system; I'm just wondering what's normal.


Re: Distributed with Dual 2060 supers

Post by dheinz70 »

I notice this A LOT in the verbose logs as it starts to train:

failed to allocate 3.24G (3477464576 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

Without MIXED PRECISION I see it a lot.

With it, I see just one or two instances. I'm beginning to think this is where the issue lies.
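
For context, here is a minimal sketch of what enabling mixed precision corresponds to in TensorFlow/Keras (the TF 2.4+ API is shown; faceswap exposes this as its own trainer option, so this is just the generic equivalent, not faceswap's code):

Code: Select all

# Minimal sketch: Keras mixed precision stores most activations in float16,
# roughly halving activation memory on RTX-class GPUs with tensor cores.
import tensorflow as tf

tf.keras.mixed_precision.set_global_policy("mixed_float16")
print(tf.keras.mixed_precision.global_policy())   # mixed_float16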
