Distributed Training does not start

xsiox

Distributed Training does not start

Post by xsiox »

I have four GeForce 1080 Ti cards in my rig.

When I start training without Distributed, it works.

When I enable Distributed and start training, it shows all the GPUs in the log window, but training never starts. It just says "waiting for preview..." and waits.

I have now been waiting for 2 hours and it still has not started.

What could be the problem?

torzdf

Re: Distributed Training does not start

Post by torzdf »

abigflea is our multi-GPU expert, so hopefully he will weigh in.

My guess would be that you are being bottlenecked by the bandwidth of your PCIe lanes. Try running just 2 GPUs on known fast lanes and work out from there.
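
If it helps to narrow it down, one generic way to restrict a run to two specific cards is the standard CUDA_VISIBLE_DEVICES environment variable. This is just a minimal sketch assuming a TensorFlow backend, and the GPU indices are only examples, not a Faceswap-specific option:

import os

# Expose only the first two physical GPUs to this process.
# CUDA re-indexes them as 0 and 1 inside the process.
# This must be set before any CUDA-using library is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import tensorflow as tf
print(tf.config.list_physical_devices("GPU"))  # should list exactly 2 GPUs

If training starts normally with two cards on the fast slots, you can add the remaining cards back one at a time to find the one on a slow link.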


abigflea

Re: Distributed Training does not start

Post by abigflea »

I suppose I earned "expert" the hard way.
Torzdf is correct about the PCIe lanes.

The cards need to communicate with each other, and performance is limited by the slowest card.

On typical mainboards the first PCIe slot will run at 16x if the others are not populated.
If the first two are populated, both cards will usually communicate at 8x, and this is just fine.
Beyond that, mainboards begin to vary a lot.
The 3rd and 4th slots may only be hardwired for 4x or 1x.
PCIe 4x is getting a bit slow for distributed training.
If one card is running at 1x, all of them will end up waiting on it.
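
If you want to see what link each card has actually negotiated, a quick way is a small NVML query via the pynvml Python bindings. This is only a sketch of a generic NVML check, not anything Faceswap-specific, and it assumes the pynvml package is installed:

import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):  # older pynvml versions return bytes
            name = name.decode()
        # Current vs. maximum PCIe generation and lane width for this card.
        # Note: an idle card may report a lower generation due to power saving.
        gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(handle)
        width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(handle)
        max_gen = pynvml.nvmlDeviceGetMaxPcieLinkGeneration(handle)
        max_width = pynvml.nvmlDeviceGetMaxPcieLinkWidth(handle)
        print(f"GPU {i} ({name}): Gen{gen} x{width} (max Gen{max_gen} x{max_width})")
finally:
    pynvml.nvmlShutdown()

Any card sitting at x1 or x4 in that output is a likely candidate for the bottleneck.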

For funzies, I tried a 2070 at PCIe Gen 3 x16 plus a 1060 at PCIe Gen 1 x1 (on a PCIe riser card like the ones used for bitcoin mining rigs).

Training took about 3 hours to start, and was slower than the 1060 training solo.
Once it started, I think I was getting 1 iteration every 2-3 seconds.

