
Distributed Training does not start

Posted: Wed Dec 16, 2020 3:32 am
by xsiox

I have 4 GeForce 1080 Ti cards in my rig.

When I start training without distributed, it works.

When I activate distributed and start training, it shows all GPUs in the log window, but it does not start training.

It just says "waiting for preview..." and waits.

I have now been waiting for 2 hours and it still has not started.

What could be the problem?


Re: Distributed Training does not start

Posted: Wed Dec 16, 2020 12:49 pm
by torzdf

@abigflea is our multi-GPU expert, so hopefully he will weigh in.

My guess would be that you are being bottlenecked by the bandwidth of your PCIe lanes. Try running 2 GPUs on known fast lanes and work out from there.
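
If you want to see what link each card has actually negotiated, here is a minimal diagnostic sketch (not part of faceswap itself) that reads the current PCIe generation and lane width per GPU through pynvml (the nvidia-ml-py bindings); treat the package and return-type details as assumptions that can vary between versions:

import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):  # older pynvml versions return bytes
            name = name.decode()
        cur_gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(handle)
        cur_width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(handle)
        max_gen = pynvml.nvmlDeviceGetMaxPcieLinkGeneration(handle)
        max_width = pynvml.nvmlDeviceGetMaxPcieLinkWidth(handle)
        print(f"GPU {i} ({name}): PCIe Gen {cur_gen} x{cur_width} "
              f"(max Gen {max_gen} x{max_width})")
finally:
    pynvml.nvmlShutdown()

Bear in mind that idle cards often drop to a slower link state to save power, so run this while the GPUs are busy to see the real negotiated width.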


Re: Distributed Training does not start

Posted: Wed Dec 16, 2020 8:15 pm
by abigflea

I suppose I earned "expert" the hard way.
Torzdf is correct about the PCIe lanes.

The cards need to communicate with each other, so overall performance is held back by the slowest one.

On typical mainboards, the first PCIe slot will run at x16 if the others are not populated.
If the first two slots are populated, both cards will usually communicate at x8, and that is just fine.
After that, mainboards begin to vary a lot.
The 3rd and 4th slots may only be wired for x4 or x1.
PCIe x4 is getting a bit slow for distributed training.
If one card is running at x1, all of them will be waiting on that slowest card.
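
To put rough numbers on that, here is a quick back-of-the-envelope sketch; the per-lane figures are approximate one-direction rates after encoding overhead, not measurements from my rig:

# Approximate one-direction PCIe bandwidth per lane, in GB/s, after encoding overhead.
PER_LANE_GBPS = {1: 0.25, 2: 0.5, 3: 0.985, 4: 1.969}

def pcie_bandwidth_gbps(gen, lanes):
    """Rough one-direction bandwidth for a link of the given generation and width."""
    return PER_LANE_GBPS[gen] * lanes

for gen, lanes in [(3, 16), (3, 8), (3, 4), (3, 1)]:
    print(f"PCIe Gen {gen} x{lanes}: ~{pcie_bandwidth_gbps(gen, lanes):.1f} GB/s")
# Prints roughly: x16 ~15.8, x8 ~7.9, x4 ~3.9, x1 ~1.0 GB/s

So an x1 link has around 1/16th the bandwidth of the x16 slot, which is why every card ends up waiting on it.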

For funzies, I tried a 2070 at PCIe Gen 3 x16 plus a 1060 at PCIe Gen 1 x1 (on a PCIe riser card like the ones used in bitcoin mining rigs).

Training took about 3 hours to start, and was slower than the 1060 training solo.
Once it started, I think I was getting 1 iteration every 2-3 seconds.