
how does multi GPU training work

Posted: Fri Jul 03, 2020 11:34 pm
by police.bike

I am using Vast.ai to run multi-GPU training.

I am curious how it works.

Say I choose a 2-GPU machine and set the batch size to 64.
Does it split the batch, running 32 on one GPU and 32 on the other?

Or does the batch size mean 64 on each GPU? Curious to know how it works.

It was surprising that a machine with 2x 11 GB NVIDIA GPUs gives a lower iteration rate than my single 6 GB NVIDIA GPU.


Re: how does multi GPU training work

Posted: Sat Jul 04, 2020 8:41 am
by torzdf

The batch is split between GPUs.

To get the benefit of multiple GPUs, you would want to up the batchsize (i.e. if you are training BS 64 on 1 GPU, you'd want to train BS 128 on 2 GPUs).
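
For reference, here's a minimal sketch of how a typical data-parallel setup does this, using TensorFlow's MirroredStrategy with a placeholder model (illustrative only, not necessarily how Faceswap wires it up internally). The global batch is divided evenly across the GPUs, so scaling the batch size with the GPU count keeps each card as busy as it was on its own:

Code: Select all

# Minimal data-parallel sketch with tf.distribute.MirroredStrategy.
# Illustrative only -- the model and data here are placeholders.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()      # one replica per visible GPU
n_gpus = strategy.num_replicas_in_sync

GLOBAL_BATCH_SIZE = 64 * n_gpus                  # BS 64 on 1 GPU -> BS 128 on 2 GPUs

with strategy.scope():
    # Build and compile the model inside the strategy scope.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(256,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Dummy data just so the sketch runs end to end.
x = tf.random.normal((1024, 256))
y = tf.random.normal((1024, 1))

# Each GPU sees GLOBAL_BATCH_SIZE / n_gpus samples per step.
model.fit(x, y, batch_size=GLOBAL_BATCH_SIZE, epochs=1, verbose=0)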


Re: how does multi GPU training work

Posted: Wed Jul 29, 2020 5:23 am
by ericpan0513

So if I run training with batch size 64 on one GPU, then with 4 of the same GPUs I should increase the batch size to 256, right?
But I found that the batch size setting only goes up to 256, which would mean that even with 5 or 10 GPUs on a single machine I still couldn't improve the speed beyond what 4 GPUs give?
Hope you can answer, thanks!


Re: how does multi GPU training work

Posted: Wed Jul 29, 2020 10:58 pm
by bryanlyon

You can manually enter any number, the slider only stops there since it's a "normal" limit. If you need 512 you can enter 512, but remember that models stop learning details at very high batch sizes.


Re: how does multi GPU training work

Posted: Thu Jul 30, 2020 6:44 am
by ericpan0513

OK, I see, thank you!
I thought multiple GPUs worked like this: if I use batch size 256 on 4 GPUs, the detail would be the same as batch size 64 on 1 GPU. So that idea is wrong, right?
If it's wrong, why does that happen? Doesn't the batch get split across the GPUs so that each one is still training with a batch size of 64 (4 GPUs, BS=256)?


Re: how does multi GPU training work

Posted: Thu Jul 30, 2020 4:19 pm
by bryanlyon

Training uses two passes, one forward and one backward. The forward pass gets split across the GPUs, but the backward pass happens once, with all the sub-batches at the same time. As you increase the batch size, training gets faster since more images go through per backward pass, but it also becomes noisier as the gradients interfere with each other. Splitting across multiple GPUs doesn't solve this interference issue.

Batch sizes that are too small have their own problems too. This is why we normally recommend batch sizes between 8 and 128.
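
To put that in rough code form, here's a sketch of the "forward split, single backward" idea, simulated on one device with a hypothetical model and dummy data (real multi-GPU code would place each shard's forward pass on its own GPU):

Code: Select all

# Rough sketch of "forward split, single backward" data parallelism.
# Hypothetical model and data, simulated on one device for clarity.
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.MeanSquaredError()

def train_step(x, y, n_gpus=2):
    # The global batch is split into one sub-batch per GPU.
    x_shards = tf.split(x, n_gpus)
    with tf.GradientTape() as tape:
        # Forward pass runs shard by shard (each shard on its own GPU in a real setup).
        preds = tf.concat([model(xs, training=True) for xs in x_shards], axis=0)
        # One loss over the whole global batch...
        loss = loss_fn(y, preds)
    # ...and one backward pass / weight update, no matter how many GPUs ran forward.
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

x = tf.random.normal((64, 8))   # global batch of 64 -> 32 per simulated GPU
y = tf.random.normal((64, 1))
train_step(x, y)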