To get the benefit of multiple GPUs, you would want to increase the batch size (i.e. if you are training with BS 64 on 1 GPU, you'd want to train with BS 128 on 2 GPUs).
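The scaling rule above is just "keep the per-GPU batch size constant and multiply by the GPU count." A quick sketch (the numbers are example values, not a recommendation):

```python
# Linear batch-size scaling: the per-GPU batch stays fixed,
# so the global batch grows with the number of GPUs.
per_gpu_batch = 64

for num_gpus in (1, 2, 4):
    global_batch = per_gpu_batch * num_gpus
    print(f"{num_gpus} GPU(s) -> global batch size {global_batch}")
```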
So, if I run the training with batch size 64 on one GPU, then with 4 of the same GPUs I should increase the batch size to 256, right?
But I found that the maximum batch size setting is only 256, which means that even if I had, say, 5 or 10 GPUs on a single machine, I still couldn't improve the speed beyond what 4 GPUs give?
Hope you can answer, thanks!
You can manually enter any number; the slider only stops there because it's a "normal" limit. If you need 512 you can enter 512, but remember that models stop learning details at very high batch sizes.
OK, I see, thank you!
I thought that with multiple GPUs, using batch size 256 on 4 GPUs would keep the same detail as batch size 64 on 1 GPU, so this is wrong, right?
If it's wrong, why is that? Doesn't the model get split across the GPUs, with each one using a batch size of 64 (4 GPUs, BS=256)?
Training has two passes, one forward and one backward. The forward pass gets split across all GPUs, but the backward update happens once, with all the batches at once. As you increase the batch size, training gets faster since more images go into each backward pass, but it also becomes noisier as the gradients interfere with each other. Splitting across multiple GPUs doesn't solve this interference issue.
Batch sizes that are too small have their own problems as well. This is why we normally recommend batch sizes between 8 and 128.
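The split-forward / single-update idea above can be sketched with a toy pure-Python example (no real GPUs involved; the model, data, and shard counts are all made up for illustration). The point is that averaging the 4 per-GPU gradients gives the same update as one gradient over the full global batch of 256, which is why 4 GPUs at BS 64 behaves like BS 256 on one GPU, not like BS 64:

```python
import math

def grad_shard(w, shard):
    # Per-"GPU" gradient of mean squared error for the toy model y = w * x.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

w = 0.5
data = [(x, 3.0 * x) for x in range(1, 257)]  # global batch of 256 samples

# "Forward" work is split: 4 simulated GPUs, 64 samples each.
shards = [data[i::4] for i in range(4)]
per_gpu_grads = [grad_shard(w, s) for s in shards]

# ...but the update happens ONCE, with the averaged gradient.
avg_grad = sum(per_gpu_grads) / len(per_gpu_grads)

# Identical (up to float rounding) to one gradient over all 256 samples.
full_grad = grad_shard(w, data)
print(f"averaged grad {avg_grad:.4f} vs full-batch grad {full_grad:.4f}")
```

So the effective batch size for the gradient step is 256, and any noise/interference at BS 256 is present regardless of how many GPUs the forward pass was split across.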