
Locating Bottleneck between Training Iterations

Posted: Tue Jul 06, 2021 10:29 am
by ianstephens

We are noticing what appears to be some kind of bottleneck between training iterations.

We've got a brand-new 3090 FE installed in our test machine. When running the StoJo model, for example, we are noticing peaks and troughs in GPU usage between iterations.

As each iteration runs, GPU usage peaks at 90%+ and then drops back before rising again at the next iteration. Of course, some delay between iterations is expected, but this seems excessive.

[Attachment: iterations.png]

Where is the bottleneck? Our CPU is barely loaded during training - it's not taxed at all.

Is there any way to reduce the delay between iterations? Essentially, the GPU could be pushed a lot harder, but we can't work out where this delay is coming from. All of the training images are already in the system file cache (RAM), so it isn't even reading from disk.
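
In case it helps narrow things down, here is a rough sketch of how we could split the time per iteration between batch preparation and the GPU step. The load_batch and train_step functions are just placeholders for whatever the real training loop calls, not faceswap's actual API:

Code:

import time

def profile_iterations(load_batch, train_step, num_iters=50):
    """Crudely split wall-clock time between batch preparation and the GPU step.

    load_batch and train_step are placeholders for whatever the real training
    loop calls; this is only a sketch to help locate the delay.
    """
    prep_time = 0.0
    step_time = 0.0
    for _ in range(num_iters):
        t0 = time.perf_counter()
        batch = load_batch()      # CPU side: read/decode/augment the images
        t1 = time.perf_counter()
        train_step(batch)         # GPU side: forward/backward pass
        t2 = time.perf_counter()
        prep_time += t1 - t0
        step_time += t2 - t1
    print(f"avg batch prep: {prep_time / num_iters * 1000:.1f} ms")
    print(f"avg train step: {step_time / num_iters * 1000:.1f} ms")

The split is only approximate (the framework may return from the GPU step before the GPU has actually finished), but if batch preparation dominates, that would point at the CPU-side image pipeline rather than the GPU itself.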

Any advice or pointers are greatly appreciated.


Re: Locating Bottleneck between Training Iterations

Posted: Tue Jul 06, 2021 11:20 am
by ianstephens

Update/Test:

Lowering the batch size from 21 to 6 gives much faster cycling through iterations, with hardly any delay between them.

It still doesn't max out the GPU (also confirmed with nvidia-smi), but it cycles through iterations faster.

[Attachment: batch.png]

So higher batch sizes create a longer delay between iterations. But why? Where is the hold-up?

Perhaps the delay is the time it takes to send the entire batch of images for each iteration across the system to the GPU for processing? Maybe our hardware is holding us back - perhaps the CPU clock speed? We have plenty of cores (2x 12-core Xeons), but the clock speed is only 2.7 GHz.
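
As a rough sanity check on the transfer theory, here are some back-of-the-envelope numbers. The image resolution and PCIe bandwidth below are assumptions for illustration, not measured values from our machine:

Code:

# Back-of-the-envelope estimate of the host-to-GPU copy per batch.
# The image size and PCIe bandwidth below are assumptions for
# illustration only, not measured values.

batch_size = 21
height = width = 256            # assumed training image resolution
channels = 3
bytes_per_value = 4             # float32

batch_bytes = batch_size * height * width * channels * bytes_per_value
pcie_bytes_per_sec = 12e9       # rough effective PCIe 3.0 x16 throughput

print(f"batch size: {batch_bytes / 1e6:.1f} MB")                       # ~16.5 MB
print(f"copy time:  {batch_bytes / pcie_bytes_per_sec * 1e3:.2f} ms")  # ~1.4 ms

If those assumptions are anywhere near right, the raw copy is only a millisecond or two per batch, which would point more towards CPU-side image preparation (decoding/augmentation) than the PCIe transfer or the clock speed.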


Re: Locating Bottleneck between Training Iterations

Posted: Thu Jul 08, 2021 10:49 am
by torzdf

I will need to look into and test this. My initial assumption would be memory copies, though.


Re: Locating Bottleneck between Training Iterations

Posted: Thu Jul 08, 2021 1:22 pm
by ianstephens

Agreed - getting the batch of images to the GPU seems to be the slow part. Perhaps there are different/faster methods that could be coded - I'm no expert.
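
For what it's worth, one common pattern in TensorFlow is to let tf.data prepare the next batch on the CPU while the GPU is still busy with the current one. A minimal sketch, assuming a Python generator that yields float32 image batches (not faceswap's actual pipeline):

Code:

import tensorflow as tf

def make_prefetching_dataset(batch_generator, batch_shape):
    """Wrap an existing batch generator so the next batch is prepared
    while the GPU works on the current one.

    batch_generator and batch_shape are assumptions about the existing
    code; this only sketches the prefetching idea, not faceswap's API.
    """
    dataset = tf.data.Dataset.from_generator(
        batch_generator,
        output_signature=tf.TensorSpec(shape=batch_shape, dtype=tf.float32),
    )
    # AUTOTUNE lets TensorFlow decide how many batches to keep ready ahead
    # of the training step, overlapping CPU preparation with GPU compute.
    return dataset.prefetch(tf.data.AUTOTUNE)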

Thank you for looking into this. It would be super if we could make full use of the GPU and really heat things up with the crunching.

Here is some additional graphing from our other monitoring software. You can see the drops/peaks between the iterations.

[Attachment: graphing-new.jpg]

Re: Locating Bottleneck between Training Iterations

Posted: Sat Jul 10, 2021 10:01 pm
by ianstephens

We are noticing the GPU usage drops between iterations in Linux too.

[Attachment: iterations-linux.jpg]

This is of course understandable and expected behavior (loading each batch into GPU memory between iterations). It's the loading of the batch images that seems to cause it.

But I was thinking perhaps there is a better way... For example, running two parallel streams of work, each with half the batch size, so that one batch can be loaded while the other is being processed - keeping GPU usage up between iterations.
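
A simpler variant of the same idea would be a small producer/consumer buffer: a background thread keeps the next batch ready so the GPU never sits idle waiting for it. A rough sketch (load_batch is a placeholder for whatever actually builds a batch, not faceswap's real loader):

Code:

import queue
import threading

def start_prefetcher(load_batch, depth=2):
    """Build batches in a background thread so the next one is already
    waiting when the current training step finishes.

    load_batch is a placeholder for whatever actually builds a batch;
    depth is how many batches to keep buffered ahead of the GPU.
    """
    buf = queue.Queue(maxsize=depth)

    def worker():
        while True:
            buf.put(load_batch())   # blocks when the buffer is full

    threading.Thread(target=worker, daemon=True).start()
    return buf.get                  # call this to fetch the next ready batch

Each training iteration would then call the returned function instead of building a batch itself, so batch preparation overlaps with the GPU compute.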

My point is simply that the GPU could be utilized more, especially with the new-gen 30XX series. We want to keep the GPUs at 100% processing capacity.

Just a thought - you are the experts and I'm sure you'll think of a solution :) We're going to send our third donation to the project via PayPal tomorrow.

Many thanks,

Ian