Locating Bottleneck between Training Iterations

If training is failing to start, and you are not receiving an error message telling you what to do, tell us about it here.


Forum rules

Read the FAQs and search the forum before posting a new topic.

This forum is for reporting errors with the Training process. If you want to get tips, or better understand the Training process, then you should look in the Training Discussion forum.

Please mark any answers that fixed your problems so others can find the solutions.

ianstephens
Posts: 67
Joined: Sun Feb 14, 2021 7:20 pm
Has thanked: 9 times

Locating Bottleneck between Training Iterations

Post by ianstephens »

We are noticing what appears to be some kind of bottleneck between training iterations.

We've got a brand new 3090 FE installed in our test machine. When running the StoJo model, for example, we are noticing peaks and troughs in GPU usage between iterations.

As an iteration runs, GPU usage peaks to ~90%+ and then drops back before peaking again at the next iteration. Of course, there will be a slight delay between iterations, but this seems excessive.

[Attachment: iterations.png]

Where is the bottleneck? Our CPU is mostly idle during training - it's not taxed at all.

Is there any way to reduce the delay between iterations? Essentially, the GPU could be pushed a lot harder, but we can't work out where the delay is coming from. All of the training images are in the system file cache (RAM), so it isn't even reading from disk.
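To put a number on the gaps, one option is to log utilization samples (e.g. by polling nvidia-smi once per second) and measure how long the GPU stays below a threshold between peaks. A minimal sketch; the trace here is synthetic, and the threshold is an assumption:

```python
# Hypothetical sketch: measure how long the GPU sits idle between
# iteration peaks. Real samples could come from polling
# `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`
# once per tick; here we just analyze a list of percentage samples.

def idle_gaps(samples, threshold=20):
    """Return lengths (in ticks) of runs where utilization < threshold."""
    gaps, run = [], 0
    for util in samples:
        if util < threshold:
            run += 1
        elif run:
            gaps.append(run)
            run = 0
    if run:
        gaps.append(run)
    return gaps

# Synthetic trace: two busy peaks separated by idle periods.
trace = [5, 90, 95, 92, 3, 2, 4, 88, 91, 1, 0]
print(idle_gaps(trace))  # -> [1, 3, 2]
```

With a real trace, comparing the gap lengths at different batch sizes would show exactly how the delay scales.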

Any advice or pointers are greatly appreciated.



Re: Locating Bottleneck between Training Iterations

Post by ianstephens »

Update/Test:

Lowering the batch size from 21 to 6 gives much faster cycling through iterations, with hardly any delay between them.

It still doesn't max out the GPU (also confirmed using nvidia-smi), but it cycles through faster.

[Attachment: batch.png]

So higher batch sizes create a longer delay between iterations. But why? Where is the hold-up?

Perhaps the delay is the time it takes to send the entire batch of images across the system into the GPU for each iteration? Maybe our hardware is holding us back, perhaps CPU clock speed? We have an abundance of cores (2x 12-core Xeon), but the clock speed is only 2.7 GHz.
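For scale, here is a back-of-the-envelope estimate of the raw host-to-GPU copy time. The image size (256x256 RGB float32) and the effective PCIe 3.0 x16 bandwidth (~12 GB/s) are assumptions, not measurements:

```python
# Back-of-envelope sketch (all numbers are assumptions, not measured):
# how long should the host-to-GPU copy of one batch actually take?

IMG_BYTES = 256 * 256 * 3 * 4   # assumed 256x256 RGB float32 image
PCIE_BW = 12e9                  # ~12 GB/s effective, PCIe 3.0 x16

def copy_ms(batch_size):
    """Estimated transfer time for one batch, in milliseconds."""
    return batch_size * IMG_BYTES / PCIE_BW * 1e3

print(round(copy_ms(21), 2))  # -> 1.38 (ms, batch of 21)
print(round(copy_ms(6), 2))   # -> 0.39 (ms, batch of 6)
```

If the raw copy really is in the low-millisecond range, a delay this noticeable would more likely come from CPU-side batch preparation (decoding/augmentation), which also grows with batch size.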


torzdf
Posts: 1495
Joined: Fri Jul 12, 2019 12:53 am
Answers: 127
Has thanked: 51 times
Been thanked: 287 times

Re: Locating Bottleneck between Training Iterations

Post by torzdf »

I will need to look into and test this. My initial assumption, though, would be memory copies.

My word is final



Re: Locating Bottleneck between Training Iterations

Post by ianstephens »

Agreed - getting the batch of images to the GPU seems to be the slow part. Perhaps there are different/faster methods that could be coded; I'm no expert.

Thank you for taking a look into this. It would be super if we could make full use of the GPU and really heat things up with the crunching.

Here is some additional graphing from our other monitoring software. You can see the drops/peaks between the iterations.

[Attachment: graphing-new.jpg]


Re: Locating Bottleneck between Training Iterations

Post by ianstephens »

We are noticing the GPU usage drops between iterations in Linux too.

[Attachment: iterations-linux.jpg]

This is of course expected behavior to some extent (each batch has to be loaded into GPU memory between iterations), and it's the loading of the batch images that seems to cause it.

But I was thinking there may be a better way. Perhaps running several GPU threads (for example two, each with half the batch size) so that GPU usage is sustained across iterations.
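As an illustration of the idea (not faceswap's actual code), the usual pattern is to prepare the next batch on a background thread while the current one trains, so loading overlaps with compute. A toy sketch, with sleeps standing in for the real work:

```python
# Illustrative sketch (not faceswap's actual code): prepare the next
# batch on a background thread while the "GPU" works on the current
# one, so per-batch loading overlaps with compute.
import queue
import threading
import time

def producer(q, n_batches):
    for i in range(n_batches):
        time.sleep(0.01)   # stand-in for loading/augmenting a batch
        q.put(i)
    q.put(None)            # sentinel: no more batches

def train(n_batches=5):
    q = queue.Queue(maxsize=2)  # buffer at most 2 batches ahead
    threading.Thread(target=producer, args=(q, n_batches),
                     daemon=True).start()
    done = []
    while (batch := q.get()) is not None:
        time.sleep(0.01)   # stand-in for the GPU's training step
        done.append(batch)
    return done

print(train())  # -> [0, 1, 2, 3, 4]
```

With this overlap, the consumer only ever waits for the first batch; after that, loading hides behind the training step instead of adding to it.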

My thought is simply that the GPU could be utilized more, especially with the new-gen 30XX series. We want to keep the GPUs at 100% processing capacity.

Just a thought; you are the experts and I'm sure you'll think of a solution :) We're going to send our third donation to the project via PayPal tomorrow.

Many thanks,

Ian

