Lately I've been having an issue with training performance that I'm struggling to isolate.
Training (DNY256, batch size 26) starts out strong at ~100 eg/s, but after about 2 minutes the speed gradually drops to 55-60 eg/s and stays there for the rest of the session. Along with the performance drop, GPU power draw gradually falls from around 170W to a stable ~110W. Temps peak at ~70C and settle around 65C. Restarting the training session restores full speed, but only for another ~2 minutes.
At first I thought the GPU was throttling, but the slowdown doesn't happen with an older model I used before (DFL-SAE standalone, batch size 16). With DFL-SAE, performance is consistent, and GPU power draw and temps stabilize at higher levels (~180W and ~75C).
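In case it helps, this is the kind of telemetry I've been meaning to log alongside training to tell actual throttling apart from the GPU just sitting starved for data. It's only a rough sketch using pynvml (pip install nvidia-ml-py), run in a separate terminal; GPU index 0 and the 5-second poll interval are assumptions, not anything from DFL itself:

```python
# Side monitor: log utilization, power, temp, SM clock, and NVML throttle
# reasons while training runs. GPU index 0 assumed; adjust as needed.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Readable names for the NVML throttle-reason bits we care about.
REASONS = {
    pynvml.nvmlClocksThrottleReasonSwPowerCap: "power cap",
    pynvml.nvmlClocksThrottleReasonSwThermalSlowdown: "sw thermal",
    pynvml.nvmlClocksThrottleReasonHwSlowdown: "hw slowdown",
    pynvml.nvmlClocksThrottleReasonGpuIdle: "idle",
}

try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
        mask = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)
        active = [name for bit, name in REASONS.items() if mask & bit] or ["none"]
        print(f"util={util.gpu}%  power={power:.0f}W  temp={temp}C  "
              f"sm={sm_clock}MHz  throttle={','.join(active)}")
        time.sleep(5)
finally:
    pynvml.nvmlShutdown()
```

My thinking is that if the throttle column stays "none" while utilization and power fall together, the GPU isn't throttling at all, it's being starved upstream.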
Am I hitting some kind of bottleneck when training with DNY256? This feels like a relatively recent development.
The GPU is a 3060 Ti, and the data is stored on an NVMe drive.
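To rule out the storage side, I figure a quick read-throughput check over the dataset folder should be enough. Another rough sketch; the path and file extension below are hypothetical placeholders for wherever the training faces actually live:

```python
# Quick-and-dirty read-throughput check on the training data folder.
# DATA_DIR is a hypothetical path; point it at the real dataset directory.
import time
from pathlib import Path

DATA_DIR = Path("workspace/data_src/aligned")  # hypothetical path

files = sorted(DATA_DIR.glob("*.jpg"))[:2000]
start = time.perf_counter()
total = sum(len(f.read_bytes()) for f in files)
elapsed = time.perf_counter() - start
print(f"read {len(files)} files, {total / 1e6:.1f} MB "
      f"in {elapsed:.2f}s ({total / 1e6 / elapsed:.0f} MB/s)")
```

If a cold run of that still shows hundreds of MB/s, the NVMe drive presumably isn't the bottleneck.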
Has anyone encountered anything like this?