Training speed tapers off after two minutes

If training is failing to start, and you are not receiving an error message telling you what to do, tell us about it here


Forum rules

Read the FAQs and search the forum before posting a new topic.

This forum is for reporting errors with the Training process. If you want to get tips, or better understand the Training process, then you should look in the Training Discussion forum.

Please mark any answers that fixed your problems so others can find the solutions.

Locked
User avatar
jcarl
Posts: 13
Joined: Sun Nov 13, 2022 6:41 am
Has thanked: 4 times
Been thanked: 1 time

Training speed tapers off after two minutes

Post by jcarl »

Lately I've been having an issue with training performance that I'm struggling to isolate.

Training (DNY256 batch-26) starts out strong @ 100 eg/s, but after ~ 2 minutes, the speed gradually decreases to about 55-60 eg/s, where it stays permanently. Along with the performance drop, GPU power usage gradually decreases, starting around 170W and stabilizing around 110W. Temps reach ~ 70C, but stabilize at ~ 65C. Restarting the training session restores full performance, but only for about 2 minutes at a time.

At first I thought the GPU was throttling, but the slowdown doesn't happen with an older model I used before (DFL-SAE standalone batch-16). With DFL-SAE, performance is consistent, as are GPU power draw and temps, which stabilize at higher levels @ 180W and 75C.

Am I hitting some kind of bottleneck when training with DNY-256? I feel like this is a relatively recent occurrence.

GPU is a 3060ti, and data is stored on an NVME drive.

Has anyone encountered anything like this?

User avatar
bryanlyon
Site Admin
Posts: 793
Joined: Fri Jul 12, 2019 12:49 am
Answers: 44
Location: San Francisco
Has thanked: 4 times
Been thanked: 218 times
Contact:

Re: Training speed tapers off after two minutes

Post by bryanlyon »

It's really hard to say. It's very possible that it's your CPU being the bottleneck, or even the PCI-E interface. When training, Faceswap has to load the images, warp them, augment them, and generally prepare them for the GPU before transferring them over.

DNY256 is a high resolution model and we're having to transfer not just the images across, but also masks and other data. This takes a lot of CPU, memory, and GPU bandwidth all together at once. It's quite possible that during startup the CPU is able to prepare enough data that after the initial load the GPU is actually just playing catchup for those 2 minutes until it is running faster than the rest of the system can send it data to keep it busy.

A lot of work has been done to make every stage of Faceswap as fast as possible, but with infinite variations of hardware it's impossible to say for sure what could be at fault for something like you're describing. In general, we suggest going with "long run" eg/s as your benchmark, not the first (or last) 2 minutes of a training session.
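To illustrate the "playing catchup" idea, here's a toy producer/consumer sketch (not Faceswap code; the buffer size and rates are made-up numbers). A prefetch buffer filled before training starts lets the "GPU" run at full speed at first; once the buffer drains, throughput settles at whatever rate the "CPU" can feed it:

```python
import queue
import threading
import time

# Toy model: CPU "producer" prepares batches slower than the
# GPU "consumer" trains on them. The pre-filled buffer masks the
# gap at startup; once it drains, speed drops to the CPU's rate.
PREFETCH = 20    # batches buffered before training starts
CPU_RATE = 0.02  # seconds to prepare one batch (~50 batches/s)
GPU_RATE = 0.01  # seconds to train one batch (~100 batches/s)
TOTAL = 60

buf = queue.Queue()
for _ in range(PREFETCH):  # buffer is pre-filled at startup
    buf.put(object())

def producer():
    for _ in range(TOTAL - PREFETCH):
        time.sleep(CPU_RATE)  # simulate image load/warp/augment
        buf.put(object())

threading.Thread(target=producer, daemon=True).start()

start = time.monotonic()
early_rate = None
for i in range(1, TOTAL + 1):
    buf.get()             # blocks once the buffer is empty
    time.sleep(GPU_RATE)  # simulate the GPU training step
    if i == PREFETCH:     # measure speed while the buffer holds data
        early_rate = i / (time.monotonic() - start)

overall_rate = TOTAL / (time.monotonic() - start)
print(f"early: {early_rate:.0f} batches/s, overall: {overall_rate:.0f} batches/s")
```

The early rate tracks the GPU; the overall rate sags toward the producer's rate, which is the same shape as the slowdown described above.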

User avatar
jcarl
Posts: 13
Joined: Sun Nov 13, 2022 6:41 am
Has thanked: 4 times
Been thanked: 1 time

Re: Training speed tapers off after two minutes

Post by jcarl »

Thanks for the reply. The fact that this might be indicative of a bottleneck gives me a place to start.

Hardware wise, I'm running a relatively new i7-12700k with 32GB of DDR4 3600 memory. CPU utilization itself is low, running about 14% during training, but I'm not sure about memory bandwidth. I wonder though if I should have opted for a DDR5 setup.
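One thing I realize about that 14% figure, doing some back-of-the-envelope math (the thread count is just the 12700K's spec): an average spread over every hardware thread can hide a single fully saturated thread, which would be enough to starve the GPU if some pipeline stage is single-threaded:

```python
# A 12700K exposes 20 hardware threads (8P + 4E cores), so one
# thread pegged at 100% shows up as only ~5% average utilization.
threads = 20
one_thread_pegged = 100 / threads
print(f"one busy thread = {one_thread_pegged:.0f}% average CPU")
```

So the low average alone may not rule out a CPU bottleneck; per-core utilization would tell more.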

User avatar
jcarl
Posts: 13
Joined: Sun Nov 13, 2022 6:41 am
Has thanked: 4 times
Been thanked: 1 time

Re: Training speed tapers off after two minutes

Post by jcarl »

Problem solved. The low/tapering performance was caused by using vgg-clear for the training mask. After switching to bisenet-fp, performance has gone back up to 110 eg/s (sustained over 14 hours so far).

User avatar
torzdf
Posts: 2670
Joined: Fri Jul 12, 2019 12:53 am
Answers: 159
Has thanked: 131 times
Been thanked: 625 times

Re: Training speed tapers off after two minutes

Post by torzdf »

Very strange indeed. I'm not sure cause=effect here, as the type of mask shouldn't matter. They are all just 1's and 0's, and are loaded in the same way.

My word is final

Locked