Training speed over time / Training speed on resume

Want to understand the training process better? Got tips for which model to use and when? This is the place for you


Forum rules

Read the FAQs and search the forum before posting a new topic.

This forum is for discussing tips and understanding the process involved with Training a Faceswap model.

If you have found a bug or are having issues with the Training process not working, then you should post in the Training Support forum.

Please mark any answers that fixed your problems so others can find the solutions.

Locked
JimmyBoy
Posts: 8
Joined: Fri Jun 05, 2020 4:01 pm
Answers: 0

Training speed over time / Training speed on resume

Post by JimmyBoy »

I have a question regarding training speed over time. I believe it is normal for training to slow down over time, which is what I am experiencing: when starting off I get around 120 EG/s, and after about 48 hours this has dropped to 40 EG/s.

However... it confuses me that if I stop training and then resume, the EG/s goes back up to about 120 and it appears to still be learning. I would have expected it to continue at about 40 EG/s on resuming (what it was at when I stopped training).

I'm sure it's me not understanding something correctly, so I would appreciate some insight from someone more knowledgeable than me. I have attached an image of the EG/s after 48 hours and then on resuming, and a graph of the training loss, which shows it still appears to be learning, but much faster than when I stopped it (120 vs 40 EG/s).

On a complete side note - do these EG/s numbers seem about right for a 1070Ti?

Thanks in advance!

[Attached images: EG/s readout after 48 hours and on resume; training loss graph]

bryanlyon
Site Admin
Posts: 805
Joined: Fri Jul 12, 2019 12:49 am
Answers: 44
Location: San Francisco
Has thanked: 4 times
Been thanked: 224 times

Re: Training speed over time / Training speed on resume

Post by bryanlyon »

It looks to me like it may be a thermal issue: possibly your GPU or CPU is thermal throttling down to a lower speed over time, and when allowed to cool it can go back to full speed. If so, you may need to add better cooling (one of the reasons we don't recommend training on laptops).
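If you want to confirm whether the card is throttling, one option is to poll it with NVIDIA's NVML bindings and watch the temperature, load and clocks over time. A minimal sketch, assuming the nvidia-ml-py (pynvml) package is installed (it is not part of Faceswap):

import time
import pynvml

# Poll the first GPU once a second and watch whether the SM clock drops
# as the temperature climbs towards the card's thermal limit.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    while True:
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
        clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
        print(f"temp {temp}C | load {util}% | SM clock {clock}MHz")
        time.sleep(1)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()

GPU-Z or Afterburner will show you the same information graphically if you prefer a GUI.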

JimmyBoy
Posts: 8
Joined: Fri Jun 05, 2020 4:01 pm
Answers: 0

Re: Training speed over time / Training speed on resume

Post by JimmyBoy »

Hmmmmm, so this is not normal?

It's a Dell R720 server with a GTX 1070Ti blower, so it should have sufficient cooling. Will investigate.

JimmyBoy
Posts: 8
Joined: Fri Jun 05, 2020 4:01 pm
Answers: 0

Re: Training speed over time / Training speed on resume

Post by JimmyBoy »

Thanks for pointing this out to me. After some investigating, it appears the GPU is hitting its thermal limit; the card seems to have quite a relaxed fan profile by default, with only 58% fan speed at 100% GPU usage.

I've modified the fan curve so the card is not hitting its thermal limit, and will also make the R720 system fans more aggressive to aid in cooling the GPU.

Will see how it is performing in 24 hours. Thanks for pointing me in the right direction.

JimmyBoy
Posts: 8
Joined: Fri Jun 05, 2020 4:01 pm
Answers: 0

Re: Training speed over time / Training speed on resume

Post by JimmyBoy »

Thought I would follow up with some findings for future reference and anyone reading this post.

Using the original model with a batch size of 64, I was finding the EG/s would slow from 120 to 40. After the advice above and installing GPU-Z and Afterburner, I noticed my GPU was hitting its thermal limit of 83°C.

GPU-Z would show the below; note the 'GPU Load' is pretty much always 100%, except every 100 iterations when Faceswap saves the model.
[Attached image: GPU-Z sensors at 100% GPU load]

After approx. 2 hours, iterations would slow quite significantly, and the GPU load in GPU-Z would show lots of peaks and troughs, like below.
[Attached image: GPU-Z graph showing GPU load peaks and troughs]

So... I tried to resolve this by addressing any potential cooling issues: first by setting the GPU fan to 100% (made no difference), then by cranking the R720 system fans up to max, which sounds like a jet taking off and keeps the GPU at a relatively consistent 60°C at 100% load (23°C below its thermal limit of 83°C). Same result: I still get the GPU load issues after about 2 hours.

Below is a snapshot of Afterburner after about 2 hours of use (5-second snapshot intervals); you can see the max GPU temperature is 63°C, which I would think is fine. Also... you can see the temperature stays roughly the same, but the GPU load starts peaking and troughing, increasing in frequency over time.
[Attached image: Afterburner graphs after ~2 hours]

This would suggest to me that thermals are not the cause of this performance degradation. As an experiment, I tried changing the batch size from 64 to 8; I have now been running for 21 hours and am seeing no performance degradation. The graph below shows the GPU load is constant after 21 hours.
[Attached image: GPU load graph after 21 hours at batch size 8]

This machine has a 1070Ti 8GB, 24 virtual cores (Xeon E5-2650 v2 @ 2.6GHz) and 32GB RAM; max memory used is about 12GB.

I don't know if this is a quirk, feature or bug of the original trainer, but I'm not too concerned as I think I will switch to a different trainer; I just thought I would share my observations. If anyone knows of something I have done wrong, I would be glad to be made aware.

Thanks

bryanlyon
Site Admin
Posts: 805
Joined: Fri Jul 12, 2019 12:49 am
Answers: 44
Location: San Francisco
Has thanked: 4 times
Been thanked: 224 times

Re: Training speed over time / Training speed on resume

Post by bryanlyon »

This won't change if you switch to a different model; they share almost all code. I'm not sure why yours is still slowing down, but it's never happened to me (and I've never heard of it happening to anyone else), so I'm betting it's something about your specific system or setup. Unfortunately, that makes it very difficult to diagnose, since you're really the only person who can test things. I'd suggest trying some more tweaks: maybe try reinstalling, or eliminating variables. For example, if all your data is on a spinning hard drive, try an SSD; if you have enough RAM, try a ramdisk (see the sketch below). Things like that, to see if you can narrow down when and where the problem shows up.
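If you want to try the ramdisk idea on Linux without setting up a separate mount, something like the rough sketch below is enough; /dev/shm is a RAM-backed tmpfs on most distros, and the source paths are just placeholders for your own extracted-face folders:

import shutil
from pathlib import Path

# Copy the extracted faces into /dev/shm so reads never touch the disk,
# then point the trainer's input folders at the copies.
SOURCES = [Path("/data/faces_a"), Path("/data/faces_b")]  # hypothetical paths
RAMDISK = Path("/dev/shm/faceswap")

for src in SOURCES:
    dst = RAMDISK / src.name
    if not dst.exists():
        shutil.copytree(src, dst)
    print(f"Copied {src} -> {dst}")

If the slowdown disappears with the data in RAM, you have narrowed the problem down to disk I/O.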

JimmyBoy
Posts: 8
Joined: Fri Jun 05, 2020 4:01 pm
Answers: 0

Re: Training speed over time / Training speed on resume

Post by JimmyBoy »

Noted, thanks for the advice.

Why would the issue essentially go away when changing the batch size from 64 to 8? Does this give any clue? To me it would suggest some kind of memory issue, maybe the graphics card is getting clogged up over time, although the memory usage is always at about 7.2GB of the 8GB.

Might try a few different batch sizes (48, 32, 24 and 16) to see where this issue goes away.

torzdf
Posts: 2796
Joined: Fri Jul 12, 2019 12:53 am
Answers: 160
Has thanked: 142 times
Been thanked: 650 times

Re: Training speed over time / Training speed on resume

Post by torzdf »

TensorFlow will grab all available VRAM, regardless of whether it needs it all (hence the usage remaining the same).

You can enable "Allow Growth" to get a slightly better idea of actual VRAM usage.
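For reference, that option just turns on TensorFlow's memory-growth behaviour, so VRAM is allocated as needed rather than grabbed up front. A minimal standalone sketch of the equivalent (TensorFlow 2.x API shown here, outside of Faceswap; the "Allow Growth" checkbox handles this for you):

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving it all at start-up.
# Must be called before anything is placed on the GPU.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)

With that enabled, tools like GPU-Z report something much closer to the model's real VRAM usage.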

My word is final

Locked