lost several days of work to a bug

korrupt78
Posts: 50
Joined: Wed Jan 29, 2020 1:34 am
Has thanked: 2 times
Been thanked: 1 time

Post by korrupt78 »

Twice now I've lost several days of work when my model was corrupted by an improperly handled memory allocation error while I was sleeping. Training kept running long enough for the corrupted model to be saved over the backup. I don't know why the error is happening, and I don't know why faceswap doesn't notice the corruption; instead it happily goes ahead and ruins the backup copy an hour or two later.

This is very frustrating.

I should mention I'm using the latest version of faceswap on Ubuntu 19.10 and a Gigabyte GeForce GTX 960 w/4GB GDDR5.

Attachments
Screenshot from 2020-02-04 00-22-33.png (110.86 KiB)

Post by korrupt78 »

I just noticed the 7_snapshot_25001_iters directory and managed to recover from that, so I only get set back to 25001 rather than 1, which is still a huge loss from 41807. (Almost a day.)

Luckily I have a Gigabyte GeForce RTX 2060 Super w/8GB GDDR6 arriving tomorrow which will hopefully run faster.

torzdf
Posts: 2649
Joined: Fri Jul 12, 2019 12:53 am
Answers: 159
Has thanked: 128 times
Been thanked: 622 times

Post by torzdf »

The backup only runs when the average loss for a save session is lower than the lowest average loss so far.

In all instances I have seen, this is sufficient. In your case it looks like the loss has gone into negative figures, which I have literally never seen before and did not even know was possible.
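
To illustrate why a negative loss defeats the rule described above, here is a minimal sketch. It is not faceswap's actual code: the function names, the guard, and the sample values are all hypothetical. It shows that a corrupted session with negative losses always looks like a "new best" and so triggers a backup, and how a simple sanity check could refuse it.

```python
import math


def should_backup(session_losses, lowest_avg_so_far):
    """Return (do_backup, new_lowest) for the rule: back up only when
    this session's average loss beats the lowest average seen so far.
    Hypothetical helper; expects a non-empty list of losses."""
    avg = sum(session_losses) / len(session_losses)
    do_backup = avg < lowest_avg_so_far
    return do_backup, (avg if do_backup else lowest_avg_so_far)


def should_backup_guarded(session_losses, lowest_avg_so_far):
    """Same rule, but refuse to back up when any loss looks invalid.

    Assumption: the loss function in use cannot legitimately go below
    zero, so a negative or non-finite value signals corruption.
    """
    if any((not math.isfinite(x)) or x < 0 for x in session_losses):
        return False, lowest_avg_so_far
    return should_backup(session_losses, lowest_avg_so_far)


healthy = [0.031, 0.030, 0.029]   # illustrative values only
corrupt = [0.029, -3.5, -120.0]   # illustrative values only

print(should_backup(corrupt, 0.030)[0])          # True: backup gets overwritten
print(should_backup_guarded(corrupt, 0.030)[0])  # False: backup is left alone
print(should_backup_guarded(healthy, 0.035)[0])  # True: a normal improvement
```

The guard only makes sense for losses that are bounded below by zero; with a loss that can legitimately be negative it would block valid backups.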

Snapshots will still be valid, so you can always restore from a snapshot (assuming you have one).

Model corruption (for most models) tends to be a sign of a hardware issue (e.g. overclocking or power).


Edit: just saw your edit re: snapshots, so glad you found it.

Please do let us know if this issue persists with your new card.

My word is final

Post by korrupt78 »

torzdf wrote: Tue Feb 04, 2020 8:56 am

Model corruption (for most models) tends to be a sign of a hardware issue (e.g. overclocking or Power).
...
Please do let us know if this issue persists with your new card.

Well, I still want to use the old card in addition to the new one, so I guess I need to figure this out.

I'm not overclocking or doing anything fancy, and I believe I should have more than enough power. That desktop has a 500W power supply. The recommended power supply for the GTX 960 is 400W. The only other components drawing real power would be the CPU (65W) and a couple of hard drives. I even have the monitors off, since I'm running jobs remotely via SSH.

So how could power be an issue under these circumstances? And is there a rational way to actually diagnose that possibility in a definitive manner?
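
A rough power budget can at least put numbers on the question above. The sketch below is only an estimate under stated assumptions: the 500W supply and 65W CPU come from the post, while the GPU board power (120W, the reference figure for a GTX 960; factory-overclocked cards may draw more), the drive and motherboard allowances, and the 80% derating rule of thumb are all assumptions. It also assumes the 400W figure is a recommendation for the whole system's supply rather than the card's own draw.

```python
def psu_headroom(psu_watts, component_watts, max_load_fraction=0.8):
    """Return (estimated_draw, usable_capacity, headroom) in watts.

    max_load_fraction is a rule-of-thumb derating (assumption), since
    supplies are usually not run at their full rated output.
    """
    total = sum(component_watts.values())
    usable = psu_watts * max_load_fraction
    return total, usable, usable - total


components = {
    "gpu_gtx_960": 120,     # assumption: reference board power; check your card
    "cpu": 65,              # from the post
    "hard_drives": 20,      # assumption: about 10 W each for two drives
    "board_ram_fans": 50,   # assumption: rough allowance
}

total, usable, headroom = psu_headroom(500, components)
print(f"estimated draw {total} W, usable {usable:.0f} W, headroom {headroom:.0f} W")
# estimated draw 255 W, usable 400 W, headroom 145 W
```

An estimate like this cannot rule out an ageing or faulty supply; it only shows whether the rated capacity is plausible for the listed parts.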

bryanlyon
Site Admin
Posts: 793
Joined: Fri Jul 12, 2019 12:49 am
Answers: 44
Location: San Francisco
Has thanked: 4 times
Been thanked: 218 times

Post by bryanlyon »

A 500W supply with a card that recommends 400W is pretty tight. I'd suggest underclocking the card slightly to ensure that you're not going over. Also, make sure that your GPU isn't "factory overclocked", as even that is likely to be unstable when you're doing compute tasks.

Post by korrupt78 »

Happened again. (screenshot attached)

It's always a memory allocation error. Is there a specific reason to think this is power-related instead of memory-related? How do I determine how much memory is enough? (As mentioned above, this is happening on a GTX 960 w/4GB.)

Attachments
Screenshot from 2020-02-04 19-04-07.png (84.9 KiB)

Post by torzdf »

Wait, what? You're running the AMD version of the code???

CL_MEMORY_ALLOCATION is an out of memory error from OpenCL. Only the AMD version uses OpenCL.

Please post the output of "Help > Output System Info"
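
Since the error is an out of memory condition, it can also help to watch how close the card is to its memory limit while training runs. The sketch below is one possible way to do that on an Nvidia card, assuming the Nvidia driver is installed and `nvidia-smi` is on the PATH; the helper names are hypothetical and this is not part of faceswap. It queries name, used and total memory, and temperature, and prints the remaining headroom.

```python
import csv
import io
import shutil
import subprocess

QUERY = "name,memory.used,memory.total,temperature.gpu"


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts.
    Rows that are malformed or contain non-numeric fields are skipped."""
    rows = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) != 4:
            continue
        name, used, total, temp = row
        try:
            rows.append({
                "name": name,
                "used_mib": int(used),
                "total_mib": int(total),
                "temp_c": int(temp),
            })
        except ValueError:
            continue  # e.g. a field the card does not report
    return rows


def read_gpus():
    """Query nvidia-smi once; return [] if it is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY}",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        return []
    return parse_gpu_csv(result.stdout)


for gpu in read_gpus():
    free = gpu["total_mib"] - gpu["used_mib"]
    print(f'{gpu["name"]}: {gpu["used_mib"]}/{gpu["total_mib"]} MiB used, '
          f'{free} MiB free, {gpu["temp_c"]} C')
```

Running it periodically during training (for example from a second SSH session) shows whether usage creeps towards the card's total before a crash.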

Post by korrupt78 »

torzdf wrote: Wed Feb 05, 2020 10:17 am

Wait, what? You're running the AMD version of the code???

CL_MEMORY_ALLOCATION is an out of memory error from OpenCL. Only the AMD version uses OpenCL.

Please post the output of "Help > Output System Info"

Ok, that explains a couple things. Sorry, I'm very new to all of this - I haven't even owned a video card in almost two decades because I'm not a gamer and usually make do with the on-board video. :)

My friend sold me a used GTX 960 so I could try out the faceswap app. I installed it in an unused desktop a few weeks ago, and began running jobs on it over SSH using the command line. (faceswap.py and tools.py) In fact, I've never run the GUI until just now, in order to output system info.

Anyway, when I installed faceswap that first time, I think I enabled AMD support because I didn't know that was bad. Earlier today I decided to wipe and do a fresh Ubuntu install on it and make it my new primary desktop, and this time I skipped AMD support, so hopefully that problem won't happen again.

I've attached a text file with the system info from the GUI. Hopefully I've got everything configured correctly now.

Attachments
sysinfo.txt (13.25 KiB)

Post by korrupt78 »

Follow up: I've just determined that neither my motherboard (ASRock Fatal1ty B450 Gaming-ITX/AC) nor my CPU (AMD Ryzen 5 1600) has on-board video, so I must have been using Nvidia the whole time, even though I accidentally enabled AMD support the first time I installed faceswap.
