Array Memory Error



Forum rules

Read the FAQs and search the forum before posting a new topic.

This forum is for discussing tips and understanding the process involved with Training a Faceswap model.

If you have found a bug, or the training process is not working, then you should post in the Training Support forum.

Please mark any answers that fixed your problems so others can find the solutions.

MattB
Posts: 22
Joined: Fri Aug 19, 2022 4:54 pm
Been thanked: 5 times

Array Memory Error

Post by MattB

I'm getting a consistent error when training. The model is Phaze-A/DNY-512 with a batch size of 2, running on mirrored RTX 3090s with 64 GB of system RAM. I no longer have a second rig, so I can't check whether the problem is specific to this machine.

The error in the crash report is "numpy.core.exceptions.ArrayMemoryError: Unable to allocate 323. MiB for an array with shape (7, 580, 4, 3, 580, 3) and data type float32", even though I should have more than enough resources. In the command line window I receive the error "libpng error: Read Error" over and over before the out-of-memory error occurs.
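As a quick sanity check on the numbers in that error, the sketch below recomputes the size of the requested array from the quoted shape and dtype (float32 is 4 bytes per element). It comes out at about 323 MiB, a small request next to 64 GB, which suggests the allocation failed because memory was already close to exhausted rather than because this one array was unusually large.

```python
import math

# Shape and dtype quoted in the crash report.
SHAPE = (7, 580, 4, 3, 580, 3)
FLOAT32_BYTES = 4  # a float32 element occupies 4 bytes

elements = math.prod(SHAPE)        # total number of array elements
nbytes = elements * FLOAT32_BYTES  # bytes NumPy tried to allocate
mib = nbytes / 2**20               # NumPy reports the size in MiB (2**20 bytes)

print(f"{elements:,} elements -> {nbytes:,} bytes -> {mib:.1f} MiB")
# prints "84,772,800 elements -> 339,091,200 bytes -> 323.4 MiB"
```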

Typically the model runs with about 14 GB of system RAM in use. However, usage creeps up over roughly 36 hours until an out-of-memory error finally stops training. On the upside, if I exit Faceswap and restart, it recovers gracefully and training runs for another 36-48 hours before hitting the error again.
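The figures in that paragraph allow a rough estimate of how fast memory is growing. The sketch below assumes the growth is roughly linear and treats the full 64 GB as the ceiling, which overstates the real headroom because the OS and other processes also need RAM. The helper names are hypothetical. Fitting a least-squares line to memory readings you log yourself (for example, hourly readings from Task Manager) would show whether usage climbs steadily, as a leak would, or levels off.

```python
def growth_rate(samples):
    """Least-squares slope of memory (GB) against time (hours).

    samples: list of (hours_since_start, memory_gb) pairs that you record
    yourself (hypothetical input; Faceswap does not produce this log).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples need at least two distinct times")
    return cov / var  # GB per hour


def hours_until(limit_gb, current_gb, rate_gb_per_hour):
    """Hours left before reaching limit_gb, assuming constant growth."""
    if rate_gb_per_hour <= 0:
        return float("inf")  # flat or shrinking usage never hits the limit
    return (limit_gb - current_gb) / rate_gb_per_hour


# Rough figures from the post: ~14 GB at start, failure after ~36 hours.
# 64 GB is an upper bound on the usable limit, so the real rate may be lower.
rate = growth_rate([(0.0, 14.0), (36.0, 64.0)])
print(f"~{rate:.2f} GB/hour")
```

With only the two endpoints from the post this gives about 1.39 GB per hour. With real hourly readings, `hours_until(64.0, latest_gb, growth_rate(readings))` would give a rough window before the next failure.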

Normally I would suspect a memory leak of some sort, but given the technology stack I have no idea how to troubleshoot it at that level. Does anyone have an idea what could be causing this? Truth is, since it's just a matter of restarting, I'm inclined to simply restart it once a day. But perhaps it's something simple that I just can't grok.
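If restarting is the practical workaround, it can be automated. The sketch below is a generic relaunch loop built on Python's standard `subprocess` module. It assumes the training process exits with a non-zero code when it crashes, and relies on the graceful recovery on restart described in the post. The function `run_with_restarts` and its parameters are hypothetical and not part of Faceswap.

```python
import subprocess
import sys
import time


def run_with_restarts(cmd, max_restarts=5, delay_s=60):
    """Run cmd, relaunching it after a non-zero exit up to max_restarts times.

    Returns the final exit code (0 if the process eventually exits cleanly).
    """
    restarts = 0
    while True:
        code = subprocess.run(cmd).returncode
        if code == 0 or restarts >= max_restarts:
            return code
        restarts += 1
        # Pause so memory is released before the next launch.
        print(f"exit code {code}; restart {restarts}/{max_restarts} in {delay_s}s",
              file=sys.stderr)
        time.sleep(delay_s)
```

To use it, pass the command line you normally use to start training, split into a list of arguments. Capping the number of restarts keeps a persistent error, such as a bad configuration, from looping forever.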

torzdf
Posts: 2651
Joined: Fri Jul 12, 2019 12:53 am
Answers: 159
Has thanked: 129 times
Been thanked: 622 times

Re: Array Memory Error

Post by torzdf

If it is creeping up and ultimately running out of memory, then yes, I would tend to agree.

Unfortunately, this is not an issue I have seen before, and it is not something that should happen, so troubleshooting will be next to impossible. Most likely it is a leak inside a library we use, which really does not help narrow it down much.
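One way to get a first look at a leak like this is Python's standard `tracemalloc` module. It takes two snapshots and reports which source lines accumulated memory between them. The sketch below demonstrates the technique on a deliberately leaky stand-in function (`leaky_step` is hypothetical, not Faceswap code). Its limits matter here. `tracemalloc` only sees allocations made through Python's memory allocator. Recent NumPy versions also report their array buffers to it, but memory held natively by TensorFlow, CUDA, or OpenCV will not show up. Tracing also adds overhead.

```python
import tracemalloc


def top_growth(before, after, limit=5):
    """Return the source lines whose traced memory grew the most."""
    stats = after.compare_to(before, "lineno")  # sorted by largest change first
    return [s for s in stats if s.size_diff > 0][:limit]


# Demo: a deliberately leaky stand-in for one training step (hypothetical).
_cache = []


def leaky_step():
    _cache.append(bytearray(1_000_000))  # ~1 MB kept alive per call


tracemalloc.start()
baseline = tracemalloc.take_snapshot()
for _ in range(20):
    leaky_step()
current = tracemalloc.take_snapshot()
tracemalloc.stop()  # snapshots taken above remain usable after stopping

for stat in top_growth(baseline, current):
    print(stat)  # file:line, total size, and growth since the baseline
```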

My word is final
