I'm getting a consistent error when training. The model is Phaze-A/DNY-512 with a batch size of 2, running on mirrored RTX 3090s with 64 GB of system RAM. I no longer have a second rig to check whether the problem is specific to this machine.
The error in the crash report is "numpy.core.exceptions.ArrayMemoryError: Unable to allocate 323. MiB for an array with shape (7, 580, 4, 3, 580, 3) and data type float32", even though I should have more than enough resources. In the command-line window I receive "libpng error: Read Error" over and over before the out-of-memory error occurs.
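Since the libpng errors make me suspect a corrupt image somewhere in my training set, I'm thinking of scanning for broken files with something like this (just a rough sketch — FACES_DIR is a placeholder for wherever the extracted faces actually live):

```python
from pathlib import Path
from PIL import Image

# Placeholder path -- point this at the extracted faces folder
FACES_DIR = Path("/data/faces")

for path in sorted(FACES_DIR.glob("*.png")):
    try:
        # verify() checks file integrity without decoding the pixel data
        with Image.open(path) as img:
            img.verify()
        # verify() can miss truncated data, so force a full decode too
        with Image.open(path) as img:
            img.load()
    except Exception as exc:
        print(f"Corrupt image: {path} ({exc})")
```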
The model typically runs with about 14 GB of system RAM utilized, but usage creeps up over roughly 36 hours until the out-of-memory error stops training. On the upside, if I exit faceswap and restart, it recovers gracefully and trains another 36-48 hours before hitting the error again.
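The most I've managed so far is watching the resident memory climb with a quick psutil loop (rough sketch; the PID would be whatever the faceswap training process shows in the task manager):

```python
import sys
import time

import psutil

# Hypothetical usage: python memwatch.py <faceswap_pid>
pid = int(sys.argv[1])
proc = psutil.Process(pid)

while True:
    # Resident set size in GB, sampled once a minute
    rss_gb = proc.memory_info().rss / (1024 ** 3)
    print(f"{time.strftime('%H:%M:%S')}  RSS: {rss_gb:.2f} GB", flush=True)
    time.sleep(60)
```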
Ordinarily I'd suspect a memory leak of some sort, but given the technology stack I have no idea how to troubleshoot it at that level. Does anyone have an idea what the cause could be? Truthfully, since it's just a matter of restarting, I'm inclined to restart it once a day, but perhaps it's something simple that I just can't grok.