Error During Training: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered


Error During Training: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

Post by MagicRabbit » Fri Sep 03, 2021 3:32 am

Code:

09/02/2021 23:06:57 INFO     Log level set to: INFO
09/02/2021 23:06:58 INFO     Model A Directory: 'H:\DOCS\PROJECTS\FACESWAP_PY\proj1\FaceA' (1230 images)
09/02/2021 23:06:58 INFO     Model B Directory: 'H:\DOCS\PROJECTS\FACESWAP_PY\proj1\FaceB' (4744 images)
09/02/2021 23:06:58 INFO     Training data directory: H:\DOCS\PROJECTS\FACESWAP_PY\proj1\ModelAB
09/02/2021 23:06:58 INFO     ===================================================
09/02/2021 23:06:58 INFO       Starting
09/02/2021 23:06:58 INFO       Press 'Stop' to save and quit
09/02/2021 23:06:58 INFO     ===================================================
09/02/2021 23:06:59 INFO     Loading data, this may take a while...
09/02/2021 23:06:59 INFO     Loading Model from Original plugin...
09/02/2021 23:06:59 INFO     Using configuration saved in state file
09/02/2021 23:07:00 INFO     Loaded model from disk: 'H:\DOCS\PROJECTS\FACESWAP_PY\proj1\ModelAB\original.h5'
09/02/2021 23:07:00 INFO     Loading Trainer from Original plugin...
09/02/2021 23:07:12 INFO     [Saved models] - Average loss since last save: face_a: 0.02728, face_b: 0.02731
09/02/2021 23:09:07 INFO     [Saved models] - Average loss since last save: face_a: 0.03140, face_b: 0.02689
09/02/2021 23:11:01 INFO     [Saved models] - Average loss since last save: face_a: 0.03149, face_b: 0.02689
09/02/2021 23:12:56 INFO     [Saved models] - Average loss since last save: face_a: 0.03108, face_b: 0.02679
09/02/2021 23:14:02 INFO     Saved snapshot (25000 iterations)
09/02/2021 23:14:51 INFO     [Saved models] - Average loss since last save: face_a: 0.03093, face_b: 0.02667
09/02/2021 23:16:46 INFO     [Saved models] - Average loss since last save: face_a: 0.03093, face_b: 0.02696
09/02/2021 23:17:38 INFO     Saved project to: 'H:/DOCS/PROJECTS/FACESWAP_PY/proj1/facesw1.fsw'
09/02/2021 23:18:40 INFO     [Saved models] - Average loss since last save: face_a: 0.03105, face_b: 0.02693
09/02/2021 23:20:34 INFO     [Saved models] - Average loss since last save: face_a: 0.03096, face_b: 0.02650
09/02/2021 23:22:29 INFO     [Saved models] - Average loss since last save: face_a: 0.03088, face_b: 0.02661
09/02/2021 23:24:23 INFO     [Saved models] - Average loss since last save: face_a: 0.03057, face_b: 0.02655
09/02/2021 23:26:18 INFO     [Saved models] - Average loss since last save: face_a: 0.03062, face_b: 0.02630
2021-09-02 23:27:46.354831: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2021-09-02 23:27:46.355028: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:220] Unexpected Event status: 1
Process exited.

Hi

I've had this error pop up anywhere from two minutes to an hour into a training session, which promptly ends it. It also happened once when extraction was almost finished. None of the methods I've come across while researching this kind of issue seem to make any difference.

I am running this on Windows 10 installed to an SSD. My GPU is an MSI GeForce GTX 1050 Ti with 4 GB of VRAM. Yes, I bought it used, but aside from this it has not given me any problems. Here are all the methods I've tried for fixing this so far; I've been troubleshooting for six hours, so I may be leaving a few out:

  • Underclock GPU

  • Overclock GPU

  • Reduce GPU clock to remove the factory overclock present on some models of my GPU

  • Underclock CPU (my line of thinking was that it might be a PSU issue; my PSU is rated only 96 W above what my components require, and I've been overclocking my processor)

  • Use lowmem mode for the trainer I'm using, Original (this made the error appear sooner)

  • Install faceswap.py on both the OS and non-OS drives, and in different locations on the OS drive (i.e. the default install path, C:\ and the drive root)

  • Leave the PC completely untouched after starting training (this seems to help, but the error still appears after about an hour; see the stress-test sketch after this list)
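The stress test I mention above is roughly this (my own sketch, not a faceswap tool; it assumes TensorFlow 2.x is importable, e.g. from the faceswap conda environment). It just hammers the GPU with large matrix multiplies, so a flaky card or driver should reproduce the same illegal-address error outside of faceswap:

Code:

import time
import tensorflow as tf

# Minimal GPU soak test (sketch; assumes TensorFlow 2.x, as faceswap uses).
# Each 4096x4096 float32 tensor is ~64 MB, so this fits easily in 4 GB.
with tf.device("/GPU:0"):
    a = tf.random.normal((4096, 4096))
    b = tf.random.normal((4096, 4096))
    start = time.time()
    step = 0
    while time.time() - start < 3600:    # soak for one hour
        _ = tf.matmul(a, b).numpy()      # .numpy() forces the op to complete
        step += 1
        if step % 500 == 0:
            print(f"{step} matmuls completed without error")

If that crashes with the same CUDA_ERROR_ILLEGAL_ADDRESS, it would point at the card, driver or PSU rather than faceswap.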

"OP, your problem sounds a lot like this guy's. Learn to use the search bar lol" Yes, nothing I've tried from there has helped and as far as I'm aware, this person never solved their problem either. The only thing I haven't been able to try yet is upgrading my PSU, which I plan to do somewhat soon anyways, but I strongly suspect I will run into the same problems.

I am running out of ideas. A year ago, I ran this same tool on an Intel laptop CPU for up to 15 hours at a time and never once got any kind of error. The only other thing I've yet to try is one of the lighter-weight trainers, but I've read that the method they use can produce significantly lower-quality face swaps, and at that point, why bother? Is 4 GB of VRAM really not enough for the default method? And if it comes to it, would it be a bad idea to change trainers in the middle of training? I'm already 26k iterations in.


Re: Error During Training: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

Post by torzdf » Fri Sep 03, 2021 9:13 am

MagicRabbit wrote: Fri Sep 03, 2021 3:32 am


"OP, your problem sounds a lot like this guy's. Learn to use the search bar lol" Yes, nothing I've tried from there has helped and as far as I'm aware, this person never solved their problem either. The only thing I haven't been able to try yet is upgrading my PSU, which I plan to do somewhat soon anyways, but I strongly suspect I will run into the same problems.

Appreciated ;) Ultimately, this really does look like a hardware issue. I wish I could give you more to go on than that. The error is exactly what it says it is: Tensorflow is trying to access memory that it either can't or shouldn't. Most likely the former, seeing as this error seems very rare. Another thing you could try is running Faceswap in Ubuntu (either dual booting or a live USB should be fine, although I haven't tested the latter). I would not use a live USB as a permanent solution, but it would at least help identify whether this is a Windows-specific issue.

I plan to update the Tensorflow version Faceswap uses in the coming days/weeks, so that may or may not resolve your issue.

I wish I could give you more, but ultimately this comes down to a communication issue between Tensorflow and your hardware, which is upstream of us.
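If you want something to experiment with in the meantime, Tensorflow's stock memory options can sometimes ease pressure on a card that the OS is also using for the display. To be clear, this is a sketch of the generic TF 2.x API, not a faceswap setting:

Code:

import tensorflow as tf

# Generic Tensorflow 2.x VRAM options (sketch; not faceswap configuration).
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Allocate VRAM incrementally instead of reserving it all at start-up.
    tf.config.experimental.set_memory_growth(gpus[0], True)

    # Alternative (use one approach, not both): hard-cap the process and
    # leave headroom for the OS/display. The 3072 MB figure is illustrative.
    # tf.config.set_logical_device_configuration(
    #     gpus[0],
    #     [tf.config.LogicalDeviceConfiguration(memory_limit=3072)])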

MagicRabbit wrote: Fri Sep 03, 2021 3:32 am

The only other thing I've yet to try is one of the lighter-weight trainers, but I've read that the method they use can produce significantly lower-quality face swaps, and at that point, why bother? Is 4 GB of VRAM really not enough for the default method? And if it comes to it, would it be a bad idea to change trainers in the middle of training? I'm already 26k iterations in.

4 GB should be fine for the Original model. Hell, you should be able to run Dfaker or the Phaze-A default at 4 GB (depending on how much VRAM the OS reserves). You can change trainer, but it will start a new model from scratch (it is not possible to change the trainer on an existing model), which may be fine for testing to narrow down your issue.
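If you want to see how much of your 4 GB the OS and other applications are actually holding while you train, something along these lines will show you (it assumes the nvidia-smi tool that ships with the NVIDIA driver is on your PATH):

Code:

import subprocess

# Query VRAM usage via nvidia-smi (sketch; the flags below are its
# documented query options).
result = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=memory.total,memory.used,memory.free",
     "--format=csv"],
    capture_output=True, text=True, check=True)
print(result.stdout)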

My word is final


Re: Error During Training: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

Post by MagicRabbit »

torzdf wrote: Fri Sep 03, 2021 9:13 am

Another thing you could try is running Faceswap in Ubuntu (either dual booting or a live USB should be fine, although I haven't tested the latter).

Hey, thank you a ton for taking the time to talk to me about this. I will definitely try dual-booting some distro on this machine. I would be doing so already, but when I first built it, my MSI motherboard's BIOS was extremely fussy about booting anything that wasn't Windows 10, legacy boot enabled or not. This gives me an incentive to give it another shot, though. Take care.


Re: Error During Training: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

Post by MagicRabbit » Sat Sep 04, 2021 2:16 pm

Last night, after leaving my PC untouched with no other windows open and faceswap.py's windows minimized, it ran for nine hours uninterrupted without any errors. I've tested this method for about four hours at a time while out of the house, and again last night. I have no idea why this works, but I'm not complaining.

While I can't be sure this is a definitive fix, I'm marking it as such anyway.
