tensorflow.python.framework.errors_impl.InternalError: GPU sync failed



kamapisachihd
Posts: 3
Joined: Sun Jun 07, 2020 12:39 am


Post by kamapisachihd »

I have been training my first model successfully for the last few days, but since passing 300K iterations training has been crashing abruptly. At first glance the errors look CUDA/cuDNN related, so I reinstalled everything from scratch on the assumption that something was wrong with the code itself. As I feared, that did not solve it. At this point I assume it is related to the memory-saving options I have to use with my RTX 2080, and there might not be a solution for this, but I wanted to ask anyway. Any insight would be helpful.

The exact error varies from crash to crash, but most of them involve CUDA/cuDNN.


06/13/2020 18:41:45 INFO     [Saved models] - Average since last save: face_128_loss_A: 0.02321, face_128_loss_B: 0.02041
06/13/2020 18:44:40 INFO     [Saved models] - Average since last save: face_128_loss_A: 0.02316, face_128_loss_B: 0.02101
06/13/2020 18:47:42 INFO     [Saved models] - Average since last save: face_128_loss_A: 0.02334, face_128_loss_B: 0.02091
06/13/2020 18:56:33 INFO     [Saved models] - Average since last save: face_128_loss_A: 0.02278, face_128_loss_B: 0.02103

2020-06-13 18:57:09.575929: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2020-06-13 18:57:09.576102: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
Process exited.
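
Since the error varies between crashes, it may help to tally which CUDA error codes actually show up across the saved logs. The following is only a minimal sketch, assuming the log lines look like the ones above; `count_cuda_errors` and the regular expression are illustrative names made up for this example and are not part of faceswap or TensorFlow.

```python
import re
from collections import Counter

# Matches CUDA driver error names such as CUDA_ERROR_ILLEGAL_ADDRESS.
# Assumption: the names appear verbatim in the log text, as in the excerpt above.
CUDA_ERROR_RE = re.compile(r"\bCUDA_ERROR_[A-Z0-9_]+\b")


def count_cuda_errors(log_text):
    """Return a Counter mapping each CUDA error name to its number of occurrences."""
    return Counter(CUDA_ERROR_RE.findall(log_text))


# Sample text copied from the two error lines in the crash output above.
sample = (
    "2020-06-13 18:57:09.575929: E tensorflow/stream_executor/cuda/cuda_event.cc:29] "
    "Error polling for event status: failed to query event: "
    "CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered\n"
    "2020-06-13 18:57:09.576102: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] "
    "Unexpected Event status: 1\n"
)

for name, count in count_cuda_errors(sample).items():
    print(name, count)
```

To use it on real logs, read each crash log into a string and pass it to `count_cuda_errors`. If one code dominates, that is a more specific term to search for than the generic "GPU sync failed" message.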

I am also attaching the log file:

Attachments
crash_report.2020.06.11.161626996942.log
torzdf
Posts: 2687
Joined: Fri Jul 12, 2019 12:53 am
Answers: 159
Has thanked: 135 times
Been thanked: 628 times


Post by torzdf »

If it worked before and is failing now, then it looks like a hardware issue to me:

https://www.google.com/search?client=fi ... ync+failed

My word is final
