
tensorflow.python.framework.errors_impl.InternalError: GPU sync failed

Posted: Sat Jun 13, 2020 11:13 pm
by kamapisachihd

I have been training my first model successfully for the last few days, but after about 300K iterations training has started crashing abruptly. At first glance the errors are CUDA/cuDNN related. I reinstalled everything from scratch, assuming the problem was in the code itself, but as I feared, that did not solve it. At this point I assume it is related to the memory-saving options I have to use with my RTX 2080, and there might not be a solution, but I wanted to ask anyway. Any insight would be helpful.

The exact error varies from crash to crash, but most of them involve CUDA/cuDNN.

Code:

06/13/2020 18:41:45 INFO     [Saved models] - Average since last save: face_128_loss_A: 0.02321, face_128_loss_B: 0.02041

06/13/2020 18:44:40 INFO     [Saved models] - Average since last save: face_128_loss_A: 0.02316, face_128_loss_B: 0.02101

06/13/2020 18:47:42 INFO     [Saved models] - Average since last save: face_128_loss_A: 0.02334, face_128_loss_B: 0.02091

06/13/2020 18:56:33 INFO     [Saved models] - Average since last save: face_128_loss_A: 0.02278, face_128_loss_B: 0.02103


2020-06-13 18:57:09.575929: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2020-06-13 18:57:09.576102: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
Process exited.

Attaching the log file as well:


Re: tensorflow.python.framework.errors_impl.InternalError: GPU sync failed

Posted: Sun Jun 14, 2020 8:37 am
by torzdf

If it worked, and now it's problematic, then it looks like a hardware issue to me:

https://www.google.com/search?client=fi ... ync+failed
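
Since the original post says the error varies between crashes, it may help to tally which CUDA/cuDNN errors actually show up across the crash logs before drawing conclusions. Below is a minimal sketch, assuming the TensorFlow log line layout seen in the excerpt above (timestamp, severity letter, source file and line, then the message). The helper names `parse_tf_log_line` and `tally_cuda_errors` are hypothetical and are not part of faceswap or TensorFlow.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpt in this thread:
#   "<date> <time>: <severity> <source file>:<line>] <message>"
TF_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r"(?P<sev>[IWEF]) (?P<src>\S+):(?P<line>\d+)\] (?P<msg>.*)$"
)
# Named error codes such as CUDA_ERROR_ILLEGAL_ADDRESS or CUDNN_STATUS_*.
ERROR_CODE = re.compile(r"\b(CUDA_ERROR_[A-Z_]+|CUDNN_STATUS_[A-Z_]+)\b")


def parse_tf_log_line(line):
    """Return a dict for a TensorFlow-style log line, or None if it does not match."""
    match = TF_LINE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    code = ERROR_CODE.search(record["msg"])
    record["code"] = code.group(1) if code else None
    return record


def tally_cuda_errors(lines):
    """Count error (E) and fatal (F) lines, keyed by error code or, failing that, message."""
    counts = Counter()
    for line in lines:
        record = parse_tf_log_line(line)
        if record and record["sev"] in ("E", "F"):
            counts[record["code"] or record["msg"]] += 1
    return counts


# Lines copied from the log excerpt posted above.
sample = [
    "06/13/2020 18:56:33 INFO     [Saved models] - Average since last save: "
    "face_128_loss_A: 0.02278, face_128_loss_B: 0.02103",
    "2020-06-13 18:57:09.575929: E tensorflow/stream_executor/cuda/cuda_event.cc:29] "
    "Error polling for event status: failed to query event: "
    "CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered",
    "2020-06-13 18:57:09.576102: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] "
    "Unexpected Event status: 1",
]

for key, count in tally_cuda_errors(sample).items():
    print(count, key)
```

Feeding the lines of several crash logs through `tally_cuda_errors` shows at a glance whether the same error code recurs or whether each crash reports something different.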