I had been training my first model successfully for the last few days, but after around 300K iterations the training started crashing abruptly; at first glance the errors are CUDA/cuDNN related. I reinstalled everything from scratch, assuming there was some problem with the installation itself, but as I feared that did not solve it. At this point I suspect it is all related to the memory-saving options I have to use with my RTX 2080, and there might not be a solution, but I wanted to ask anyhow. Any insight would be helpful.
The exact error varies somewhat between crashes, but the majority are CUDA/cuDNN related.
Code: Select all
06/13/2020 18:41:45 INFO [Saved models] - Average since last save: face_128_loss_A: 0.02321, face_128_loss_B: 0.02041
06/13/2020 18:44:40 INFO [Saved models] - Average since last save: face_128_loss_A: 0.02316, face_128_loss_B: 0.02101
06/13/2020 18:47:42 INFO [Saved models] - Average since last save: face_128_loss_A: 0.02334, face_128_loss_B: 0.02091
06/13/2020 18:56:33 INFO [Saved models] - Average since last save: face_128_loss_A: 0.02278, face_128_loss_B: 0.02103
2020-06-13 18:57:09.575929: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2020-06-13 18:57:09.576102: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
Process exited.
I am also attaching the log file:
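Since the memory-saving options are one suspect, here is a minimal sketch of one thing I have tried reading about: enabling TensorFlow's GPU "memory growth" so the process allocates VRAM incrementally instead of reserving it all up front, which some people report helps with `CUDA_ERROR_ILLEGAL_ADDRESS`-style crashes. This is an assumption on my part that the backend is TensorFlow 2.x (the `tf.config.experimental` calls below are from that API); it is not a confirmed fix.

```python
# Hedged sketch: request incremental VRAM allocation ("memory growth")
# before any GPU op runs. Assumes a TensorFlow 2.x backend; this is an
# illustration, not a confirmed fix for the crash above.
try:
    import tensorflow as tf
    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        # Must be set before the GPU is first initialised.
        tf.config.experimental.set_memory_growth(gpu, True)
    status = "enabled"
except (ImportError, AttributeError):
    # TensorFlow missing or too old for this API in this environment.
    status = "skipped"
print(status)
```

If this runs on a machine without TensorFlow (or with an older version), it simply prints `skipped` rather than failing, so it is safe to drop into a startup script for testing.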