CuDNN error when using 2 GPUs


tochan
Posts: 21
Joined: Sun Sep 22, 2019 8:17 am
Been thanked: 5 times

CuDNN error when using 2 GPUs

Post by tochan »

Hi,

I want to start a new Dlight training run, but it crashes when I use 2 GPUs (2x RTX 2080).
It works fine with one GPU. Any ideas?

Code: Select all

2020-01-24 23:16:13.477575: E tensorflow/stream_executor/cuda/cuda_dnn.cc:82] CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(3765): 'cudnnPoolingForward( cudnn.handle(), pooling_desc.handle(), &alpha, src_desc.handle(), input_data.opaque(), &beta, dest_desc.handle(), output_data->opaque())'
01/24/2020 23:16:14 CRITICAL Error caught! Exiting...
01/24/2020 23:16:14 ERROR    Caught exception in thread: '_training_0'
Could not parse requirement: -umpy
Could not parse requirement: -pencv-python
01/24/2020 23:16:15 ERROR    Got Exception on main handler:
Traceback (most recent call last):
File "C:\Users\denni\faceswap\lib\cli.py", line 128, in execute_script
process.process()
File "C:\Users\denni\faceswap\scripts\train.py", line 159, in process
self._end_thread(thread, err)
File "C:\Users\denni\faceswap\scripts\train.py", line 199, in _end_thread
thread.join()
File "C:\Users\denni\faceswap\lib\multithreading.py", line 121, in join
raise thread.err[1].with_traceback(thread.err[2])
File "C:\Users\denni\faceswap\lib\multithreading.py", line 37, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\denni\faceswap\scripts\train.py", line 224, in _training
raise err
File "C:\Users\denni\faceswap\scripts\train.py", line 214, in _training
self._run_training_cycle(model, trainer)
File "C:\Users\denni\faceswap\scripts\train.py", line 303, in _run_training_cycle
trainer.train_one_step(viewer, timelapse)
File "C:\Users\denni\faceswap\plugins\train\trainer\_base.py", line 316, in train_one_step
raise err
File "C:\Users\denni\faceswap\plugins\train\trainer\_base.py", line 283, in train_one_step
loss[side] = batcher.train_one_batch()
File "C:\Users\denni\faceswap\plugins\train\trainer\_base.py", line 424, in train_one_batch
loss = self._model.predictors[self._side].train_on_batch(model_inputs, model_targets)
File "C:\Users\denni\MiniConda3\envs\faceswap\lib\site-packages\keras\engine\training.py", line 1217, in train_on_batch
outputs = self.train_function(ins)
File "C:\Users\denni\MiniConda3\envs\faceswap\lib\site-packages\keras\backend\tensorflow_backend.py", line 2715, in __call__
return self._call(inputs)
File "C:\Users\denni\MiniConda3\envs\faceswap\lib\site-packages\keras\backend\tensorflow_backend.py", line 2675, in _call
fetched = self._callable_fn(*array_vals)
File "C:\Users\denni\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\client\session.py", line 1439, in __call__
run_metadata_ptr)
File "C:\Users\denni\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 528, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: cudnn PoolForward launch failed
[[{{node replica_1/model_1/encoder/average_pooling2d_1/AvgPool}} = AvgPool[T=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 2, 2], padding="VALID", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/device:GPU:1"](training/Adam/gradients/replica_1/model_1/encoder/conv_128_0_conv2d/convolution_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer)]]
[[{{node loss/mul/_1041}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3971_loss/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
torzdf
Posts: 2649
Joined: Fri Jul 12, 2019 12:53 am
Answers: 159
Has thanked: 128 times
Been thanked: 622 times

Re: CuDNN error when using 2 GPUs

Post by torzdf »

OK, you are in a bit of an annoying catch-22 situation here...

The error you are seeing is a memory error that occurs on some GPU/OS/TensorFlow combinations. There is no obvious pattern to it, but the fix is to enable the "Allow Growth" option.
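
For reference, here is a minimal sketch of what "Allow Growth" does under the hood with the TensorFlow 1.x / Keras stack shown in your traceback. Faceswap sets this through its own GPU options, so treat the snippet as illustrative rather than the project's actual code:

Code: Select all

# Illustrative only: enable on-demand GPU memory allocation in TensorFlow 1.x,
# which is what the "Allow Growth" option toggles.
import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # grab GPU memory as needed instead of reserving it all up front
K.set_session(tf.Session(config=config))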

HOWEVER, you cannot use "Allow Growth" with multiple GPUs, so you will be forced to train on a single GPU.
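
If you do drop to a single GPU, one generic way to make sure TensorFlow only sees one of your two cards is the CUDA_VISIBLE_DEVICES environment variable (e.g. set CUDA_VISIBLE_DEVICES=0 in the command prompt before launching faceswap). Below is a Python equivalent, purely for illustration; this is a standard CUDA/TensorFlow technique rather than a faceswap-specific setting:

Code: Select all

# Illustrative only: hide every GPU except device 0 from CUDA so TensorFlow
# uses a single RTX 2080. This must run before TensorFlow is imported;
# the shell equivalent is "set CUDA_VISIBLE_DEVICES=0".
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf  # TensorFlow will now only see GPU 0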

My word is final
