
CuDNN error when using 2 GPUs

Posted: Sat Jan 25, 2020 8:49 am
by tochan

Hi,

I want to start a new Dlight training run, but it crashes when I use 2 GPUs (2x RTX 2080).
One GPU works. Any ideas?

Code:

2020-01-24 23:16:13.477575: E tensorflow/stream_executor/cuda/cuda_dnn.cc:82] CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(3765): 'cudnnPoolingForward( cudnn.handle(), pooling_desc.handle(), &alpha, src_desc.handle(), input_data.opaque(), &beta, dest_desc.handle(), output_data->opaque())'
01/24/2020 23:16:14 CRITICAL Error caught! Exiting...
01/24/2020 23:16:14 ERROR    Caught exception in thread: '_training_0'
Could not parse requirement: -umpy
Could not parse requirement: -pencv-python
01/24/2020 23:16:15 ERROR    Got Exception on main handler:
Traceback (most recent call last):
  File "C:\Users\denni\faceswap\lib\cli.py", line 128, in execute_script
    process.process()
  File "C:\Users\denni\faceswap\scripts\train.py", line 159, in process
    self._end_thread(thread, err)
  File "C:\Users\denni\faceswap\scripts\train.py", line 199, in _end_thread
    thread.join()
  File "C:\Users\denni\faceswap\lib\multithreading.py", line 121, in join
    raise thread.err[1].with_traceback(thread.err[2])
  File "C:\Users\denni\faceswap\lib\multithreading.py", line 37, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\denni\faceswap\scripts\train.py", line 224, in _training
    raise err
  File "C:\Users\denni\faceswap\scripts\train.py", line 214, in _training
    self._run_training_cycle(model, trainer)
  File "C:\Users\denni\faceswap\scripts\train.py", line 303, in _run_training_cycle
    trainer.train_one_step(viewer, timelapse)
  File "C:\Users\denni\faceswap\plugins\train\trainer\_base.py", line 316, in train_one_step
    raise err
  File "C:\Users\denni\faceswap\plugins\train\trainer\_base.py", line 283, in train_one_step
    loss[side] = batcher.train_one_batch()
  File "C:\Users\denni\faceswap\plugins\train\trainer\_base.py", line 424, in train_one_batch
    loss = self._model.predictors[self._side].train_on_batch(model_inputs, model_targets)
  File "C:\Users\denni\MiniConda3\envs\faceswap\lib\site-packages\keras\engine\training.py", line 1217, in train_on_batch
    outputs = self.train_function(ins)
  File "C:\Users\denni\MiniConda3\envs\faceswap\lib\site-packages\keras\backend\tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "C:\Users\denni\MiniConda3\envs\faceswap\lib\site-packages\keras\backend\tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "C:\Users\denni\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\client\session.py", line 1439, in __call__
    run_metadata_ptr)
  File "C:\Users\denni\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: cudnn PoolForward launch failed
[[{{node replica_1/model_1/encoder/average_pooling2d_1/AvgPool}} = AvgPool[T=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 2, 2], padding="VALID", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/device:GPU:1"](training/Adam/gradients/replica_1/model_1/encoder/conv_128_0_conv2d/convolution_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer)]]
[[{{node loss/mul/_1041}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3971_loss/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Re: CuDNN error when using 2 GPUs

Posted: Sat Jan 25, 2020 1:38 pm
by torzdf

Ok, you are in a bit of an annoying catch-22 situation here...

The error you are seeing is a GPU memory error that occurs on some GPU/OS/TensorFlow combinations. There is no obvious pattern to it, but the fix is to enable the "Allow Growth" option.

HOWEVER, "Allow Growth" cannot be used with multiple GPUs, so you will be forced to train on a single GPU.
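If you want to try it outside the GUI, the workaround can be sketched like this. This is a hedged illustration, not faceswap's exact code: pinning `CUDA_VISIBLE_DEVICES` is a standard CUDA mechanism for restricting a process to one GPU, and "Allow Growth" in faceswap's options corresponds (as far as I know) to TensorFlow 1.x's `gpu_options.allow_growth` session setting.

```python
import os

# Pin the process to a single GPU (GPU 0) so that "Allow Growth" can be used.
# This must be set BEFORE TensorFlow is imported for the first time.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# TF 1.x style (the version faceswap used at the time): ask cuDNN/TensorFlow
# to grow GPU memory on demand instead of pre-allocating it all up front.
try:
    import tensorflow as tf
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    session = tf.Session(config=config)
except Exception:
    # TensorFlow missing or a 2.x install without tf.ConfigProto;
    # shown for illustration only.
    pass
```

With `CUDA_VISIBLE_DEVICES="0"` set, TensorFlow only ever sees the first RTX 2080, which sidesteps the multi-GPU replica that triggered the `cudnn PoolForward launch failed` error above.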