Failed to get convolution algorithm: CRASH

If training is failing to start, and you are not receiving an error message telling you what to do, tell us about it here


Forum rules

Read the FAQs and search the forum before posting a new topic.

This forum is for reporting errors with the Training process. If you want to get tips, or better understand the Training process, then you should look in the Training Discussion forum.

Please mark any answers that fixed your problems so others can find the solutions.


Failed to get convolution algorithm: CRASH

Post by vicky_vigia »

Hi Team,

Here I am again, this time reporting a crash. The crash report is below.
Please help me!

Thanks again,
Vicky

Code: Select all

08/01/2019 11:43:26 MainProcess     training_0      _base           __init__                  DEBUG    Initialized Trainer
08/01/2019 11:43:26 MainProcess     training_0      train           load_trainer              DEBUG    Loaded Trainer
08/01/2019 11:43:26 MainProcess     training_0      train           run_training_cycle        DEBUG    Running Training Cycle
08/01/2019 11:43:26 MainProcess     training_0      training_data   minibatch                 DEBUG    Launching minibatch generator for queue (side: 'a', is_display: False)
08/01/2019 11:43:26 MainProcess     training_0      _base           generate_preview          DEBUG    Generating preview
08/01/2019 11:43:26 MainProcess     training_0      _base           set_preview_feed          DEBUG    Setting preview feed: (side: 'a')
08/01/2019 11:43:26 MainProcess     training_0      _base           load_generator            DEBUG    Loading generator: a
08/01/2019 11:43:26 MainProcess     training_0      _base           load_generator            DEBUG    input_size: 64, output_shapes: [(64, 64, 3)]
08/01/2019 11:43:26 MainProcess     training_0      training_data   __init__                  DEBUG    Initializing TrainingDataGenerator: (model_input_size: 64, model_output_shapes: [(64, 64, 3)], training_opts: {'alignments': {'a': '/home/vicky/Vicky/Projects/facial/faceswap/dataset/cageO/alignments.json', 'b': '/home/vicky/Vicky/Projects/facial/faceswap/dataset/trumpO/alignments.json'}, 'preview_scaling': 0.5, 'warp_to_landmarks': False, 'augment_color': True, 'no_flip': False, 'pingpong': False, 'snapshot_interval': 25000, 'training_size': 256, 'no_logs': False, 'mask_type': None, 'coverage_ratio': 0.625}, landmarks: False, config: {'mask_type': None, 'icnr_init': False, 'conv_aware_init': False, 'subpixel_upscaling': False, 'reflect_padding': False, 'dssim_loss': True, 'penalized_mask_loss': True, 'preview_images': 14, 'zoom_amount': 5, 'rotation_range': 10, 'shift_range': 5, 'flip_chance': 50, 'color_lightness': 30, 'color_ab': 8, 'color_clahe_chance': 50, 'color_clahe_max_size': 4})
08/01/2019 11:43:26 MainProcess     training_0      training_data   set_mask_class            DEBUG    Mask class: None
08/01/2019 11:43:26 MainProcess     training_0      training_data   __init__                  DEBUG    Initializing ImageManipulation: (input_size: 64, output_shapes: [(64, 64, 3)], coverage_ratio: 0.625, config: {'mask_type': None, 'icnr_init': False, 'conv_aware_init': False, 'subpixel_upscaling': False, 'reflect_padding': False, 'dssim_loss': True, 'penalized_mask_loss': True, 'preview_images': 14, 'zoom_amount': 5, 'rotation_range': 10, 'shift_range': 5, 'flip_chance': 50, 'color_lightness': 30, 'color_ab': 8, 'color_clahe_chance': 50, 'color_clahe_max_size': 4})
08/01/2019 11:43:26 MainProcess     training_0      training_data   __init__                  DEBUG    Output sizes: [64]
08/01/2019 11:43:26 MainProcess     training_0      training_data   __init__                  DEBUG    Initialized ImageManipulation
08/01/2019 11:43:26 MainProcess     training_0      training_data   __init__                  DEBUG    Initialized TrainingDataGenerator
08/01/2019 11:43:26 MainProcess     training_0      training_data   minibatch_ab              DEBUG    Queue batches: (image_count: 319, batchsize: 14, side: 'a', do_shuffle: True, is_preview, True, is_timelapse: False)
08/01/2019 11:43:26 MainProcess     training_0      training_data   make_queues               DEBUG    ['preview_a_in', 'preview_a_out']
08/01/2019 11:43:26 MainProcess     training_0      queue_manager   get_queue                 DEBUG    QueueManager getting: 'preview_a_in'
08/01/2019 11:43:26 MainProcess     training_0      queue_manager   add_queue                 DEBUG    QueueManager adding: (name: 'preview_a_in', maxsize: 0)
08/01/2019 11:43:26 MainProcess     training_0      queue_manager   add_queue                 DEBUG    QueueManager added: (name: 'preview_a_in')
08/01/2019 11:43:26 MainProcess     training_0      queue_manager   get_queue                 DEBUG    QueueManager got: 'preview_a_in'
08/01/2019 11:43:26 MainProcess     training_0      queue_manager   get_queue                 DEBUG    QueueManager getting: 'preview_a_out'
08/01/2019 11:43:26 MainProcess     training_0      queue_manager   add_queue                 DEBUG    QueueManager adding: (name: 'preview_a_out', maxsize: 0)
08/01/2019 11:43:26 MainProcess     training_0      queue_manager   add_queue                 DEBUG    QueueManager added: (name: 'preview_a_out')
08/01/2019 11:43:26 MainProcess     training_0      queue_manager   get_queue                 DEBUG    QueueManager got: 'preview_a_out'
08/01/2019 11:43:26 MainProcess     training_0      training_data   minibatch_ab              DEBUG    Batch shapes: [(14, 256, 256, 3), (14, 64, 64, 3), (14, 64, 64, 3)]
08/01/2019 11:43:26 MainProcess     training_0      multithreading  __init__                  DEBUG    Initializing FixedProducerDispatcher: (method: '<bound method TrainingDataGenerator.load_batches of <lib.training_data.TrainingDataGenerator object at 0x7f13347c59b0>>', shapes: [(14, 256, 256, 3), (14, 64, 64, 3), (14, 64, 64, 3)], ctype: <class 'ctypes.c_float'>, workers: 1, buffers: None)
08/01/2019 11:43:26 MainProcess     training_0      multithreading  __init__                  DEBUG    Initialized FixedProducerDispatcher
08/01/2019 11:43:26 MainProcess     training_0      training_data   minibatch_ab              DEBUG    Batching to queue: (side: 'a', is_display: True)
08/01/2019 11:43:26 MainProcess     training_0      _base           set_preview_feed          DEBUG    Set preview feed. Batchsize: 14
08/01/2019 11:43:26 MainProcess     training_0      training_data   minibatch                 DEBUG    Launching minibatch generator for queue (side: 'a', is_display: True)
08/01/2019 11:43:26 SpawnProcess-4  MainThread      multithreading  _runner                   DEBUG    FixedProducerDispatcher worker for <bound method TrainingDataGenerator.load_batches of <lib.training_data.TrainingDataGenerator object at 0x7f6ebee575c0>> started
08/01/2019 11:43:26 SpawnProcess-4  MainThread      training_data   load_batches              DEBUG    Loading batch: (image_count: 319, side: 'a', is_display: True, do_shuffle: True)
08/01/2019 11:43:26 MainProcess     training_0      _base           largest_face_index        DEBUG    0
08/01/2019 11:43:26 MainProcess     training_0      deprecation     new_func                  WARNING  From /home/vicky/miniconda3/envs/env_faceswap/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\nInstructions for updating:\nUse tf.cast instead.
08/01/2019 11:43:29 MainProcess     training_0      training_data   join_subprocess           DEBUG    Joining FixedProducerDispatcher
08/01/2019 11:43:29 SpawnProcess-2  MainThread      training_data   load_batches              DEBUG    Finished batching: (epoch: 128, side: 'a', is_display: False)
08/01/2019 11:43:29 SpawnProcess-2  MainThread      multithreading  _runner                   DEBUG    FixedProducerDispatcher worker for <bound method TrainingDataGenerator.load_batches of <lib.training_data.TrainingDataGenerator object at 0x7f60dff6a550>> shutdown
08/01/2019 11:43:29 MainProcess     training_0      training_data   join_subprocess           DEBUG    Joined FixedProducerDispatcher
08/01/2019 11:43:29 MainProcess     training_0      training_data   join_subprocess           DEBUG    Joining FixedProducerDispatcher
08/01/2019 11:43:29 SpawnProcess-3  MainThread      training_data   load_batches              DEBUG    Finished batching: (epoch: 128, side: 'b', is_display: False)
08/01/2019 11:43:29 SpawnProcess-3  MainThread      multithreading  _runner                   DEBUG    FixedProducerDispatcher worker for <bound method TrainingDataGenerator.load_batches of <lib.training_data.TrainingDataGenerator object at 0x7f919a390550>> shutdown
08/01/2019 11:43:29 MainProcess     training_0      training_data   join_subprocess           DEBUG    Joined FixedProducerDispatcher
08/01/2019 11:43:29 MainProcess     training_0      multithreading  run                       DEBUG    Error in thread (training_0): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.\n	 [[{{node encoder/conv_0_conv2d/convolution}}]]\n	 [[{{node decoder_a/face_out/Sigmoid-2-0-TransposeNCHWToNHWC-LayoutOptimizer}}]]
08/01/2019 11:43:29 MainProcess     MainThread      train           monitor                   DEBUG    Thread error detected
08/01/2019 11:43:29 MainProcess     MainThread      train           monitor                   DEBUG    Closed Monitor
08/01/2019 11:43:29 MainProcess     MainThread      train           end_thread                DEBUG    Ending Training thread
08/01/2019 11:43:29 MainProcess     MainThread      train           end_thread                CRITICAL Error caught! Exiting...
08/01/2019 11:43:29 MainProcess     MainThread      multithreading  join                      DEBUG    Joining Threads: 'training'
08/01/2019 11:43:29 MainProcess     MainThread      multithreading  join                      DEBUG    Joining Thread: 'training_0'
08/01/2019 11:43:29 MainProcess     MainThread      multithreading  join                      ERROR    Caught exception in thread: 'training_0'
Traceback (most recent call last):
  File "/home/vicky/Vicky/Projects/facial/faceswap/lib/cli.py", line 122, in execute_script
    process.process()
  File "/home/vicky/Vicky/Projects/facial/faceswap/scripts/train.py", line 98, in process
    self.end_thread(thread, err)
  File "/home/vicky/Vicky/Projects/facial/faceswap/scripts/train.py", line 124, in end_thread
    thread.join()
  File "/home/vicky/Vicky/Projects/facial/faceswap/lib/multithreading.py", line 460, in join
    raise thread.err[1].with_traceback(thread.err[2])
  File "/home/vicky/Vicky/Projects/facial/faceswap/lib/multithreading.py", line 391, in run
    self._target(*self._args, **self._kwargs)
  File "/home/vicky/Vicky/Projects/facial/faceswap/scripts/train.py", line 150, in training
    raise err
  File "/home/vicky/Vicky/Projects/facial/faceswap/scripts/train.py", line 140, in training
    self.run_training_cycle(model, trainer)
  File "/home/vicky/Vicky/Projects/facial/faceswap/scripts/train.py", line 222, in run_training_cycle
    trainer.train_one_step(viewer, timelapse)
  File "/home/vicky/Vicky/Projects/facial/faceswap/plugins/train/trainer/_base.py", line 211, in train_one_step
    raise err
  File "/home/vicky/Vicky/Projects/facial/faceswap/plugins/train/trainer/_base.py", line 176, in train_one_step
    loss[side] = batcher.train_one_batch(do_preview)
  File "/home/vicky/Vicky/Projects/facial/faceswap/plugins/train/trainer/_base.py", line 276, in train_one_batch
    loss = self.model.predictors[self.side].train_on_batch(*batch)
  File "/home/vicky/miniconda3/envs/env_faceswap/lib/python3.6/site-packages/keras/engine/training.py", line 1217, in train_on_batch
    outputs = self.train_function(ins)
  File "/home/vicky/miniconda3/envs/env_faceswap/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/home/vicky/miniconda3/envs/env_faceswap/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/home/vicky/miniconda3/envs/env_faceswap/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/home/vicky/miniconda3/envs/env_faceswap/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node encoder/conv_0_conv2d/convolution}}]]
	 [[{{node decoder_a/face_out/Sigmoid-2-0-TransposeNCHWToNHWC-LayoutOptimizer}}]]

============ System Information ============
encoding:            UTF-8
git_branch:          master
git_commits:         c3adc93 Update GUI Graph + Stats when model has finished saving. 2610eff Bugfix: GUI: Progress bar on times over 1 hour (extract/convert). c1c60a9 bugfix: Clip output from scaling in convert. 8b2f166 Update helptext for CA Initialization. b6c830c Bugfix: Alignments tool: Correctly set items attribute on Check job
gpu_cuda:            10.1
gpu_cudnn:           7.6.0
gpu_devices:         GPU_0: GeForce RTX 2080 Ti
gpu_devices_active:  GPU_0
gpu_driver:          418.56
gpu_vram:            GPU_0: 10986MB
os_machine:          x86_64
os_platform:         Linux-4.18.0-25-generic-x86_64-with-debian-buster-sid
os_release:          4.18.0-25-generic
py_command:          /home/vicky/Vicky/Projects/facial/faceswap/faceswap.py train -A /home/vicky/Vicky/Projects/facial/faceswap/dataset/cageO -B /home/vicky/Vicky/Projects/facial/faceswap/dataset/trumpO -m /home/vicky/Vicky/Projects/facial/faceswap/dataset/trump-cage-model -t original -s 100 -ss 25000 -bs 64 -it 1000000 -g 1 -ps 50 -L INFO -gui
py_conda_version:    conda 4.7.10
py_implementation:   CPython
py_version:          3.6.6
py_virtual_env:      True
sys_cores:           8
sys_processor:       x86_64
sys_ram:             Total: 32102MB, Available: 20682MB, Used: 10020MB, Free: 2429MB

=============== Pip Packages ===============
absl-py==0.7.1
astor==0.7.1
astroid==2.2.5
certifi==2019.6.16
cloudpickle==1.2.1
cycler==0.10.0
cytoolz==0.10.0
dask==2.1.0
decorator==4.4.0
fastcluster==1.1.25
ffmpy==0.2.2
gast==0.2.2
google-pasta==0.1.7
grpcio==1.14.1
h5py==2.9.0
imageio==2.5.0
imageio-ffmpeg==0.3.0
isort==4.3.21
joblib==0.13.2
Keras==2.2.4
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
kiwisolver==1.1.0
lazy-object-proxy==1.4.1
Markdown==3.1.1
matplotlib==2.2.2
mccabe==0.6.1
mock==3.0.5
networkx==2.3
numpy==1.16.2
nvidia-ml-py3==7.352.1
olefile==0.46
opencv-python==4.1.0.25
pathlib==1.0.1
Pillow==5.1.0
protobuf==3.8.0
psutil==5.6.3
pylint==2.3.1
pyparsing==2.4.0
python-dateutil==2.8.0
pytz==2019.1
PyWavelets==1.0.3
PyYAML==5.1.1
scikit-image==0.15.0
scikit-learn==0.21.2
scipy==1.3.0
six==1.12.0
tensorboard==1.13.1
tensorflow==1.13.1
tensorflow-estimator==1.13.0
termcolor==1.1.0
toolz==0.10.0
toposort==1.5
tornado==6.0.3
tqdm==4.32.1
typed-ast==1.4.0
Werkzeug==0.15.4
wrapt==1.11.2

============== Conda Packages ==============
# packages in environment at /home/vicky/miniconda3/envs/env_faceswap:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_tflow_select 2.1.0 gpu
absl-py 0.7.1 py36_0
astor 0.7.1 py36_0
blas 1.0 openblas
bzip2 1.0.8 h516909a_0 conda-forge
c-ares 1.15.0 h7b6447c_1001
ca-certificates 2019.5.15 0
cairo 1.14.12 h77bcde2_0
certifi 2019.6.16 py36_1
cloudpickle 1.2.1 py_0
cudatoolkit 10.0.130 0
cudnn 7.6.0 cuda10.0_0
cupti 10.0.130 0
cycler 0.10.0 py36_0
cytoolz 0.10.0 py36h7b6447c_0
dask-core 2.1.0 py_0
dbus 1.13.2 hc3f9b76_0
decorator 4.4.0 py36_1
expat 2.2.5 he1b5a44_1003 conda-forge
ffmpeg 4.0 h04d0a96_0
fontconfig 2.12.6 h49f89f6_0
freetype 2.8 hab7d2ae_1
gast 0.2.2 py36_0
gettext 0.19.8.1 hc5be6a0_1002 conda-forge
giflib 5.1.9 h516909a_0 conda-forge
glib 2.53.6 h5d9569c_2
gmp 6.1.2 hf484d3e_1000 conda-forge
gnutls 3.6.5 hd3a4fd2_1002 conda-forge
google-pasta 0.1.7 py_0
graphite2 1.3.13 hf484d3e_1000 conda-forge
grpcio 1.14.1 py36h9ba97e2_0
gst-plugins-base 1.12.4 h33fb286_0
gstreamer 1.12.4 hb53b477_0
h5py 2.9.0 pypi_0 pypi
harfbuzz 1.7.6 hc5b324e_0
hdf5 1.10.2 hba1933b_1
icu 58.2 h9c2bf20_1
imageio 2.5.0 py36_0
jasper 1.900.1 h07fcdf6_1006 conda-forge
jpeg 9c h14c3975_1001 conda-forge
keras 2.2.4 0
keras-applications 1.0.8 py_0
keras-base 2.2.4 py36_0
keras-preprocessing 1.1.0 py_1
kiwisolver 1.1.0 py36he6710b0_0
lame 3.100 h14c3975_1001 conda-forge
libblas 3.8.0 10_openblas conda-forge
libcblas 3.8.0 10_openblas conda-forge
libedit 3.1.20181209 hc058e9b_0
libffi 3.2.1 hd88cf55_4
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_0
libiconv 1.15 h516909a_1005 conda-forge
liblapack 3.8.0 10_openblas conda-forge
liblapacke 3.8.0 10_openblas conda-forge
libopenblas 0.3.6 h6e990d7_6 conda-forge
libopus 1.3 h7b6447c_0
libpng 1.6.37 hed695b0_0 conda-forge
libprotobuf 3.8.0 hd408876_0
libstdcxx-ng 9.1.0 hdf63c60_0
libtiff 4.0.10 h57b8799_1003 conda-forge
libuuid 2.32.1 h14c3975_1000 conda-forge
libvpx 1.7.0 h439df22_0
libwebp 1.0.2 h576950b_1 conda-forge
libxcb 1.13 h14c3975_1002 conda-forge
libxml2 2.9.9 hea5a465_1
lz4-c 1.8.3 he1b5a44_1001 conda-forge
markdown 3.1.1 py36_0
matplotlib 2.2.2 py36h0e671d2_1
mock 3.0.5 py36_0
ncurses 6.1 he6710b0_1
nettle 3.4.1 h1bed415_1002 conda-forge
networkx 2.3 py_0
numpy 1.16.4 py36h95a1406_0 conda-forge
olefile 0.46 py36_0
openblas 0.3.6 h6e990d7_6 conda-forge
opencv 3.4.1 py36h6fd60c2_1
openh264 1.8.0 hdbcaa40_1000 conda-forge
openssl 1.0.2s h7b6447c_0
pathlib 1.0.1 py36_1
pcre 8.41 hf484d3e_1003 conda-forge
pillow 5.1.0 py36h3deb7b8_0
pip 19.1.1 py36_0
pixman 0.38.0 h516909a_1003 conda-forge
protobuf 3.8.0 py36he6710b0_0
pthread-stubs 0.4 h14c3975_1001 conda-forge
pyparsing 2.4.0 py_0
pyqt 5.9.2 py36h751905a_0
python 3.6.6 h6e4f718_2
python-dateutil 2.8.0 py36_0
pytz 2019.1 py_0
pywavelets 1.0.3 py36hdd07704_1
pyyaml 5.1.1 py36h7b6447c_0
qt 5.9.4 h4e5bff0_0
readline 7.0 h7b6447c_5
scikit-image 0.15.0 py36he6710b0_0
scipy 1.3.0 py36he2b7bc3_0
setuptools 41.0.1 py36_0
sip 4.19.8 py36hf484d3e_0
six 1.12.0 py36_0
sqlite 3.29.0 h7b6447c_0
tensorboard 1.13.1 py36hf484d3e_0
tensorflow 1.13.1 gpu_py36h3991807_0
tensorflow-base 1.13.1 gpu_py36h8d69cac_0
tensorflow-estimator 1.13.0 py_0
tensorflow-gpu 1.13.1 h0d30ee6_0
termcolor 1.1.0 py36_1
tk 8.6.8 hbc83047_0
toolz 0.10.0 py_0
tornado 6.0.3 py36h7b6447c_0
tqdm 4.32.1 py_0
werkzeug 0.15.4 py_0
wheel 0.33.4 py36_0
wrapt 1.11.2 py36h7b6447c_0
x264 1!152.20180806 h14c3975_0 conda-forge
xorg-kbproto 1.0.7 h14c3975_1002 conda-forge
xorg-libice 1.0.10 h516909a_0 conda-forge
xorg-libsm 1.2.3 h84519dc_1000 conda-forge
xorg-libx11 1.6.8 h516909a_0 conda-forge
xorg-libxau 1.0.9 h14c3975_0 conda-forge
xorg-libxdmcp 1.1.3 h516909a_0 conda-forge
xorg-libxext 1.3.4 h516909a_0 conda-forge
xorg-libxrender 0.9.10 h516909a_1002 conda-forge
xorg-renderproto 0.11.1 h14c3975_1002 conda-forge
xorg-xextproto 7.3.0 h14c3975_1002 conda-forge
xorg-xproto 7.0.31 h14c3975_1007 conda-forge
xz 5.2.4 h14c3975_4
yaml 0.1.7 had09818_2
zlib 1.2.11 h7b6447c_3
zstd 1.4.0 h3b9ef0a_0 conda-forge

Re: Failed to get convolution algorithm: CRASH

Post by torzdf »

In the first instance, try a reboot. This error normally means something has gone wrong in CUDA/cuDNN, and a reboot is usually what fixes it.

My word is final


[Mitigated] Brand new 2060 Super. Trouble training.

Post by calipheron »

Hello, I've just taken delivery of a new 2060 Super.

The 2060 Super is the current recommended baseline hardware for Faceswap, so I assumed I'd be able to move up to slightly heftier models than DFaker or Original compared to the RX 580 I was using. I have the latest Studio drivers installed.

I'm trying to train with DLight, on absolutely stock default settings except for 80% face coverage and the extended mask.

Unless I use a batch size of 2 (!!), I get out-of-memory errors.
With a BS of 2, I'm getting EG/s of around 3-4.
This is under Windows 10, with nothing else running in the background.
Ryzen 7 3700X, 32GB RAM.

I've had CUDA sync errors, and have just had "CUDA_ERROR_LAUNCH_FAILED".

Any suggestions would be appreciated.


Re: Brand new 2060 Super. Trouble training.

Post by bryanlyon »

Windows 10 does reserve a hefty amount of VRAM for itself, even if nothing else is running. That may be cutting into your headroom significantly, but I think you should be able to get a higher batch size than 2. If you run nvidia-smi (try inside your faceswap environment; if it isn't found there, it also lives somewhere inside the c:/nvidia folder), it will tell you how much VRAM is available before training starts.

You may also benefit from "allow growth", which sometimes works around memory issues, but keep an eye on it: the fact that training starts is no guarantee it will stay stable.
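
If you want to script that check rather than eyeball nvidia-smi, the nvidia-ml-py3 package shown in the pip list earlier in this thread exposes the same counters; a minimal sketch (assuming GPU index 0 is the card in question):

Code: Select all

import pynvml  # provided by the nvidia-ml-py3 package

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)    # first (only) GPU
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)     # values are in bytes
print("total: %dMiB  used: %dMiB  free: %dMiB"
      % (mem.total // 1024**2, mem.used // 1024**2, mem.free // 1024**2))
pynvml.nvmlShutdown()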


Re: Brand new 2060 Super. Trouble training.

Post by calipheron »

Thanks Bryan.

nvidia-smi reports that only 939MiB of the 8192MiB is in use - though that is with Firefox open.
I will try with "allow growth".

I don't mind slower training, but this is disappointing so far: I never had ONE single memory-related error on my RX 580.

I just had this crash:

Code: Select all

2020-06-20 18:22:29.652652: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-06-20 18:22:29.652806: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
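
For reference, the "allow growth" option mentioned above changes how TensorFlow claims VRAM: rather than grabbing nearly all of it up front, it allocates on demand. A minimal TensorFlow 1.x sketch of the underlying session setting (an illustration only, using the stock tf/keras APIs from the package lists earlier in the thread, not the faceswap code path):

Code: Select all

import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
# Allocate GPU memory as needed instead of reserving almost all of it at start-up.
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))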

Re: Brand new 2060 Super. Trouble training.

Post by bryanlyon »

Those errors could just mean that the data couldn't be loaded onto the card, or it could be that the drivers are misbehaving. My advice is to (in this order) install the latest drivers, reboot, close all other applications, and try again.


Re: Brand new 2060 Super. Trouble training.

Post by calipheron »

Trying with allow growth, no other settings changed. Getting a lot of this:

Code: Select all

2020-06-20 18:29:20.692407: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 2.49G (2673629440 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

There's more:

Code: Select all

2020-06-20 18:29:10.471097: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2020-06-20 18:29:10.650982: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-06-20 18:29:11.485247: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: Invoking ptxas not supported on Windows
Relying on driver to perform ptx compilation. This message will be only logged once.
2020-06-20 18:29:11.611848: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 2.49G (2673629440 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-06-20 18:29:18.005698: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.11GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
(... a few of these)
2020-06-20 18:29:18.460946: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.06GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

2020-06-20 18:29:19.920119: W tensorflow/core/common_runtime/bfc_allocator.cc:305] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
2020-06-20 18:29:19.946687: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 2.49G (2673629440 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
(... many of these)
2020-06-20 18:29:21.211562: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 2.49G (2673629440 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
(... a few of these)
2020-06-20 18:29:37.043250: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 2.49G (2673629440 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-06-20 18:29:37.043354: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.21MiB (rounded to 1267200).  Current allocation summary follows.
2020-06-20 18:29:37.043476: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (256): 	Total Chunks: 261, Chunks in use: 261. 65.3KiB allocated for chunks. 65.3KiB in use in bin. 9.5KiB client-requested in use in bin.
2020-06-20 18:29:37.043595: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (512): 	Total Chunks: 51, Chunks in use: 51. 25.5KiB allocated for chunks. 25.5KiB in use in bin. 25.5KiB client-requested in use in bin.
2020-06-20 18:29:37.043701: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (1024): 	Total Chunks: 41, Chunks in use: 41. 41.8KiB allocated for chunks. 41.8KiB in use in bin. 41.0KiB client-requested in use in bin.
2020-06-20 18:29:37.043810: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (2048): 	Total Chunks: 63, Chunks in use: 63. 126.0KiB allocated for chunks. 126.0KiB in use in bin. 126.0KiB client-requested in use in bin.
2020-06-20 18:29:37.043918: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (4096): 	Total Chunks: 17, Chunks in use: 17. 93.0KiB allocated for chunks. 93.0KiB in use in bin. 93.0KiB client-requested in use in bin.
2020-06-20 18:29:37.044028: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (8192): 	Total Chunks: 13, Chunks in use: 13. 123.5KiB allocated for chunks. 123.5KiB in use in bin. 121.9KiB client-requested in use in bin.
2020-06-20 18:29:37.044411: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (16384): 	Total Chunks: 13, Chunks in use: 13. 224.5KiB allocated for chunks. 224.5KiB in use in bin. 224.5KiB client-requested in use in bin.
2020-06-20 18:29:37.044538: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (32768): 	Total Chunks: 4, Chunks in use: 4. 144.0KiB allocated for chunks. 144.0KiB in use in bin. 144.0KiB client-requested in use in bin.
2020-06-20 18:29:37.044657: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (65536): 	Total Chunks: 11, Chunks in use: 11. 736.0KiB allocated for chunks. 736.0KiB in use in bin. 736.0KiB client-requested in use in bin.
2020-06-20 18:29:37.044784: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (131072): 	Total Chunks: 23, Chunks in use: 23. 3.94MiB allocated for chunks. 3.94MiB in use in bin. 3.89MiB client-requested in use in bin.
2020-06-20 18:29:37.044911: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (262144): 	Total Chunks: 1, Chunks in use: 1. 384.0KiB allocated for chunks. 384.0KiB in use in bin. 384.0KiB client-requested in use in bin.
2020-06-20 18:29:37.045036: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (524288): 	Total Chunks: 34, Chunks in use: 34. 19.35MiB allocated for chunks. 19.35MiB in use in bin. 18.88MiB client-requested in use in bin.
2020-06-20 18:29:37.045162: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (1048576): 	Total Chunks: 8, Chunks in use: 8. 9.67MiB allocated for chunks. 9.67MiB in use in bin. 9.67MiB client-requested in use in bin.
2020-06-20 18:29:37.045287: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (2097152): 	Total Chunks: 25, Chunks in use: 25. 57.40MiB allocated for chunks. 57.40MiB in use in bin. 56.25MiB client-requested in use in bin.
2020-06-20 18:29:37.045394: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (4194304): 	Total Chunks: 19, Chunks in use: 19. 94.38MiB allocated for chunks. 94.38MiB in use in bin. 92.63MiB client-requested in use in bin.
2020-06-20 18:29:37.045504: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (8388608): 	Total Chunks: 63, Chunks in use: 63. 566.00MiB allocated for chunks. 566.00MiB in use in bin. 562.50MiB client-requested in use in bin.
2020-06-20 18:29:37.045613: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (16777216): 	Total Chunks: 18, Chunks in use: 18. 380.05MiB allocated for chunks. 380.05MiB in use in bin. 374.26MiB client-requested in use in bin.
2020-06-20 18:29:37.045723: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (33554432): 	Total Chunks: 1, Chunks in use: 1. 32.00MiB allocated for chunks. 32.00MiB in use in bin. 18.00MiB client-requested in use in bin.
2020-06-20 18:29:37.045834: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (67108864): 	Total Chunks: 15, Chunks in use: 15. 1.50GiB allocated for chunks. 1.50GiB in use in bin. 1.50GiB client-requested in use in bin.
2020-06-20 18:29:37.045948: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (134217728): 	Total Chunks: 9, Chunks in use: 9. 1.44GiB allocated for chunks. 1.44GiB in use in bin. 1.15GiB client-requested in use in bin.
2020-06-20 18:29:37.046057: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (268435456): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-06-20 18:29:37.046161: I tensorflow/core/common_runtime/bfc_allocator.cc:885] Bin for 1.21MiB was 1.00MiB, Chunk State:
2020-06-20 18:29:37.046225: I tensorflow/core/common_runtime/bfc_allocator.cc:898] Next region of size 1048576
2020-06-20 18:29:37.046281: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000B0B800000 next 1 of size 1280
2020-06-20 18:29:37.046342: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000B0B800500 next 165 of size 256
(... many of these)
2020-06-20 18:29:37.054866: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000B0B800600 next 166 of size 256
2020-06-20 18:29:37.056429: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000B0B855B00 next 116 of size 256
2020-06-20 18:29:37.056503: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000B0B855C00 next 112 of size 256
2020-06-20 18:29:37.056578: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000B0B855D00 next 18446744073709551615 of size 697088
2020-06-20 18:29:37.056663: I tensorflow/core/common_runtime/bfc_allocator.cc:898] Next region of size 4194304
2020-06-20 18:29:37.056724: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000B0BA00000 next 18446744073709551615 of size 4194304
2020-06-20 18:29:37.056810: I tensorflow/core/common_runtime/bfc_allocator.cc:898] Next region of size 4194304
2020-06-20 18:29:37.056871: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000B0BE00000 next 113 of size 2048
2020-06-20 18:29:37.056943: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000B0BE00800 next 130 of size 1024
2020-06-20 18:29:37.057016: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000B0BE00C00 next 131 of size 256
2020-06-20 18:29:37.414634: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000BE78BFC00 next 614 of size 5811200
2020-06-20 18:29:37.414706: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000BE7E4A800 next 633 of size 24729600
2020-06-20 18:29:37.414779: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000BE95E0000 next 628 of size 16384
2020-06-20 18:29:37.414850: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000BE95E4000 next 583 of size 9437184
2020-06-20 18:29:37.414923: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000BE9EE4000 next 698 of size 109051904
2020-06-20 18:29:37.414996: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000BF06E4000 next 563 of size 105963520
2020-06-20 18:29:37.415070: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000BF6BF2000 next 697 of size 65536
2020-06-20 18:29:37.437637: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000BF6C02000 next 640 of size 9728
2020-06-20 18:29:37.437711: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000BF6C04600 next 704 of size 1267200
2020-06-20 18:29:37.437785: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000BF6D39C00 next 726 of size 6656
2020-06-20 18:29:37.437857: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000BF6D3B600 next 569 of size 150994944
2020-06-20 18:29:37.437934: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000BFFD3B600 next 729 of size 9437184
2020-06-20 18:29:37.438008: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C0063B600 next 653 of size 9437184
2020-06-20 18:29:37.438084: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C00F3B600 next 728 of size 224000
2020-06-20 18:29:37.438157: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C00F72100 next 665 of size 2048
2020-06-20 18:29:37.438229: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C00F72900 next 566 of size 9437184
2020-06-20 18:29:37.438302: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C01872900 next 585 of size 2048
2020-06-20 18:29:37.438375: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C01873100 next 725 of size 9437184
2020-06-20 18:29:37.438449: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C02173100 next 598 of size 2359296
2020-06-20 18:29:37.438529: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C023B3100 next 642 of size 2359296
2020-06-20 18:29:37.438603: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C025F3100 next 675 of size 9437184
2020-06-20 18:29:37.438675: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C02EF3100 next 732 of size 9437184
2020-06-20 18:29:37.438745: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C037F3100 next 646 of size 9437184
2020-06-20 18:29:37.438816: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C040F3100 next 674 of size 2048
2020-06-20 18:29:37.438889: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C040F3900 next 648 of size 1024
2020-06-20 18:29:37.438960: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C040F3D00 next 700 of size 589824
2020-06-20 18:29:37.439033: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C04183D00 next 627 of size 512
2020-06-20 18:29:37.439103: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C04183F00 next 584 of size 9437184
2020-06-20 18:29:37.439179: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C04A83F00 next 641 of size 9437184
2020-06-20 18:29:37.439255: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C05383F00 next 739 of size 9437184
2020-06-20 18:29:37.439331: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C05C83F00 next 425 of size 2359296
2020-06-20 18:29:37.439403: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C05EC3F00 next 596 of size 2359296
2020-06-20 18:29:37.439478: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C06103F00 next 733 of size 589824
2020-06-20 18:29:37.439550: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C06193F00 next 676 of size 589824
2020-06-20 18:29:37.439622: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C06223F00 next 734 of size 589824
2020-06-20 18:29:37.439693: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C062B3F00 next 570 of size 256
2020-06-20 18:29:37.439763: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C062B4000 next 632 of size 4718592
2020-06-20 18:29:37.464023: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C06734000 next 592 of size 18874368
2020-06-20 18:29:37.464101: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C07934000 next 611 of size 2048
2020-06-20 18:29:37.464173: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C07934800 next 735 of size 512
2020-06-20 18:29:37.464244: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C07934A00 next 679 of size 512
2020-06-20 18:29:37.464317: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C07934C00 next 609 of size 4096
2020-06-20 18:29:37.464390: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C07935C00 next 604 of size 147456
2020-06-20 18:29:37.464467: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C07959C00 next 736 of size 9437184
2020-06-20 18:29:37.464541: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C08259C00 next 737 of size 256
2020-06-20 18:29:37.464613: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C08259D00 next 742 of size 19200
2020-06-20 18:29:37.464686: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C0825E800 next 681 of size 256
2020-06-20 18:29:37.464757: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C0825E900 next 682 of size 256
2020-06-20 18:29:37.464829: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C0825EA00 next 18446744073709551615 of size 196744704
2020-06-20 18:29:37.464911: I tensorflow/core/common_runtime/bfc_allocator.cc:914]      Summary of in-use Chunks by size:
2020-06-20 18:29:37.464982: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 261 Chunks of size 256 totalling 65.3KiB
2020-06-20 18:29:37.465051: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 51 Chunks of size 512 totalling 25.5KiB
2020-06-20 18:29:37.465119: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 39 Chunks of size 1024 totalling 39.0KiB
2020-06-20 18:29:37.465188: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 1280 totalling 1.3KiB
2020-06-20 18:29:37.465256: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 1536 totalling 1.5KiB
2020-06-20 18:29:37.465323: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 63 Chunks of size 2048 totalling 126.0KiB
2020-06-20 18:29:37.465392: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 7 Chunks of size 4096 totalling 28.0KiB
2020-06-20 18:29:37.465459: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 10 Chunks of size 6656 totalling 65.0KiB
2020-06-20 18:29:37.465535: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 13 Chunks of size 9728 totalling 123.5KiB
2020-06-20 18:29:37.465603: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 7 Chunks of size 16384 totalling 112.0KiB
2020-06-20 18:29:37.465670: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 6 Chunks of size 19200 totalling 112.5KiB
2020-06-20 18:29:37.465740: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 4 Chunks of size 36864 totalling 144.0KiB
2020-06-20 18:29:37.465810: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 10 Chunks of size 65536 totalling 640.0KiB
2020-06-20 18:29:37.465883: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 98304 totalling 96.0KiB
2020-06-20 18:29:37.465951: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 13 Chunks of size 147456 totalling 1.83MiB
2020-06-20 18:29:37.466020: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 196608 totalling 192.0KiB
2020-06-20 18:29:37.466088: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 9 Chunks of size 224000 totalling 1.92MiB
2020-06-20 18:29:37.466157: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 393216 totalling 384.0KiB
2020-06-20 18:29:37.466225: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 4 Chunks of size 524288 totalling 2.00MiB
2020-06-20 18:29:37.466293: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 28 Chunks of size 589824 totalling 15.75MiB
2020-06-20 18:29:37.489685: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 697088 totalling 680.8KiB
2020-06-20 18:29:37.489755: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 983040 totalling 960.0KiB
2020-06-20 18:29:37.489828: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 8 Chunks of size 1267200 totalling 9.67MiB
2020-06-20 18:29:37.489901: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 24 Chunks of size 2359296 totalling 54.00MiB
2020-06-20 18:29:37.489972: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 3562496 totalling 3.40MiB
2020-06-20 18:29:37.490042: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 4194304 totalling 4.00MiB
2020-06-20 18:29:37.490111: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 9 Chunks of size 4718592 totalling 40.50MiB
2020-06-20 18:29:37.490180: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 9 Chunks of size 5811200 totalling 49.88MiB
2020-06-20 18:29:37.490249: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 8388608 totalling 8.00MiB
2020-06-20 18:29:37.490318: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 62 Chunks of size 9437184 totalling 558.00MiB
2020-06-20 18:29:37.490390: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 8 Chunks of size 18874368 totalling 144.00MiB
2020-06-20 18:29:37.490462: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 19964160 totalling 19.04MiB
2020-06-20 18:29:37.490533: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 8 Chunks of size 24729600 totalling 188.67MiB
2020-06-20 18:29:37.490603: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 29715712 totalling 28.34MiB
2020-06-20 18:29:37.490673: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 33554432 totalling 32.00MiB
2020-06-20 18:29:37.490743: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 9 Chunks of size 105963520 totalling 909.49MiB
2020-06-20 18:29:37.490814: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 6 Chunks of size 109051904 totalling 624.00MiB
2020-06-20 18:29:37.490881: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 5 Chunks of size 150994944 totalling 720.00MiB
2020-06-20 18:29:37.490950: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 155083520 totalling 147.90MiB
2020-06-20 18:29:37.491022: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 176162048 totalling 168.00MiB
2020-06-20 18:29:37.491093: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 196744704 totalling 187.63MiB
2020-06-20 18:29:37.491166: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 267675648 totalling 255.28MiB
2020-06-20 18:29:37.491237: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 4.08GiB
2020-06-20 18:29:37.491309: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 4379901952 memory_limit_: 7053531546 available bytes: 2673629594 curr_region_allocation_bytes_: 8589934592
2020-06-20 18:29:37.491430: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats:
Limit:                  7053531546
InUse:                  4379901952
MaxInUse:               6784897280
NumAllocs:                    4587
MaxAllocSize:           2406266368

2020-06-20 18:29:37.491597: W tensorflow/core/common_runtime/bfc_allocator.cc:424] ******xx******************************************************************************************xx
2020-06-20 18:29:37.491705: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cwise_ops_common.cc:82 : Resource exhausted: OOM when allocating tensor with shape[5,5,99,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

Re: Brand new 2060 Super. Trouble training.

Post by calipheron »

bryanlyon wrote: Sat Jun 20, 2020 5:29 pm

Those errors could just mean that the data couldn't be loaded onto the card, or it could be that the drivers are misbehaving. My advice is to (in this order) install the latest drivers, reboot, close all other applications, and try again.

This is with a fresh install of the drivers, today.

I did choose "Studio" drivers, rather than the "normal" Geforce drivers.


Re: Brand new 2060 Super. Trouble training.

Post by calipheron »

Now I'm getting the following:

Code: Select all

020-06-20 19:00:06.112961: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2020-06-20 19:00:06.300959: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-06-20 19:00:06.301184: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-06-20 19:00:06.305818: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-06-20 19:00:06.306084: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-06-20 19:00:06.306303: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-06-20 19:00:06.306439: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-06-20 19:00:06.311920: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-06-20 19:00:07.197532: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2020-06-20 19:00:07.198153: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
06/20/2020 19:00:07 ERROR    Caught exception in thread: '_training_0'
06/20/2020 19:00:08 ERROR    Got Exception on main handler:
Traceback (most recent call last):
File "C:\programs\faceswap\lib\cli\launcher.py", line 155, in execute_script
process.process()
File "C:\programs\faceswap\scripts\train.py", line 161, in process
self._end_thread(thread, err)
File "C:\programs\faceswap\scripts\train.py", line 201, in _end_thread
thread.join()
File "C:\programs\faceswap\lib\multithreading.py", line 121, in join
raise thread.err[1].with_traceback(thread.err[2])
File "C:\programs\faceswap\lib\multithreading.py", line 37, in run
self._target(*self._args, **self._kwargs)
File "C:\programs\faceswap\scripts\train.py", line 226, in _training
raise err
File "C:\programs\faceswap\scripts\train.py", line 216, in _training
self._run_training_cycle(model, trainer)
File "C:\programs\faceswap\scripts\train.py", line 305, in _run_training_cycle
trainer.train_one_step(viewer, timelapse)
File "C:\programs\faceswap\plugins\train\trainer\_base.py", line 316, in train_one_step
raise err
File "C:\programs\faceswap\plugins\train\trainer\_base.py", line 283, in train_one_step
loss[side] = batcher.train_one_batch()
File "C:\programs\faceswap\plugins\train\trainer\_base.py", line 424, in train_one_batch
loss = self._model.predictors[self._side].train_on_batch(model_inputs, model_targets)
File "C:\Users\calipheron\MiniConda3\envs\264882\lib\site-packages\keras\engine\training.py", line 1217, in train_on_batch
outputs = self.train_function(ins)
File "C:\Users\calipheron\MiniConda3\envs\264882\lib\site-packages\keras\backend\tensorflow_backend.py", line 2715, in __call__
return self._call(inputs)
File "C:\Users\calipheron\MiniConda3\envs\264882\lib\site-packages\keras\backend\tensorflow_backend.py", line 2675, in _call
fetched = self._callable_fn(*array_vals)
File "C:\Users\calipheron\MiniConda3\envs\264882\lib\site-packages\tensorflow_core\python\client\session.py", line 1472, in __call__
run_metadata_ptr)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node encoder/conv_64_0_conv2d/convolution}}]]
[[loss/mul/_493]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node encoder/conv_64_0_conv2d/convolution}}]]

Re: Brand new 2060 Super. Trouble training.

Post by calipheron »

The CUDA errors in that last batch of messages tipped me off to the likely problem: Faceswap might have been configured for Nvidia, but I was still using the old conda environment from the "AMD install".

I completely removed Anaconda/Miniconda, Python and Faceswap, made sure no conda files were left in my AppData folders, rebooted, and did a fresh install of Faceswap.

First try with the same settings as before - boom, no errors.
Training, DLight model, BS of 4, using 7.7GB of 8GB VRAM.
EG/s of 9.8 - does this seem good?
No memory-saving options at all. Should I bother with allow growth any more?

Anyway, for anyone moving from AMD to NVIDIA GPUs as I have: I strongly recommend a completely fresh install of Faceswap, conda, Python - all of it!
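
One quick way to sanity-check that a rebuilt environment is actually picking up the CUDA-enabled build of TensorFlow (a minimal TF 1.x sketch run from inside the faceswap conda environment; not part of faceswap itself):

Code: Select all

import tensorflow as tf
from tensorflow.python.client import device_lib

# A /device:GPU:0 entry (the 2060 Super) should appear here if the
# NVIDIA build of tensorflow is the one installed in this environment.
print(device_lib.list_local_devices())
print("GPU available:", tf.test.is_gpu_available())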


Re: Brand new 2060 Super. Trouble training.

Post by calipheron »

Scratch that. Another crash.
I will try the latest Game Ready drivers instead.

Edit: Fresh install of the Game Ready drivers, v446.14, and after 40 minutes of training, no crashes.
So for faceswapping with a 2060 Super at least, I would say the Studio drivers are not recommended.

Edit: 8 hours of training with no problems, then:

Code: Select all

2020-06-21 04:55:31.940777: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-06-21 04:55:31.940897: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1

Re: Brand new 2060 Super. Trouble training.

Post by torzdf »

This looks like an upstream issue, to be honest ;(

https://github.com/tensorflow/tensorflow/issues/33536

My word is final


Re: Brand new 2060 Super. Trouble training.

Post by calipheron »

torzdf wrote: Sun Jun 21, 2020 10:03 am

This looks like an upstream issue, to be honest ;(

https://github.com/tensorflow/tensorflow/issues/33536

Thank you for replying, torzdf! I take it this means we have to wait for TensorFlow itself to receive a fix?

Well, so far I have found that, unlike with my RX 580, I basically have to run Faceswap with nothing else open, otherwise it crashes.

Iterations are much faster now, but I'm already missing the apparently much more graceful memory handling of PlaidML (??) on AMD hardware: with that I could have Firefox and a game open as well (with reduced performance, of course) and it never once crashed while training.

For clarity, my setup:
Ryzen 3700X, 32GB RAM, Asus 2060 Super 8GB, Win 10 Pro
Nvidia Game Ready Driver 446.14, conda 4.8.3, python 3.7.7, faceswap reports it is up to date as of now
Training with DLight, extended mask, no memory mitigation options / growth
BS of 4
Absolutely maxed out VRAM, but it trains without crashing as long as I run nothing else.

Edit:
I have found I HAVE to have "allow growth" enabled in the Extract options, otherwise faceswap fails on any extract operation.

Edit:
For anyone interested, I am now training with DFL-SAE, batch size 4, and that has worked reliably. EG/s of about 7-8.
