Critical Error After Long Training Session


Post by kholik »

I set up my input faces and began training, and after an all-day session it errored out.
Here's the log. It seems to be something to do with my GPU, but I'm not sure how to troubleshoot it. Any help would be much appreciated.

Code:

09/01/2019 20:25:12 MainProcess     training_0      multithreading  __init__                  DEBUG    Initializing MultiThread: (target: 'save_encoder', thread_count: 1)
09/01/2019 20:25:12 MainProcess     training_0      multithreading  __init__                  DEBUG    Initialized MultiThread: 'save_encoder'
09/01/2019 20:25:12 MainProcess     training_0      multithreading  __init__                  DEBUG    Initializing MultiThread: (target: 'save_state', thread_count: 1)
09/01/2019 20:25:12 MainProcess     training_0      multithreading  __init__                  DEBUG    Initialized MultiThread: 'save_state'
09/01/2019 20:25:12 MainProcess     training_0      multithreading  start                     DEBUG    Starting thread(s): 'save_decoder_a'
09/01/2019 20:25:12 MainProcess     training_0      multithreading  start                     DEBUG    Starting thread 1 of 1: 'save_decoder_a_0'
09/01/2019 20:25:12 MainProcess     save_decoder_a_0 _base           save                      DEBUG    Saving model: 'A:\Videos\Deepfakes\Training Model Dir\original_decoder_A.h5'
09/01/2019 20:25:12 MainProcess     training_0      multithreading  start                     DEBUG    Started all threads 'save_decoder_a': 1
09/01/2019 20:25:12 MainProcess     training_0      multithreading  start                     DEBUG    Starting thread(s): 'save_decoder_b'
09/01/2019 20:25:12 MainProcess     training_0      multithreading  start                     DEBUG    Starting thread 1 of 1: 'save_decoder_b_0'
09/01/2019 20:25:12 MainProcess     save_decoder_b_0 _base           save                      DEBUG    Saving model: 'A:\Videos\Deepfakes\Training Model Dir\original_decoder_B.h5'
09/01/2019 20:25:12 MainProcess     training_0      multithreading  start                     DEBUG    Started all threads 'save_decoder_b': 1
09/01/2019 20:25:12 MainProcess     training_0      multithreading  start                     DEBUG    Starting thread(s): 'save_encoder'
09/01/2019 20:25:12 MainProcess     training_0      multithreading  start                     DEBUG    Starting thread 1 of 1: 'save_encoder_0'
09/01/2019 20:25:12 MainProcess     save_encoder_0  _base           save                      DEBUG    Saving model: 'A:\Videos\Deepfakes\Training Model Dir\original_encoder.h5'
09/01/2019 20:25:12 MainProcess     training_0      multithreading  start                     DEBUG    Started all threads 'save_encoder': 1
09/01/2019 20:25:12 MainProcess     training_0      multithreading  start                     DEBUG    Starting thread(s): 'save_state'
09/01/2019 20:25:12 MainProcess     training_0      multithreading  start                     DEBUG    Starting thread 1 of 1: 'save_state_0'
09/01/2019 20:25:12 MainProcess     save_state_0    _base           save                      DEBUG    Saving State
09/01/2019 20:25:12 MainProcess     training_0      multithreading  start                     DEBUG    Started all threads 'save_state': 1
09/01/2019 20:25:12 MainProcess     training_0      multithreading  join                      DEBUG    Joining Threads: 'save_decoder_a'
09/01/2019 20:25:12 MainProcess     training_0      multithreading  join                      DEBUG    Joining Thread: 'save_decoder_a_0'
09/01/2019 20:25:12 MainProcess     save_state_0    _base           save                      DEBUG    Saved State
09/01/2019 20:25:13 MainProcess     training_0      multithreading  join                      DEBUG    Joined all Threads: 'save_decoder_a'
09/01/2019 20:25:13 MainProcess     training_0      multithreading  join                      DEBUG    Joining Threads: 'save_decoder_b'
09/01/2019 20:25:13 MainProcess     training_0      multithreading  join                      DEBUG    Joining Thread: 'save_decoder_b_0'
09/01/2019 20:25:14 MainProcess     training_0      multithreading  join                      DEBUG    Joined all Threads: 'save_decoder_b'
09/01/2019 20:25:14 MainProcess     training_0      multithreading  join                      DEBUG    Joining Threads: 'save_encoder'
09/01/2019 20:25:14 MainProcess     training_0      multithreading  join                      DEBUG    Joining Thread: 'save_encoder_0'
09/01/2019 20:25:22 MainProcess     training_0      multithreading  join                      DEBUG    Joined all Threads: 'save_encoder'
09/01/2019 20:25:22 MainProcess     training_0      multithreading  join                      DEBUG    Joining Threads: 'save_state'
09/01/2019 20:25:22 MainProcess     training_0      multithreading  join                      DEBUG    Joining Thread: 'save_state_0'
09/01/2019 20:25:22 MainProcess     training_0      multithreading  join                      DEBUG    Joined all Threads: 'save_state'
09/01/2019 20:25:22 MainProcess     training_0      _base           save_models               INFO     [Saved models] - Average since last save: face_loss_A: 0.02880, face_loss_B: 0.02933
09/01/2019 20:25:34 MainProcess     training_0      training_data   join_subprocess           DEBUG    Joining FixedProducerDispatcher
09/01/2019 20:25:34 SpawnProcess-2  MainThread      training_data   load_batches              DEBUG    Finished batching: (epoch: 2860992, side: 'a', is_display: False)
09/01/2019 20:25:34 SpawnProcess-2  MainThread      multithreading  _runner                   DEBUG    FixedProducerDispatcher worker for <bound method TrainingDataGenerator.load_batches of <lib.training_data.TrainingDataGenerator object at 0x0000020E6705B128>> shutdown
09/01/2019 20:25:34 MainProcess     training_0      training_data   join_subprocess           DEBUG    Joined FixedProducerDispatcher
09/01/2019 20:25:34 MainProcess     training_0      training_data   join_subprocess           DEBUG    Joining FixedProducerDispatcher
09/01/2019 20:25:34 SpawnProcess-3  MainThread      training_data   load_batches              DEBUG    Finished batching: (epoch: 2860992, side: 'b', is_display: False)
09/01/2019 20:25:34 SpawnProcess-3  MainThread      multithreading  _runner                   DEBUG    FixedProducerDispatcher worker for <bound method TrainingDataGenerator.load_batches of <lib.training_data.TrainingDataGenerator object at 0x0000013E46EAB128>> shutdown
09/01/2019 20:25:34 MainProcess     training_0      training_data   join_subprocess           DEBUG    Joined FixedProducerDispatcher
09/01/2019 20:25:34 MainProcess     training_0      multithreading  run                       DEBUG    Error in thread (training_0): GPU sync failed
09/01/2019 20:25:35 MainProcess     MainThread      train           monitor                   DEBUG    Thread error detected
09/01/2019 20:25:35 MainProcess     MainThread      train           monitor                   DEBUG    Closed Monitor
09/01/2019 20:25:35 MainProcess     MainThread      train           end_thread                DEBUG    Ending Training thread
09/01/2019 20:25:35 MainProcess     MainThread      train           end_thread                CRITICAL Error caught! Exiting...
09/01/2019 20:25:35 MainProcess     MainThread      multithreading  join                      DEBUG    Joining Threads: 'training'
09/01/2019 20:25:35 MainProcess     MainThread      multithreading  join                      DEBUG    Joining Thread: 'training_0'
09/01/2019 20:25:35 MainProcess     MainThread      multithreading  join                      ERROR    Caught exception in thread: 'training_0'
Traceback (most recent call last):
  File "C:\Users\ADMIN\faceswap\lib\cli.py", line 125, in execute_script
    process.process()
  File "C:\Users\ADMIN\faceswap\scripts\train.py", line 98, in process
    self.end_thread(thread, err)
  File "C:\Users\ADMIN\faceswap\scripts\train.py", line 124, in end_thread
    thread.join()
  File "C:\Users\ADMIN\faceswap\lib\multithreading.py", line 461, in join
    raise thread.err[1].with_traceback(thread.err[2])
  File "C:\Users\ADMIN\faceswap\lib\multithreading.py", line 392, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\ADMIN\faceswap\scripts\train.py", line 149, in training
    raise err
  File "C:\Users\ADMIN\faceswap\scripts\train.py", line 139, in training
    self.run_training_cycle(model, trainer)
  File "C:\Users\ADMIN\faceswap\scripts\train.py", line 221, in run_training_cycle
    trainer.train_one_step(viewer, timelapse)
  File "C:\Users\ADMIN\faceswap\plugins\train\trainer\_base.py", line 213, in train_one_step
    raise err
  File "C:\Users\ADMIN\faceswap\plugins\train\trainer\_base.py", line 178, in train_one_step
    loss[side] = batcher.train_one_batch(do_preview)
  File "C:\Users\ADMIN\faceswap\plugins\train\trainer\_base.py", line 278, in train_one_batch
    loss = self.model.predictors[self.side].train_on_batch(*batch)
  File "C:\Users\ADMIN\MiniConda3\envs\faceswap\lib\site-packages\keras\engine\training.py", line 1217, in train_on_batch
    outputs = self.train_function(ins)
  File "C:\Users\ADMIN\MiniConda3\envs\faceswap\lib\site-packages\keras\backend\tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "C:\Users\ADMIN\MiniConda3\envs\faceswap\lib\site-packages\keras\backend\tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "C:\Users\ADMIN\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\client\session.py", line 1458, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.InternalError: GPU sync failed

============ System Information ============
encoding:            cp1252
git_branch:          master
git_commits:         feedd2a More robust Crash Report messaging. 5bf54d9 Add configs and state file to crash report. 7184ae3 Merge branch 'master' of https://github.com/deepfakes/faceswap. 1a18241 Revert "Delete align_eyes.py". e6f17cd Delete align_eyes.py
gpu_cuda:            No global version found. Check Conda packages for Conda Cuda
gpu_cudnn:           No global version found. Check Conda packages for Conda cuDNN
gpu_devices:         GPU_0: GeForce RTX 2060
gpu_devices_active:  GPU_0
gpu_driver:          436.15
gpu_vram:            GPU_0: 6144MB
os_machine:          AMD64
os_platform:         Windows-10-10.0.18362-SP0
os_release:          10
py_command:          C:\Users\ADMIN\faceswap\faceswap.py train -A A:/Videos/Deepfakes/JP/TRAINING -ala A:/Videos/Deepfakes/JP/TRAINING/alignments.json -B A:/Videos/Deepfakes/ME/TRAINING -alb A:/Videos/Deepfakes/ME/TRAINING/alignments.json -m A:/Videos/Deepfakes/Training Model Dir -t original -bs 64 -it 1000000 -g 1 -s 100 -ss 25000 -ps 50 -L INFO -gui
py_conda_version:    conda 4.7.11
py_implementation:   CPython
py_version:          3.6.9
py_virtual_env:      True
sys_cores:           16
sys_processor:       AMD64 Family 23 Model 8 Stepping 2, AuthenticAMD
sys_ram:             Total: 32714MB, Available: 9731MB, Used: 22982MB, Free: 9731MB

=============== Pip Packages ===============
absl-py==0.7.1
astor==0.8.0
certifi==2019.6.16
cloudpickle==1.2.1
cycler==0.10.0
cytoolz==0.10.0
dask==2.3.0
decorator==4.4.0
fastcluster==1.1.25
ffmpy==0.2.2
gast==0.2.2
grpcio==1.16.1
h5py==2.9.0
imageio==2.5.0
imageio-ffmpeg==0.3.0
joblib==0.13.2
Keras==2.2.4
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
kiwisolver==1.1.0
Markdown==3.1.1
matplotlib==2.2.2
mkl-fft==1.0.14
mkl-random==1.0.2
mkl-service==2.0.2
networkx==2.3
numpy==1.16.2
nvidia-ml-py3==7.352.1
olefile==0.46
opencv-python==4.1.0.25
pathlib==1.0.1
Pillow==6.1.0
protobuf==3.8.0
psutil==5.6.3
pyparsing==2.4.2
pyreadline==2.1
python-dateutil==2.8.0
pytz==2019.2
PyWavelets==1.0.3
pywin32==223
PyYAML==5.1.2
scikit-image==0.15.0
scikit-learn==0.21.2
scipy==1.3.1
six==1.12.0
tensorboard==1.14.0
tensorflow==1.14.0
tensorflow-estimator==1.14.0
termcolor==1.1.0
toolz==0.10.0
toposort==1.5
tornado==6.0.3
tqdm==4.32.1
Werkzeug==0.15.5
wincertstore==0.2
wrapt==1.11.2
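
For anyone hitting the same "GPU sync failed" error, one quick sanity check is to confirm that TensorFlow can still see and run work on the GPU outside of faceswap. The snippet below is only a minimal sketch, not part of the original crash report, and it assumes the same TensorFlow 1.14 environment shown in the pip list above.

Code:

import tensorflow as tf

# Report whether TensorFlow can find a usable CUDA device at all.
print("GPU available:", tf.test.is_gpu_available())
print("GPU device:", tf.test.gpu_device_name())

# Run a tiny matmul pinned to the GPU. If the CUDA context or driver is in a
# bad state, this tends to fail with an InternalError much like the one in
# the traceback above.
with tf.device("/GPU:0"):
    a = tf.random.uniform((1024, 1024))
    b = tf.random.uniform((1024, 1024))
    c = tf.matmul(a, b)

with tf.Session() as sess:
    sess.run(c)
    print("GPU matmul OK")

If this check keeps failing after a reboot, the problem is more likely in the driver or hardware than in faceswap itself.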

Re: Critical Error After Long Training Session

Post by torzdf »

If you reboot and restart the training, does the issue persist?

My word is final

Re: Critical Error After Long Training Session

Post by kholik »

I did restart and resume training, and it's been solid for around 12 hours, so hopefully it was just a fluke.
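
If the error comes back on another long run, it can also help to watch VRAM usage and temperature while training. The sketch below is not from this thread; it assumes the nvidia-ml-py3 package already listed in the environment above and uses its standard pynvml calls to poll GPU 0 once a minute (stop it with Ctrl+C).

Code:

import time
import pynvml

pynvml.nvmlInit()
# Index 0 corresponds to GPU_0 (the GeForce RTX 2060) in the system info above.
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    while True:
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        print("VRAM used: {:.0f}/{:.0f} MB, temp: {} C".format(
            mem.used / 1024 ** 2, mem.total / 1024 ** 2, temp))
        time.sleep(60)
finally:
    pynvml.nvmlShutdown()

A steady climb in VRAM use, or temperatures far above normal late in an all-day session, would point to a resource or cooling problem rather than a faceswap bug.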
