Get error when try to continue training under Linux

If training is failing to start, and you are not receiving an error message telling you what to do, tell us about it here


Locked

Get error when try to continue training under Linux

Post by lighting »

I use an AMD RX 580 GPU and was getting regular crashes under Windows (I suppose because of DirectML), so I installed Ubuntu to switch from DirectML to ROCm.
Changing the OS solved my problem with training crashes, but now I am facing a strange error when I try to continue training an existing model.

Code:

04/04/2023 18:57:06 INFO     ===================================================
04/04/2023 18:57:06 INFO       Starting
04/04/2023 18:57:06 INFO     ===================================================
04/04/2023 18:57:07 INFO     Loading data, this may take a while...
04/04/2023 18:57:07 INFO     Loading Model from Dfaker plugin...
04/04/2023 18:57:07 INFO     Config item: 'epsilon_exponent' has been updated from '-6' to '-7'
04/04/2023 18:57:07 INFO     Config item: 'convert_batchsize' has been updated from '8' to '16'
04/04/2023 18:57:07 INFO     Config item: 'eye_multiplier' has been updated from '3' to '1'
04/04/2023 18:57:07 INFO     Config item: 'mouth_multiplier' has been updated from '2' to '1'
04/04/2023 18:57:07 INFO     Using configuration saved in state file
04/04/2023 18:57:07 CRITICAL Error caught! Exiting...
04/04/2023 18:57:07 ERROR    Caught exception in thread: '_training'
ls: unable to access to '/home/lighting/.local/lib/python3.10/site-packages/cv2/../../lib64': no such file or directory
ls: unable to access to '/home/lighting/.local/lib/python3.10/site-packages/cv2/../../lib64': no such file or directory
04/04/2023 18:57:08 ERROR    Got Exception on main handler:
Traceback (most recent call last):
  File "/home/lighting/faceswap/lib/cli/launcher.py", line 230, in execute_script
    process.process()
  File "/home/lighting/faceswap/scripts/train.py", line 213, in process
    self._end_thread(thread, err)
  File "/home/lighting/faceswap/scripts/train.py", line 253, in _end_thread
    thread.join()
  File "/home/lighting/faceswap/lib/multithreading.py", line 220, in join
    raise thread.err[1].with_traceback(thread.err[2])
  File "/home/lighting/faceswap/lib/multithreading.py", line 96, in run
    self._target(*self._args, **self._kwargs)
  File "/home/lighting/faceswap/scripts/train.py", line 275, in _training
    raise err
  File "/home/lighting/faceswap/scripts/train.py", line 263, in _training
    model = self._load_model()
  File "/home/lighting/faceswap/scripts/train.py", line 291, in _load_model
    model.build()
  File "/home/lighting/faceswap/plugins/train/model/_base/model.py", line 304, in build
    model = self._io._load()  # pylint:disable=protected-access
  File "/home/lighting/faceswap/plugins/train/model/_base/io.py", line 152, in _load
    model = load_model(self._filename, compile=False)
  File "/home/lighting/.local/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/lighting/.local/lib/python3.10/site-packages/h5py/_hl/files.py", line 567, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
  File "/home/lighting/.local/lib/python3.10/site-packages/h5py/_hl/files.py", line 231, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 106, in h5py.h5f.open
OSError: Unable to open file (file signature not found)
04/04/2023 18:57:08 CRITICAL An unexpected crash has occurred. Crash report written to '/home/lighting/faceswap/crash_report.2023.04.04.185707983528.log'. You MUST provide this file if seeking assistance. Please verify you are running the latest version of faceswap before reporting
Process exited.

I don't understand why it tries to go two levels up (../../) in search of lib64. I can't see anything suspicious in the log file.
A newly created model starts training successfully. Any suggestions?

Attachments
crash_report.2023.04.04.184705037387.zip

Re: Get error when try to continue training under Linux

Post by torzdf »

The model file is corrupted. See here for mitigation steps:
viewtopic.php?p=8361#p8361

*Note: the "restore" tool is now called the "model" tool (with the restore option).
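
For anyone who wants to confirm the corruption before restoring: "file signature not found" means h5py could not find the 8-byte HDF5 signature in the file. Below is a minimal sketch of that check in Python. The function name `has_hdf5_signature` is made up for illustration and is not part of faceswap or h5py, and the sketch assumes the signature sits at the very start of the file (no user block), which I have not verified for every model file.

```python
# The 8-byte signature defined by the HDF5 file format.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def has_hdf5_signature(path):
    """Return True if the file at `path` starts with the HDF5 signature.

    Hypothetical helper for illustration only. It checks offset 0 and
    ignores files that place the signature after a user block.
    """
    with open(path, "rb") as handle:
        return handle.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE
```

Call it with the path to the .h5 file in your model folder. A result of `False` is consistent with the error above. A result of `True` only says the first 8 bytes are intact, not that the rest of the file is usable.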

Glad to know ROCm is working. I was never able to test it prior to implementation (no valid AMD card here).
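
On the `../../lib64` question: the `ls` lines are printed separately from the traceback, and the traceback ends in h5py rather than cv2, so they are probably unrelated to the crash. That is only a reading of the log above, not something confirmed. The small sketch below just collapses the `..` components to show which directory the message refers to. The path is copied from the log; `posixpath` is used so the result is the same on any OS.

```python
import posixpath

# Path copied verbatim from the "ls" message in the log.
reported = "/home/lighting/.local/lib/python3.10/site-packages/cv2/../../lib64"

# Collapse the ".." components to get the directory actually being listed.
resolved = posixpath.normpath(reported)
print(resolved)
```

So the two levels up from the cv2 folder land in the python3.10 directory, and the message says there is no lib64 folder there.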

