Model training crashing on AWS - Memory Error

alexsoares
Joined: Tue Jun 22, 2021 7:50 pm

Post by alexsoares

Everything worked fine on my own notebook, where I was able to build some models. I am now using an instance on Amazon Web Services (AWS). I managed to install Faceswap, extract faces, sort them and adjust the alignments using the Manual menu. Now I am trying to train the model and the application crashes.

Can you help me please?

Please see below the messages from the message box and, right below them, a copy of the crash report.

06/22/2021 19:46:42 INFO     Log level set to: INFO
06/22/2021 19:46:44 INFO     Model A Directory: 'C:\Users\Administrator\Documents\new project\new extract sorted' (5364 images)
06/22/2021 19:46:44 INFO     Model B Directory: 'C:\Users\Administrator\Documents\kt faces extract' (338 images)
06/22/2021 19:46:44 INFO     Training data directory: C:\Users\Administrator\Documents\new project\original model
06/22/2021 19:46:44 INFO     ===================================================
06/22/2021 19:46:44 INFO       Starting
06/22/2021 19:46:44 INFO       Press 'Stop' to save and quit
06/22/2021 19:46:44 INFO     ===================================================
06/22/2021 19:46:45 INFO     Loading data, this may take a while...
06/22/2021 19:46:45 INFO     Loading Model from Original plugin...
06/22/2021 19:46:46 INFO     No existing state file found. Generating.
06/22/2021 19:46:50 INFO     Loading Trainer from Original plugin...

06/22/2021 19:47:10 CRITICAL Error caught! Exiting...
06/22/2021 19:47:10 ERROR    Caught exception in thread: '_training_0'
06/22/2021 19:47:13 ERROR    Got Exception on main handler:
Traceback (most recent call last):
  File "C:\Users\Administrator\faceswap\lib\cli\launcher.py", line 182, in execute_script
    process.process()
  File "C:\Users\Administrator\faceswap\scripts\train.py", line 190, in process
    self._end_thread(thread, err)
  File "C:\Users\Administrator\faceswap\scripts\train.py", line 230, in _end_thread
    thread.join()
  File "C:\Users\Administrator\faceswap\lib\multithreading.py", line 121, in join
    raise thread.err[1].with_traceback(thread.err[2])
  File "C:\Users\Administrator\faceswap\lib\multithreading.py", line 37, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\Administrator\faceswap\scripts\train.py", line 252, in _training
    raise err
  File "C:\Users\Administrator\faceswap\scripts\train.py", line 242, in _training
    self._run_training_cycle(model, trainer)
  File "C:\Users\Administrator\faceswap\scripts\train.py", line 340, in _run_training_cycle
    model.save()
  File "C:\Users\Administrator\faceswap\plugins\train\model\_base.py", line 401, in save
    self._io._save()  # pylint:disable=protected-access
  File "C:\Users\Administrator\faceswap\plugins\train\model\_base.py", line 597, in _save
    self._plugin.model.save(self._filename, include_optimizer=False)
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1978, in save
    save.save_model(self, filepath, overwrite, include_optimizer, save_format,
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\keras\saving\save.py", line 130, in save_model
    hdf5_format.save_model_to_hdf5(
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\keras\saving\hdf5_format.py", line 119, in save_model_to_hdf5
    save_weights_to_hdf5_group(model_weights_group, model_layers)
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\keras\saving\hdf5_format.py", line 636, in save_weights_to_hdf5_group
    weight_values = K.batch_get_value(weights)
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\util\dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\keras\backend.py", line 3518, in batch_get_value
    return [x.numpy() for x in tensors]
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\keras\backend.py", line 3518, in <listcomp>
    return [x.numpy() for x in tensors]
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\ops\resource_variable_ops.py", line 608, in numpy
    return self.read_value().numpy()
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\framework\ops.py", line 1064, in numpy
    return maybe_arr.copy() if isinstance(maybe_arr, np.ndarray) else maybe_arr
MemoryError: Unable to allocate 72.0 MiB for an array with shape (3, 3, 1024, 2048) and data type float32
06/22/2021 19:47:13 CRITICAL An unexpected crash has occurred. Crash report written to 'C:\Users\Administrator\faceswap\crash_report.2021.06.22.194710466076.log'. You MUST provide this file if seeking assistance. Please verify you are running the latest version of faceswap before reporting
Process exited.
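
For reference, the 72.0 MiB in the MemoryError matches the shape and data type it reports. Below is a minimal sketch of that arithmetic, assuming 4 bytes per float32 element; the helper name `array_size_mib` is hypothetical and not part of Faceswap or TensorFlow.

```python
from math import prod


def array_size_mib(shape, bytes_per_element=4):
    """Return the size in MiB of a dense array with the given shape.

    bytes_per_element defaults to 4, the width of one float32 value.
    """
    return prod(shape) * bytes_per_element / (1024 ** 2)


# Shape taken from the MemoryError message in the log above.
size = array_size_mib((3, 3, 1024, 2048))
print(f"{size:.1f} MiB")  # 72.0 MiB
```

So the failing request is for a single weight array of modest size, made while the model was being saved. That may mean the process was already close to a memory limit rather than that this one array is unusually large, but the log alone does not confirm this.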

CRASH REPORT:

06/22/2021 19:46:54 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BBFF5550>, weight: 2.0, mask_channel: 5)
06/22/2021 19:46:54 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 5
06/22/2021 19:46:54 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BBFF5D30>, weight: 1.0, mask_channel: 2)
06/22/2021 19:46:54 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 2
06/22/2021 19:46:54 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC01A550>, weight: 1.0, mask_channel: 3)
06/22/2021 19:46:54 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 3
06/22/2021 19:46:54 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC01AD30>, weight: 1.0, mask_channel: 3)
06/22/2021 19:46:54 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 3
06/22/2021 19:46:54 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC3F3250>, weight: 3.0, mask_channel: 4)
06/22/2021 19:46:54 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 4
06/22/2021 19:46:54 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC3F3850>, weight: 1.0, mask_channel: 1)
06/22/2021 19:46:54 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 1
06/22/2021 19:46:54 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC3D2580>, weight: 2.0, mask_channel: 5)
06/22/2021 19:46:54 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 5
06/22/2021 19:46:54 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC3D2D60>, weight: 1.0, mask_channel: 2)
06/22/2021 19:46:54 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 2
06/22/2021 19:46:57 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC4FB8E0>, weight: 1.0, mask_channel: 3)
06/22/2021 19:46:57 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 3
06/22/2021 19:46:57 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC02C910>, weight: 1.0, mask_channel: 3)
06/22/2021 19:46:57 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 3
06/22/2021 19:46:57 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC0103A0>, weight: 3.0, mask_channel: 4)
06/22/2021 19:46:57 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 4
06/22/2021 19:46:57 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC010D60>, weight: 1.0, mask_channel: 1)
06/22/2021 19:46:57 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 1
06/22/2021 19:46:57 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BBFF5550>, weight: 2.0, mask_channel: 5)
06/22/2021 19:46:57 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 5
06/22/2021 19:46:57 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BBFF5D30>, weight: 1.0, mask_channel: 2)
06/22/2021 19:46:57 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 2
06/22/2021 19:46:58 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC01A550>, weight: 1.0, mask_channel: 3)
06/22/2021 19:46:58 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 3
06/22/2021 19:46:58 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC01AD30>, weight: 1.0, mask_channel: 3)
06/22/2021 19:46:58 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 3
06/22/2021 19:46:58 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC3F3250>, weight: 3.0, mask_channel: 4)
06/22/2021 19:46:58 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 4
06/22/2021 19:46:58 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC3F3850>, weight: 1.0, mask_channel: 1)
06/22/2021 19:46:58 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 1
06/22/2021 19:46:58 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC3D2580>, weight: 2.0, mask_channel: 5)
06/22/2021 19:46:58 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 5
06/22/2021 19:46:58 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC3D2D60>, weight: 1.0, mask_channel: 2)
06/22/2021 19:46:58 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 2
06/22/2021 19:47:07 MainProcess     _training_0                    _base           generate_preview               DEBUG    Generating preview
06/22/2021 19:47:07 MainProcess     _training_0                    _base           compile_sample                 DEBUG    Compiling samples: (side: 'a', samples: 14)
06/22/2021 19:47:07 MainProcess     _training_0                    _base           compile_sample                 DEBUG    Compiling samples: (side: 'b', samples: 14)
06/22/2021 19:47:07 MainProcess     _training_0                    _base           show_sample                    DEBUG    Showing sample
06/22/2021 19:47:07 MainProcess     _training_0                    _base           _get_predictions               DEBUG    Getting Predictions
06/22/2021 19:47:07 MainProcess     _run_1                         generator       cache_metadata                 DEBUG    All metadata already cached for: ['01451.png', '03624.png', '00100.png', '02930.png', '05609.png', '02460.png', '02356.png', '03213.png', '04926.png', '04229.png', '04143.png', '03712.png', '01126.png', '04755.png']
06/22/2021 19:47:07 MainProcess     _run_1                         generator       cache_metadata                 DEBUG    All metadata already cached for: ['33622072_0_0.png', '02270001_0_0.png', '_DSC4103_0_0.png', '12140065_0_0.png', '_DSC4302_0_0.png', '12140023_0_0.png', '_DSC5958_0_0.png', 'Copy (3) of DSC_0029_0_0.png', '_DSC4105_0_0.png', '12140004_0_0.png', '_DSC4259_0_0.png', '20_0_0.png', '04160001_0_0.png', '_DSC4096_0_0.png']
06/22/2021 19:47:08 MainProcess     _training_0                    _base           _get_predictions               DEBUG    Returning predictions: {'a_a': (14, 64, 64, 4), 'b_b': (14, 64, 64, 4), 'a_b': (14, 64, 64, 4), 'b_a': (14, 64, 64, 4)}
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _to_full_frame                 DEBUG    side: 'a', number of sample arrays: 3, prediction.shapes: [(14, 64, 64, 4), (14, 64, 64, 4)])
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _process_full                  DEBUG    full_size: 384, prediction_size: 64, color: (0, 0, 255)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _resize_sample                 DEBUG    Resizing sample: (side: 'a', sample.shape: (14, 384, 384, 3), target_size: 92, scale: 0.23958333333333334)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _resize_sample                 DEBUG    Resized sample: (side: 'a' shape: (14, 92, 92, 3))
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _process_full                  DEBUG    Overlayed background. Shape: (14, 92, 92, 3)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _compile_masked                DEBUG    masked shapes: [(14, 64, 64, 3), (14, 64, 64, 3), (14, 64, 64, 3)]
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _overlay_foreground            DEBUG    Overlayed foreground. Shape: (14, 92, 92, 3)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _overlay_foreground            DEBUG    Overlayed foreground. Shape: (14, 92, 92, 3)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _overlay_foreground            DEBUG    Overlayed foreground. Shape: (14, 92, 92, 3)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _get_headers                   DEBUG    side: 'a', width: 92
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _get_headers                   DEBUG    height: 20, total_width: 276
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _get_headers                   DEBUG    texts: ['Original (A)', 'Original > Original', 'Original > Swap'], text_sizes: [(52, 7), (84, 7), (73, 7)], text_x: [20, 96, 193], text_y: 13
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _get_headers                   DEBUG    header_box.shape: (20, 276, 3)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _to_full_frame                 DEBUG    side: 'b', number of sample arrays: 3, prediction.shapes: [(14, 64, 64, 4), (14, 64, 64, 4)])
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _process_full                  DEBUG    full_size: 384, prediction_size: 64, color: (0, 0, 255)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _resize_sample                 DEBUG    Resizing sample: (side: 'b', sample.shape: (14, 384, 384, 3), target_size: 92, scale: 0.23958333333333334)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _resize_sample                 DEBUG    Resized sample: (side: 'b' shape: (14, 92, 92, 3))
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _process_full                  DEBUG    Overlayed background. Shape: (14, 92, 92, 3)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _compile_masked                DEBUG    masked shapes: [(14, 64, 64, 3), (14, 64, 64, 3), (14, 64, 64, 3)]
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _overlay_foreground            DEBUG    Overlayed foreground. Shape: (14, 92, 92, 3)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _overlay_foreground            DEBUG    Overlayed foreground. Shape: (14, 92, 92, 3)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _overlay_foreground            DEBUG    Overlayed foreground. Shape: (14, 92, 92, 3)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _get_headers                   DEBUG    side: 'b', width: 92
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _get_headers                   DEBUG    height: 20, total_width: 276
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _get_headers                   DEBUG    texts: ['Swap (B)', 'Swap > Swap', 'Swap > Original'], text_sizes: [(43, 7), (63, 7), (73, 7)], text_x: [24, 106, 193], text_y: 13
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _get_headers                   DEBUG    header_box.shape: (20, 276, 3)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _duplicate_headers             DEBUG    side: a header.shape: (20, 276, 3)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _duplicate_headers             DEBUG    side: b header.shape: (20, 276, 3)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _stack_images                  DEBUG    Stack images
06/22/2021 19:47:09 MainProcess     _training_0                    _base           get_transpose_axes             DEBUG    Even number of images to stack
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _stack_images                  DEBUG    Stacked images
06/22/2021 19:47:09 MainProcess     _training_0                    _base           show_sample                    DEBUG    Compiled sample
06/22/2021 19:47:09 MainProcess     _training_0                    train           _show                          DEBUG    Updating preview: (name: Training - 'S': Save Now. 'R': Refresh Preview. 'M': Toggle Mask. 'ENTER': Save and Quit)
06/22/2021 19:47:09 MainProcess     _training_0                    train           _show                          DEBUG    Generating preview for GUI
06/22/2021 19:47:09 MainProcess     _training_0                    train           _show                          DEBUG    Generated preview for GUI: '.gui_training_preview.jpg'
06/22/2021 19:47:09 MainProcess     _training_0                    train           _show                          DEBUG    Updated preview: (name: Training - 'S': Save Now. 'R': Refresh Preview. 'M': Toggle Mask. 'ENTER': Save and Quit)
06/22/2021 19:47:09 MainProcess     _training_0                    train           _run_training_cycle            DEBUG    Save Iteration: (iteration: 1
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _save                          DEBUG    Backing up and saving models
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _get_save_averages             DEBUG    Getting save averages
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _get_save_averages             DEBUG    Average losses since last save: [0.8635251820087433, 0.7475658655166626]
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _should_backup                 DEBUG    Set initial save iteration loss average for 'a': 0.8635251820087433
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _should_backup                 DEBUG    Set initial save iteration loss average for 'b': 0.7475658655166626
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _should_backup                 DEBUG    Updated lowest historical save iteration averages from: {'a': 0.8635251820087433, 'b': 0.7475658655166626} to: {'a': 0.8635251820087433, 'b': 0.7475658655166626}
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _should_backup                 DEBUG    Should backup: True
06/22/2021 19:47:09 MainProcess     _training_0                    multithreading  run                            DEBUG    Error in thread (_training_0): Unable to allocate 72.0 MiB for an array with shape (3, 3, 1024, 2048) and data type float32
06/22/2021 19:47:10 MainProcess     MainThread                     train           _monitor                       DEBUG    Thread error detected
06/22/2021 19:47:10 MainProcess     MainThread                     train           _monitor                       DEBUG    Closed Monitor
06/22/2021 19:47:10 MainProcess     MainThread                     train           _end_thread                    DEBUG    Ending Training thread
06/22/2021 19:47:10 MainProcess     MainThread                     train           _end_thread                    CRITICAL Error caught! Exiting...
06/22/2021 19:47:10 MainProcess     MainThread                     multithreading  join                           DEBUG    Joining Threads: '_training'
06/22/2021 19:47:10 MainProcess     MainThread                     multithreading  join                           DEBUG    Joining Thread: '_training_0'
06/22/2021 19:47:10 MainProcess     MainThread                     multithreading  join                           ERROR    Caught exception in thread: '_training_0'
Traceback (most recent call last):
  File "C:\Users\Administrator\faceswap\lib\cli\launcher.py", line 182, in execute_script
    process.process()
  File "C:\Users\Administrator\faceswap\scripts\train.py", line 190, in process
    self._end_thread(thread, err)
  File "C:\Users\Administrator\faceswap\scripts\train.py", line 230, in _end_thread
    thread.join()
  File "C:\Users\Administrator\faceswap\lib\multithreading.py", line 121, in join
    raise thread.err[1].with_traceback(thread.err[2])
  File "C:\Users\Administrator\faceswap\lib\multithreading.py", line 37, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\Administrator\faceswap\scripts\train.py", line 252, in _training
    raise err
  File "C:\Users\Administrator\faceswap\scripts\train.py", line 242, in _training
    self._run_training_cycle(model, trainer)
  File "C:\Users\Administrator\faceswap\scripts\train.py", line 340, in _run_training_cycle
    model.save()
  File "C:\Users\Administrator\faceswap\plugins\train\model\_base.py", line 401, in save
    self._io._save()  # pylint:disable=protected-access
  File "C:\Users\Administrator\faceswap\plugins\train\model\_base.py", line 597, in _save
    self._plugin.model.save(self._filename, include_optimizer=False)
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1978, in save
    save.save_model(self, filepath, overwrite, include_optimizer, save_format,
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\keras\saving\save.py", line 130, in save_model
    hdf5_format.save_model_to_hdf5(
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\keras\saving\hdf5_format.py", line 119, in save_model_to_hdf5
    save_weights_to_hdf5_group(model_weights_group, model_layers)
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\keras\saving\hdf5_format.py", line 636, in save_weights_to_hdf5_group
    weight_values = K.batch_get_value(weights)
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\util\dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\keras\backend.py", line 3518, in batch_get_value
    return [x.numpy() for x in tensors]
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\keras\backend.py", line 3518, in <listcomp>
    return [x.numpy() for x in tensors]
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\ops\resource_variable_ops.py", line 608, in numpy
    return self.read_value().numpy()
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\framework\ops.py", line 1064, in numpy
    return maybe_arr.copy() if isinstance(maybe_arr, np.ndarray) else maybe_arr
MemoryError: Unable to allocate 72.0 MiB for an array with shape (3, 3, 1024, 2048) and data type float32

============ System Information ============
encoding:            cp1252
git_branch:          master
git_commits:         55bb723 New Model: Phaze-A
gpu_cuda:            No global version found. Check Conda packages for Conda Cuda
gpu_cudnn:           No global version found. Check Conda packages for Conda cuDNN
gpu_devices:         GPU_0: Tesla T4
gpu_devices_active:  GPU_0
gpu_driver:          461.40
gpu_vram:            GPU_0: 15360MB
os_machine:          AMD64
os_platform:         Windows-10-10.0.17763-SP0
os_release:          10
py_command:          C:\Users\Administrator\faceswap\faceswap.py train -A C:/Users/Administrator/Documents/new project/new extract sorted -B C:/Users/Administrator/Documents/kt faces extract -m C:/Users/Administrator/Documents/new project/original model -t original -bs 4 -it 500000 -s 250 -ss 25000 -ps 100 -L INFO -gui
py_conda_version:    conda 4.10.1
py_implementation:   CPython
py_version:          3.8.10
py_virtual_env:      True
sys_cores:           4
sys_processor:       Intel64 Family 6 Model 85 Stepping 7, GenuineIntel
sys_ram:             Total: 16083MB, Available: 11639MB, Used: 4443MB, Free: 11639MB

=============== Pip Packages ===============
absl-py @ file:///C:/ci/absl-py_1615411229697/work
aiohttp @ file:///C:/ci/aiohttp_1614361024229/work
astor==0.8.1
astunparse==1.6.3
async-timeout==3.0.1
attrs @ file:///tmp/build/80754af9/attrs_1620827162558/work
blinker==1.4
brotlipy==0.7.0
cachetools @ file:///tmp/build/80754af9/cachetools_1619597386817/work
certifi==2021.5.30
cffi @ file:///C:/ci/cffi_1613247279197/work
chardet @ file:///C:/ci/chardet_1605303225733/work
click @ file:///tmp/build/80754af9/click_1621604852318/work
coverage @ file:///C:/ci/coverage_1614615074147/work
cryptography @ file:///C:/ci/cryptography_1616769344312/work
cycler==0.10.0
Cython @ file:///C:/ci/cython_1618435363327/work
fastcluster==1.1.26
ffmpy==0.2.3
gast @ file:///tmp/build/80754af9/gast_1597433534803/work
google-auth @ file:///tmp/build/80754af9/google-auth_1623354748502/work
google-auth-oauthlib @ file:///tmp/build/80754af9/google-auth-oauthlib_1617120569401/work
google-pasta==0.2.0
grpcio @ file:///C:/ci/grpcio_1614884412260/work
h5py==2.10.0
idna @ file:///home/linux1/recipes/ci/idna_1610986105248/work
imageio @ file:///tmp/build/80754af9/imageio_1617700267927/work
imageio-ffmpeg @ file:///home/conda/feedstock_root/build_artifacts/imageio-ffmpeg_1621542018480/work
importlib-metadata @ file:///C:/ci/importlib-metadata_1617877484576/work
joblib @ file:///tmp/build/80754af9/joblib_1613502643832/work
Keras-Applications @ file:///tmp/build/80754af9/keras-applications_1594366238411/work
Keras-Preprocessing @ file:///tmp/build/80754af9/keras-preprocessing_1612283640596/work
kiwisolver @ file:///C:/ci/kiwisolver_1612282606037/work
Markdown @ file:///C:/ci/markdown_1614364121613/work
matplotlib @ file:///C:/ci/matplotlib-base_1592837548929/work
mkl-fft==1.3.0
mkl-random==1.1.1
mkl-service==2.3.0
multidict @ file:///C:/ci/multidict_1607362065515/work
numpy @ file:///C:/ci/numpy_and_numpy_base_1603466732592/work
nvidia-ml-py3 @ git+https://github.com/deepfakes/nvidia-ml-py3.git@6fc29ac84b32bad877f078cb4a777c1548a00bf6
oauthlib==3.1.0
olefile==0.46
opencv-python==4.5.2.54
opt-einsum @ file:///tmp/build/80754af9/opt_einsum_1621500238896/work
pathlib==1.0.1
Pillow @ file:///C:/ci/pillow_1617386341487/work
protobuf==3.14.0
psutil @ file:///C:/ci/psutil_1612298324802/work
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser @ file:///tmp/build/80754af9/pycparser_1594388511720/work
PyJWT==1.7.1
pyOpenSSL @ file:///tmp/build/80754af9/pyopenssl_1608057966937/work
pyparsing @ file:///home/linux1/recipes/ci/pyparsing_1610983426697/work
pyreadline==2.1
PySocks @ file:///C:/ci/pysocks_1605287845585/work
python-dateutil @ file:///home/ktietz/src/ci/python-dateutil_1611928101742/work
pywin32==227
requests @ file:///tmp/build/80754af9/requests_1608241421344/work
requests-oauthlib==1.3.0
rsa @ file:///tmp/build/80754af9/rsa_1614366226499/work
scikit-learn @ file:///C:/ci/scikit-learn_1622739500535/work
scipy @ file:///C:/ci/scipy_1616703433439/work
sip==4.19.13
six @ file:///tmp/build/80754af9/six_1623709665295/work
tensorboard @ file:///home/builder/ktietz/aggregate/tensorflow_recipes/ci_te/tensorboard_1614593728657/work/tmp_pip_dir
tensorboard-plugin-wit==1.6.0
tensorflow==2.3.0
tensorflow-estimator @ file:///home/builder/ktietz/aggregate/tensorflow_recipes/ci_baze37/tensorflow-estimator_1622026529081/work/tensorflow_estimator-2.5.0-py2.py3-none-any.whl
termcolor==1.1.0
threadpoolctl @ file:///tmp/tmp9twdgx9k/threadpoolctl-2.1.0-py3-none-any.whl
tornado @ file:///C:/ci/tornado_1606942392901/work
tqdm @ file:///tmp/build/80754af9/tqdm_1615925068909/work
typing-extensions @ file:///tmp/build/80754af9/typing_extensions_1611751222202/work
urllib3 @ file:///tmp/build/80754af9/urllib3_1615837158687/work
Werkzeug @ file:///home/ktietz/src/ci/werkzeug_1611932622770/work
win-inet-pton @ file:///C:/ci/win_inet_pton_1605306167264/work
wincertstore==0.2
wrapt==1.12.1
yarl @ file:///C:/ci/yarl_1606940076464/work
zipp @ file:///tmp/build/80754af9/zipp_1615904174917/work

============== Conda Packages ==============
# packages in environment at C:\Users\Administrator\MiniConda3\envs\faceswap:
#
# Name                    Version                   Build  Channel
_tflow_select             2.3.0                       gpu  
absl-py                   0.12.0           py38haa95532_0  
aiohttp                   3.7.4            py38h2bbff1b_1  
astor                     0.8.1            py38haa95532_0  
astunparse                1.6.3                      py_0  
async-timeout             3.0.1            py38haa95532_0  
attrs                     21.2.0             pyhd3eb1b0_0  
blas                      1.0                         mkl  
blinker                   1.4              py38haa95532_0  
brotlipy                  0.7.0           py38h2bbff1b_1003  
ca-certificates           2021.5.25            haa95532_1  
cachetools                4.2.2              pyhd3eb1b0_0  
certifi                   2021.5.30        py38haa95532_0  
cffi                      1.14.5           py38hcd4344a_0  
chardet                   3.0.4           py38haa95532_1003  
click                     8.0.1              pyhd3eb1b0_0  
coverage                  5.5              py38h2bbff1b_2  
cryptography              3.4.7            py38h71e12ea_0  
cudatoolkit               10.1.243             h74a9793_0  
cudnn                     7.6.5                cuda10.1_0  
cycler                    0.10.0                   py38_0  
cython                    0.29.23          py38hd77b12b_0  
fastcluster               1.1.26           py38h251f6bf_2    conda-forge
ffmpeg                    4.3.1                ha925a31_0    conda-forge
ffmpy                     0.2.3                    pypi_0    pypi
freetype                  2.10.4               hd328e21_0  
gast                      0.4.0                      py_0  
git                       2.23.0               h6bb4b03_0  
google-auth               1.31.0             pyhd3eb1b0_0  
google-auth-oauthlib      0.4.4              pyhd3eb1b0_0  
google-pasta              0.2.0                      py_0  
grpcio                    1.36.1           py38hc60d5dd_1  
h5py                      2.10.0           py38h5e291fa_0  
hdf5                      1.10.4               h7ebc959_0  
icc_rt                    2019.0.0             h0cc432a_1  
icu                       58.2                 ha925a31_3  
idna                      2.10               pyhd3eb1b0_0  
imageio                   2.9.0              pyhd3eb1b0_0  
imageio-ffmpeg            0.4.4              pyhd8ed1ab_0    conda-forge
importlib-metadata        3.10.0           py38haa95532_0  
intel-openmp              2021.2.0           haa95532_616  
joblib                    1.0.1              pyhd3eb1b0_0  
jpeg                      9b                   hb83a4c4_2  
keras-applications        1.0.8                      py_1  
keras-preprocessing       1.1.2              pyhd3eb1b0_0  
kiwisolver                1.3.1            py38hd77b12b_0  
libpng                    1.6.37               h2a8f88b_0  
libprotobuf               3.14.0               h23ce68f_0  
libtiff                   4.2.0                hd0e1b90_0  
lz4-c                     1.9.3                h2bbff1b_0  
markdown                  3.3.4            py38haa95532_0  
matplotlib                3.2.2                         0  
matplotlib-base           3.2.2            py38h64f37c6_0  
mkl                       2020.2                      256  
mkl-service               2.3.0            py38h196d8e1_0  
mkl_fft                   1.3.0            py38h46781fe_0  
mkl_random                1.1.1            py38h47e9c7a_0  
multidict                 5.1.0            py38h2bbff1b_2  
numpy                     1.19.2           py38hadc3359_0  
numpy-base                1.19.2           py38ha3acd2a_0  
nvidia-ml-py3             7.352.1                  pypi_0    pypi
oauthlib                  3.1.0                      py_0  
olefile                   0.46                       py_0  
opencv-python             4.5.2.54                 pypi_0    pypi
openssl                   1.1.1k               h2bbff1b_0  
opt_einsum                3.3.0              pyhd3eb1b0_1  
pathlib                   1.0.1                      py_1  
pillow                    8.2.0            py38h4fa10fc_0  
pip                       21.1.2           py38haa95532_0  
protobuf                  3.14.0           py38hd77b12b_1  
psutil                    5.8.0            py38h2bbff1b_1  
pyasn1                    0.4.8                      py_0  
pyasn1-modules            0.2.8                      py_0  
pycparser                 2.20                       py_2  
pyjwt                     1.7.1                    py38_0  
pyopenssl                 20.0.1             pyhd3eb1b0_1  
pyparsing                 2.4.7              pyhd3eb1b0_0  
pyqt                      5.9.2            py38ha925a31_4  
pyreadline                2.1                      py38_1  
pysocks                   1.7.1            py38haa95532_0  
python                    3.8.10               hdbf39b2_7  
python-dateutil           2.8.1              pyhd3eb1b0_0  
python_abi                3.8                      1_cp38    conda-forge
pywin32                   227              py38he774522_1  
qt                        5.9.7            vc14h73c81de_0  
requests                  2.25.1             pyhd3eb1b0_0  
requests-oauthlib         1.3.0                      py_0  
rsa                       4.7.2              pyhd3eb1b0_1  
scikit-learn              0.24.2           py38hf11a4ad_1  
scipy                     1.6.2            py38h14eb087_0  
setuptools                52.0.0           py38haa95532_0  
sip                       4.19.13          py38ha925a31_0  
six                       1.16.0             pyhd3eb1b0_0  
sqlite                    3.35.4               h2bbff1b_0  
tensorboard               2.4.0              pyhc547734_0  
tensorboard-plugin-wit    1.6.0                      py_0  
tensorflow                2.3.0           mkl_py38h1fcfbd6_0  
tensorflow-base           2.3.0           gpu_py38h7339f5a_0  
tensorflow-estimator      2.5.0              pyh7b7c402_0  
tensorflow-gpu            2.3.0                he13fc11_0  
termcolor                 1.1.0            py38haa95532_1  
threadpoolctl             2.1.0              pyh5ca1d4c_0  
tk                        8.6.10               he774522_0  
tornado                   6.1              py38h2bbff1b_0  
tqdm                      4.59.0             pyhd3eb1b0_1  
typing-extensions         3.7.4.3              hd3eb1b0_0  
typing_extensions         3.7.4.3            pyh06a4308_0  
urllib3                   1.26.4             pyhd3eb1b0_0  
vc                        14.2                 h21ff451_1  
vs2015_runtime            14.27.29016          h5e58377_2  
werkzeug                  1.0.1              pyhd3eb1b0_0  
wheel                     0.36.2             pyhd3eb1b0_0  
win_inet_pton             1.1.0            py38haa95532_0  
wincertstore              0.2                      py38_0  
wrapt                     1.12.1           py38he774522_1  
xz                        5.2.5                h62dcd97_0  
yarl                      1.6.3            py38h2bbff1b_0  
zipp                      3.4.1              pyhd3eb1b0_0  
zlib                      1.2.11               h62dcd97_4  
zstd                      1.4.9                h19a0ad4_0  

================= Configs ==================
--------- .faceswap ---------
backend:                  nvidia

--------- convert.ini ---------

[color.color_transfer]
clip:                     True
preserve_paper:           True

[color.manual_balance]
colorspace:               HSV
balance_1:                0.0
balance_2:                0.0
balance_3:                0.0
contrast:                 0.0
brightness:               0.0

[color.match_hist]
threshold:                99.0

[mask.box_blend]
type:                     gaussian
distance:                 11.0
radius:                   5.0
passes:                   1

[mask.mask_blend]
type:                     normalized
kernel_size:              3
passes:                   4
threshold:                4
erosion:                  0.0

[scaling.sharpen]
method:                   none
amount:                   150
radius:                   0.3
threshold:                5.0

[writer.ffmpeg]
container:                mp4
codec:                    libx264
crf:                      23
preset:                   medium
tune:                     none
profile:                  auto
level:                    auto
skip_mux:                 False

[writer.gif]
fps:                      25
loop:                     0
palettesize:              256
subrectangles:            False

[writer.opencv]
format:                   png
draw_transparent:         False
jpg_quality:              75
png_compress_level:       3

[writer.pillow]
format:                   png
draw_transparent:         False
optimize:                 False
gif_interlace:            True
jpg_quality:              75
png_compress_level:       3
tif_compression:          tiff_deflate

--------- extract.ini ---------

[global]
allow_growth:             True

[align.fan]
batch-size:               12

[detect.cv2_dnn]
confidence:               50

[detect.mtcnn]
minsize:                  20
scalefactor:              0.709
batch-size:               8
threshold_1:              0.6
threshold_2:              0.7
threshold_3:              0.7

[detect.s3fd]
confidence:               70
batch-size:               4

[mask.bisenet_fp]
batch-size:               8
include_ears:             False
include_hair:             False
include_glasses:          True

[mask.unet_dfl]
batch-size:               8

[mask.vgg_clear]
batch-size:               6

[mask.vgg_obstructed]
batch-size:               2

--------- gui.ini ---------

[global]
fullscreen:               False
tab:                      extract
options_panel_width:      30
console_panel_height:     20
icon_size:                14
font:                     default
font_size:                9
autosave_last_session:    prompt
timeout:                  120
auto_load_model_stats:    True

--------- train.ini ---------

[global]
centering:                face
coverage:                 68.75
icnr_init:                False
conv_aware_init:          False
optimizer:                adam
learning_rate:            5e-05
epsilon_exponent:         -7
reflect_padding:          False
allow_growth:             False
mixed_precision:          False
nan_protection:           True
convert_batchsize:        16

[global.loss]
loss_function:            ssim
mask_loss_function:       mse
l2_reg_term:              100
eye_multiplier:           3
mouth_multiplier:         2
penalized_mask_loss:      True
mask_type:                extended
mask_blur_kernel:         3
mask_threshold:           4
learn_mask:               True

[model.dfaker]
output_size:              128

[model.dfl_h128]
lowmem:                   False

[model.dfl_sae]
input_size:               128
clipnorm:                 True
architecture:             df
autoencoder_dims:         0
encoder_dims:             42
decoder_dims:             21
multiscale_decoder:       False

[model.dlight]
features:                 best
details:                  good
output_size:              256

[model.original]
lowmem:                   True

[model.phaze_a]
output_size:              128
shared_fc:                None
enable_gblock:            True
split_fc:                 True
split_gblock:             False
split_decoders:           False
enc_architecture:         fs_original
enc_scaling:              40
enc_load_weights:         True
bottleneck_type:          dense
bottleneck_norm:          None
bottleneck_size:          1024
bottleneck_in_encoder:    True
fc_depth:                 1
fc_min_filters:           1024
fc_max_filters:           1024
fc_dimensions:            4
fc_filter_slope:          -0.5
fc_dropout:               0.0
fc_upsampler:             upsample2d
fc_upsamples:             1
fc_upsample_filters:      512
fc_gblock_depth:          3
fc_gblock_min_nodes:      512
fc_gblock_max_nodes:      512
fc_gblock_filter_slope:   -0.5
fc_gblock_dropout:        0.0
dec_upscale_method:       subpixel
dec_norm:                 None
dec_min_filters:          64
dec_max_filters:          512
dec_filter_slope:         -0.45
dec_res_blocks:           1
dec_output_kernel:        5
dec_gaussian:             True
dec_skip_last_residual:   True
freeze_layers:            keras_encoder
load_layers:              encoder
fs_original_depth:        4
fs_original_min_filters:  128
fs_original_max_filters:  1024
mobilenet_width:          1.0
mobilenet_depth:          1
mobilenet_dropout:        0.001

[model.realface]
input_size:               64
output_size:              128
dense_nodes:              1536
complexity_encoder:       128
complexity_decoder:       512

[model.unbalanced]
input_size:               128
lowmem:                   False
clipnorm:                 True
nodes:                    1024
complexity_encoder:       128
complexity_decoder_a:     384
complexity_decoder_b:     512

[model.villain]
lowmem:                   False

[trainer.original]
preview_images:           14
zoom_amount:              5
rotation_range:           10
shift_range:              5
flip_chance:              50
color_lightness:          30
color_ab:                 8
color_clahe_chance:       50
color_clahe_max_size:     4
torzdf
Posts: 2636
Joined: Fri Jul 12, 2019 12:53 am

Re: Model training crashing on AWS

Post by torzdf »

This is a memory error, and the cause lies with the host machine. There is clearly plenty of memory available, so I do not know why this error would occur. However, it may be related to how AWS sets up its machines.
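
For anyone hitting the same thing, one way to see what the host actually reports is to print the RAM and page file (swap) figures from inside the faceswap environment just before starting training. The snippet below is only a minimal sketch and is not part of Faceswap: `format_gib` and `memory_report` are made-up helper names, and it assumes `psutil` (which appears in the package list of the crash report) can be imported. If RAM looks plentiful but the swap figures are very small, that might be worth looking into on a cloud instance, though that is a guess rather than a diagnosis.

```python
"""Print how much RAM and page file (swap) the host currently reports."""

try:
    import psutil  # third-party; listed in the crash report's package list
except ImportError:
    psutil = None  # the sketch then reports that psutil is unavailable


def format_gib(num_bytes):
    """Format a byte count as GiB with two decimal places."""
    return f"{num_bytes / 1024 ** 3:.2f} GiB"


def memory_report():
    """Return RAM and swap figures in bytes, or None if psutil is missing."""
    if psutil is None:
        return None
    ram = psutil.virtual_memory()   # physical memory
    swap = psutil.swap_memory()     # page file on Windows, swap elsewhere
    return {
        "ram_total": ram.total,
        "ram_available": ram.available,
        "swap_total": swap.total,
        "swap_free": swap.free,
    }


if __name__ == "__main__":
    report = memory_report()
    if report is None:
        print("psutil is not installed in this environment")
    else:
        for name, value in report.items():
            print(f"{name:>14}: {format_gib(value)}")
```

Run it with the same Python interpreter that launches Faceswap so the numbers reflect what the training process would see.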

I do not have much experience with AWS, so I can't really help here. I have moved your post to cloud support in case anyone else can shed some light.

My word is final
