
Model training crashing on AWS - Memory Error

Posted: Tue Jun 22, 2021 7:58 pm
by alexsoares

Everything worked fine on my own laptop, and I was able to build some models. I am now using an instance on Amazon Web Services (AWS). I managed to install Faceswap, extract faces, sort them and fix the alignments using the Manual tool. Now I am trying to train the model, and the application crashes.

Can you help me please?

Below are the messages from the message box, followed by a copy of the crash report.


06/22/2021 19:46:42 INFO     Log level set to: INFO
06/22/2021 19:46:44 INFO     Model A Directory: 'C:\Users\Administrator\Documents\new project\new extract sorted' (5364 images)
06/22/2021 19:46:44 INFO     Model B Directory: 'C:\Users\Administrator\Documents\kt faces extract' (338 images)
06/22/2021 19:46:44 INFO     Training data directory: C:\Users\Administrator\Documents\new project\original model
06/22/2021 19:46:44 INFO     ===================================================
06/22/2021 19:46:44 INFO       Starting
06/22/2021 19:46:44 INFO       Press 'Stop' to save and quit
06/22/2021 19:46:44 INFO     ===================================================
06/22/2021 19:46:45 INFO     Loading data, this may take a while...
06/22/2021 19:46:45 INFO     Loading Model from Original plugin...
06/22/2021 19:46:46 INFO     No existing state file found. Generating.
06/22/2021 19:46:50 INFO     Loading Trainer from Original plugin...

06/22/2021 19:47:10 CRITICAL Error caught! Exiting...
06/22/2021 19:47:10 ERROR    Caught exception in thread: '_training_0'
06/22/2021 19:47:13 ERROR    Got Exception on main handler:
Traceback (most recent call last):
  File "C:\Users\Administrator\faceswap\lib\cli\launcher.py", line 182, in execute_script
    process.process()
  File "C:\Users\Administrator\faceswap\scripts\train.py", line 190, in process
    self._end_thread(thread, err)
  File "C:\Users\Administrator\faceswap\scripts\train.py", line 230, in _end_thread
    thread.join()
  File "C:\Users\Administrator\faceswap\lib\multithreading.py", line 121, in join
    raise thread.err[1].with_traceback(thread.err[2])
  File "C:\Users\Administrator\faceswap\lib\multithreading.py", line 37, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\Administrator\faceswap\scripts\train.py", line 252, in _training
    raise err
  File "C:\Users\Administrator\faceswap\scripts\train.py", line 242, in _training
    self._run_training_cycle(model, trainer)
  File "C:\Users\Administrator\faceswap\scripts\train.py", line 340, in _run_training_cycle
    model.save()
  File "C:\Users\Administrator\faceswap\plugins\train\model\_base.py", line 401, in save
    self._io._save()  # pylint:disable=protected-access
  File "C:\Users\Administrator\faceswap\plugins\train\model\_base.py", line 597, in _save
    self._plugin.model.save(self._filename, include_optimizer=False)
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1978, in save
    save.save_model(self, filepath, overwrite, include_optimizer, save_format,
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\keras\saving\save.py", line 130, in save_model
    hdf5_format.save_model_to_hdf5(
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\keras\saving\hdf5_format.py", line 119, in save_model_to_hdf5
    save_weights_to_hdf5_group(model_weights_group, model_layers)
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\keras\saving\hdf5_format.py", line 636, in save_weights_to_hdf5_group
    weight_values = K.batch_get_value(weights)
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\util\dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\keras\backend.py", line 3518, in batch_get_value
    return [x.numpy() for x in tensors]
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\keras\backend.py", line 3518, in <listcomp>
    return [x.numpy() for x in tensors]
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\ops\resource_variable_ops.py", line 608, in numpy
    return self.read_value().numpy()
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\framework\ops.py", line 1064, in numpy
    return maybe_arr.copy() if isinstance(maybe_arr, np.ndarray) else maybe_arr
MemoryError: Unable to allocate 72.0 MiB for an array with shape (3, 3, 1024, 2048) and data type float32
06/22/2021 19:47:13 CRITICAL An unexpected crash has occurred. Crash report written to 'C:\Users\Administrator\faceswap\crash_report.2021.06.22.194710466076.log'. You MUST provide this file if seeking assistance. Please verify you are running the latest version of faceswap before reporting
Process exited.
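
For reference, the 72.0 MiB in the MemoryError is simply the size of one float32 array of shape (3, 3, 1024, 2048). That shape is consistent with a Keras Conv2D kernel laid out as (kernel_height, kernel_width, input_channels, output_channels); which layer of the Original model it belongs to is an assumption, not something the log states. The traceback shows the failure happens in NumPy's `copy()` while `model.save()` reads the weights back to host memory. The short sketch below only checks the arithmetic in the error message:

```python
import math

# Shape and dtype copied from the MemoryError message in the log above.
shape = (3, 3, 1024, 2048)
bytes_per_float32 = 4  # float32 = 32 bits = 4 bytes

num_elements = math.prod(shape)            # total number of float32 values
size_bytes = num_elements * bytes_per_float32
size_mib = size_bytes / (1024 ** 2)        # 1 MiB = 1,048,576 bytes

print(f"{num_elements} elements, {size_bytes} bytes, {size_mib} MiB")
```

Running it prints `18874368 elements, 75497472 bytes, 72.0 MiB`, matching the error. A single 72 MiB request is small next to the 16 GB of RAM listed under System Information, which suggests (but does not prove) that the process had run short of usable host memory by the time it tried to save, rather than this one array being unusually large.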

CRASH REPORT:


06/22/2021 19:46:54 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BBFF5550>, weight: 2.0, mask_channel: 5)
06/22/2021 19:46:54 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 5
06/22/2021 19:46:54 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BBFF5D30>, weight: 1.0, mask_channel: 2)
06/22/2021 19:46:54 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 2
06/22/2021 19:46:54 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC01A550>, weight: 1.0, mask_channel: 3)
06/22/2021 19:46:54 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 3
06/22/2021 19:46:54 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC01AD30>, weight: 1.0, mask_channel: 3)
06/22/2021 19:46:54 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 3
06/22/2021 19:46:54 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC3F3250>, weight: 3.0, mask_channel: 4)
06/22/2021 19:46:54 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 4
06/22/2021 19:46:54 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC3F3850>, weight: 1.0, mask_channel: 1)
06/22/2021 19:46:54 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 1
06/22/2021 19:46:54 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC3D2580>, weight: 2.0, mask_channel: 5)
06/22/2021 19:46:54 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 5
06/22/2021 19:46:54 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC3D2D60>, weight: 1.0, mask_channel: 2)
06/22/2021 19:46:54 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 2
06/22/2021 19:46:57 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC4FB8E0>, weight: 1.0, mask_channel: 3)
06/22/2021 19:46:57 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 3
06/22/2021 19:46:57 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC02C910>, weight: 1.0, mask_channel: 3)
06/22/2021 19:46:57 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 3
06/22/2021 19:46:57 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC0103A0>, weight: 3.0, mask_channel: 4)
06/22/2021 19:46:57 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 4
06/22/2021 19:46:57 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC010D60>, weight: 1.0, mask_channel: 1)
06/22/2021 19:46:57 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 1
06/22/2021 19:46:57 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BBFF5550>, weight: 2.0, mask_channel: 5)
06/22/2021 19:46:57 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 5
06/22/2021 19:46:57 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BBFF5D30>, weight: 1.0, mask_channel: 2)
06/22/2021 19:46:57 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 2
06/22/2021 19:46:58 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC01A550>, weight: 1.0, mask_channel: 3)
06/22/2021 19:46:58 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 3
06/22/2021 19:46:58 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC01AD30>, weight: 1.0, mask_channel: 3)
06/22/2021 19:46:58 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 3
06/22/2021 19:46:58 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC3F3250>, weight: 3.0, mask_channel: 4)
06/22/2021 19:46:58 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 4
06/22/2021 19:46:58 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC3F3850>, weight: 1.0, mask_channel: 1)
06/22/2021 19:46:58 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 1
06/22/2021 19:46:58 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC3D2580>, weight: 2.0, mask_channel: 5)
06/22/2021 19:46:58 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 5
06/22/2021 19:46:58 MainProcess     _training_0                    tmpl99scmcr     if_body                        DEBUG    Processing loss function: (func: <tensorflow.python.keras.engine.compile_utils.LossesContainer object at 0x00000229BC3D2D60>, weight: 1.0, mask_channel: 2)
06/22/2021 19:46:58 MainProcess     _training_0                    losses_tf       _apply_mask                    DEBUG    Applying mask from channel 2
06/22/2021 19:47:07 MainProcess     _training_0                    _base           generate_preview               DEBUG    Generating preview
06/22/2021 19:47:07 MainProcess     _training_0                    _base           compile_sample                 DEBUG    Compiling samples: (side: 'a', samples: 14)
06/22/2021 19:47:07 MainProcess     _training_0                    _base           compile_sample                 DEBUG    Compiling samples: (side: 'b', samples: 14)
06/22/2021 19:47:07 MainProcess     _training_0                    _base           show_sample                    DEBUG    Showing sample
06/22/2021 19:47:07 MainProcess     _training_0                    _base           _get_predictions               DEBUG    Getting Predictions
06/22/2021 19:47:07 MainProcess     _run_1                         generator       cache_metadata                 DEBUG    All metadata already cached for: ['01451.png', '03624.png', '00100.png', '02930.png', '05609.png', '02460.png', '02356.png', '03213.png', '04926.png', '04229.png', '04143.png', '03712.png', '01126.png', '04755.png']
06/22/2021 19:47:07 MainProcess     _run_1                         generator       cache_metadata                 DEBUG    All metadata already cached for: ['33622072_0_0.png', '02270001_0_0.png', '_DSC4103_0_0.png', '12140065_0_0.png', '_DSC4302_0_0.png', '12140023_0_0.png', '_DSC5958_0_0.png', 'Copy (3) of DSC_0029_0_0.png', '_DSC4105_0_0.png', '12140004_0_0.png', '_DSC4259_0_0.png', '20_0_0.png', '04160001_0_0.png', '_DSC4096_0_0.png']
06/22/2021 19:47:08 MainProcess     _training_0                    _base           _get_predictions               DEBUG    Returning predictions: {'a_a': (14, 64, 64, 4), 'b_b': (14, 64, 64, 4), 'a_b': (14, 64, 64, 4), 'b_a': (14, 64, 64, 4)}
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _to_full_frame                 DEBUG    side: 'a', number of sample arrays: 3, prediction.shapes: [(14, 64, 64, 4), (14, 64, 64, 4)])
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _process_full                  DEBUG    full_size: 384, prediction_size: 64, color: (0, 0, 255)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _resize_sample                 DEBUG    Resizing sample: (side: 'a', sample.shape: (14, 384, 384, 3), target_size: 92, scale: 0.23958333333333334)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _resize_sample                 DEBUG    Resized sample: (side: 'a' shape: (14, 92, 92, 3))
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _process_full                  DEBUG    Overlayed background. Shape: (14, 92, 92, 3)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _compile_masked                DEBUG    masked shapes: [(14, 64, 64, 3), (14, 64, 64, 3), (14, 64, 64, 3)]
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _overlay_foreground            DEBUG    Overlayed foreground. Shape: (14, 92, 92, 3)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _overlay_foreground            DEBUG    Overlayed foreground. Shape: (14, 92, 92, 3)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _overlay_foreground            DEBUG    Overlayed foreground. Shape: (14, 92, 92, 3)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _get_headers                   DEBUG    side: 'a', width: 92
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _get_headers                   DEBUG    height: 20, total_width: 276
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _get_headers                   DEBUG    texts: ['Original (A)', 'Original > Original', 'Original > Swap'], text_sizes: [(52, 7), (84, 7), (73, 7)], text_x: [20, 96, 193], text_y: 13
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _get_headers                   DEBUG    header_box.shape: (20, 276, 3)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _to_full_frame                 DEBUG    side: 'b', number of sample arrays: 3, prediction.shapes: [(14, 64, 64, 4), (14, 64, 64, 4)])
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _process_full                  DEBUG    full_size: 384, prediction_size: 64, color: (0, 0, 255)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _resize_sample                 DEBUG    Resizing sample: (side: 'b', sample.shape: (14, 384, 384, 3), target_size: 92, scale: 0.23958333333333334)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _resize_sample                 DEBUG    Resized sample: (side: 'b' shape: (14, 92, 92, 3))
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _process_full                  DEBUG    Overlayed background. Shape: (14, 92, 92, 3)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _compile_masked                DEBUG    masked shapes: [(14, 64, 64, 3), (14, 64, 64, 3), (14, 64, 64, 3)]
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _overlay_foreground            DEBUG    Overlayed foreground. Shape: (14, 92, 92, 3)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _overlay_foreground            DEBUG    Overlayed foreground. Shape: (14, 92, 92, 3)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _overlay_foreground            DEBUG    Overlayed foreground. Shape: (14, 92, 92, 3)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _get_headers                   DEBUG    side: 'b', width: 92
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _get_headers                   DEBUG    height: 20, total_width: 276
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _get_headers                   DEBUG    texts: ['Swap (B)', 'Swap > Swap', 'Swap > Original'], text_sizes: [(43, 7), (63, 7), (73, 7)], text_x: [24, 106, 193], text_y: 13
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _get_headers                   DEBUG    header_box.shape: (20, 276, 3)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _duplicate_headers             DEBUG    side: a header.shape: (20, 276, 3)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _duplicate_headers             DEBUG    side: b header.shape: (20, 276, 3)
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _stack_images                  DEBUG    Stack images
06/22/2021 19:47:09 MainProcess     _training_0                    _base           get_transpose_axes             DEBUG    Even number of images to stack
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _stack_images                  DEBUG    Stacked images
06/22/2021 19:47:09 MainProcess     _training_0                    _base           show_sample                    DEBUG    Compiled sample
06/22/2021 19:47:09 MainProcess     _training_0                    train           _show                          DEBUG    Updating preview: (name: Training - 'S': Save Now. 'R': Refresh Preview. 'M': Toggle Mask. 'ENTER': Save and Quit)
06/22/2021 19:47:09 MainProcess     _training_0                    train           _show                          DEBUG    Generating preview for GUI
06/22/2021 19:47:09 MainProcess     _training_0                    train           _show                          DEBUG    Generated preview for GUI: '.gui_training_preview.jpg'
06/22/2021 19:47:09 MainProcess     _training_0                    train           _show                          DEBUG    Updated preview: (name: Training - 'S': Save Now. 'R': Refresh Preview. 'M': Toggle Mask. 'ENTER': Save and Quit)
06/22/2021 19:47:09 MainProcess     _training_0                    train           _run_training_cycle            DEBUG    Save Iteration: (iteration: 1
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _save                          DEBUG    Backing up and saving models
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _get_save_averages             DEBUG    Getting save averages
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _get_save_averages             DEBUG    Average losses since last save: [0.8635251820087433, 0.7475658655166626]
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _should_backup                 DEBUG    Set initial save iteration loss average for 'a': 0.8635251820087433
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _should_backup                 DEBUG    Set initial save iteration loss average for 'b': 0.7475658655166626
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _should_backup                 DEBUG    Updated lowest historical save iteration averages from: {'a': 0.8635251820087433, 'b': 0.7475658655166626} to: {'a': 0.8635251820087433, 'b': 0.7475658655166626}
06/22/2021 19:47:09 MainProcess     _training_0                    _base           _should_backup                 DEBUG    Should backup: True
06/22/2021 19:47:09 MainProcess     _training_0                    multithreading  run                            DEBUG    Error in thread (_training_0): Unable to allocate 72.0 MiB for an array with shape (3, 3, 1024, 2048) and data type float32
06/22/2021 19:47:10 MainProcess     MainThread                     train           _monitor                       DEBUG    Thread error detected
06/22/2021 19:47:10 MainProcess     MainThread                     train           _monitor                       DEBUG    Closed Monitor
06/22/2021 19:47:10 MainProcess     MainThread                     train           _end_thread                    DEBUG    Ending Training thread
06/22/2021 19:47:10 MainProcess     MainThread                     train           _end_thread                    CRITICAL Error caught! Exiting...
06/22/2021 19:47:10 MainProcess     MainThread                     multithreading  join                           DEBUG    Joining Threads: '_training'
06/22/2021 19:47:10 MainProcess     MainThread                     multithreading  join                           DEBUG    Joining Thread: '_training_0'
06/22/2021 19:47:10 MainProcess     MainThread                     multithreading  join                           ERROR    Caught exception in thread: '_training_0'
Traceback (most recent call last):
  File "C:\Users\Administrator\faceswap\lib\cli\launcher.py", line 182, in execute_script
    process.process()
  File "C:\Users\Administrator\faceswap\scripts\train.py", line 190, in process
    self._end_thread(thread, err)
  File "C:\Users\Administrator\faceswap\scripts\train.py", line 230, in _end_thread
    thread.join()
  File "C:\Users\Administrator\faceswap\lib\multithreading.py", line 121, in join
    raise thread.err[1].with_traceback(thread.err[2])
  File "C:\Users\Administrator\faceswap\lib\multithreading.py", line 37, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\Administrator\faceswap\scripts\train.py", line 252, in _training
    raise err
  File "C:\Users\Administrator\faceswap\scripts\train.py", line 242, in _training
    self._run_training_cycle(model, trainer)
  File "C:\Users\Administrator\faceswap\scripts\train.py", line 340, in _run_training_cycle
    model.save()
  File "C:\Users\Administrator\faceswap\plugins\train\model\_base.py", line 401, in save
    self._io._save()  # pylint:disable=protected-access
  File "C:\Users\Administrator\faceswap\plugins\train\model\_base.py", line 597, in _save
    self._plugin.model.save(self._filename, include_optimizer=False)
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1978, in save
    save.save_model(self, filepath, overwrite, include_optimizer, save_format,
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\keras\saving\save.py", line 130, in save_model
    hdf5_format.save_model_to_hdf5(
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\keras\saving\hdf5_format.py", line 119, in save_model_to_hdf5
    save_weights_to_hdf5_group(model_weights_group, model_layers)
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\keras\saving\hdf5_format.py", line 636, in save_weights_to_hdf5_group
    weight_values = K.batch_get_value(weights)
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\util\dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\keras\backend.py", line 3518, in batch_get_value
    return [x.numpy() for x in tensors]
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\keras\backend.py", line 3518, in <listcomp>
    return [x.numpy() for x in tensors]
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\ops\resource_variable_ops.py", line 608, in numpy
    return self.read_value().numpy()
  File "C:\Users\Administrator\MiniConda3\envs\faceswap\lib\site-packages\tensorflow\python\framework\ops.py", line 1064, in numpy
    return maybe_arr.copy() if isinstance(maybe_arr, np.ndarray) else maybe_arr
MemoryError: Unable to allocate 72.0 MiB for an array with shape (3, 3, 1024, 2048) and data type float32

============ System Information ============
encoding:            cp1252
git_branch:          master
git_commits:         55bb723 New Model: Phaze-A
gpu_cuda:            No global version found. Check Conda packages for Conda Cuda
gpu_cudnn:           No global version found. Check Conda packages for Conda cuDNN
gpu_devices:         GPU_0: Tesla T4
gpu_devices_active:  GPU_0
gpu_driver:          461.40
gpu_vram:            GPU_0: 15360MB
os_machine:          AMD64
os_platform:         Windows-10-10.0.17763-SP0
os_release:          10
py_command:          C:\Users\Administrator\faceswap\faceswap.py train -A C:/Users/Administrator/Documents/new project/new extract sorted -B C:/Users/Administrator/Documents/kt faces extract -m C:/Users/Administrator/Documents/new project/original model -t original -bs 4 -it 500000 -s 250 -ss 25000 -ps 100 -L INFO -gui
py_conda_version:    conda 4.10.1
py_implementation:   CPython
py_version:          3.8.10
py_virtual_env:      True
sys_cores:           4
sys_processor:       Intel64 Family 6 Model 85 Stepping 7, GenuineIntel
sys_ram:             Total: 16083MB, Available: 11639MB, Used: 4443MB, Free: 11639MB

=============== Pip Packages ===============
absl-py @ file:///C:/ci/absl-py_1615411229697/work
aiohttp @ file:///C:/ci/aiohttp_1614361024229/work
astor==0.8.1
astunparse==1.6.3
async-timeout==3.0.1
attrs @ file:///tmp/build/80754af9/attrs_1620827162558/work
blinker==1.4
brotlipy==0.7.0
cachetools @ file:///tmp/build/80754af9/cachetools_1619597386817/work
certifi==2021.5.30
cffi @ file:///C:/ci/cffi_1613247279197/work
chardet @ file:///C:/ci/chardet_1605303225733/work
click @ file:///tmp/build/80754af9/click_1621604852318/work
coverage @ file:///C:/ci/coverage_1614615074147/work
cryptography @ file:///C:/ci/cryptography_1616769344312/work
cycler==0.10.0
Cython @ file:///C:/ci/cython_1618435363327/work
fastcluster==1.1.26
ffmpy==0.2.3
gast @ file:///tmp/build/80754af9/gast_1597433534803/work
google-auth @ file:///tmp/build/80754af9/google-auth_1623354748502/work
google-auth-oauthlib @ file:///tmp/build/80754af9/google-auth-oauthlib_1617120569401/work
google-pasta==0.2.0
grpcio @ file:///C:/ci/grpcio_1614884412260/work
h5py==2.10.0
idna @ file:///home/linux1/recipes/ci/idna_1610986105248/work
imageio @ file:///tmp/build/80754af9/imageio_1617700267927/work
imageio-ffmpeg @ file:///home/conda/feedstock_root/build_artifacts/imageio-ffmpeg_1621542018480/work
importlib-metadata @ file:///C:/ci/importlib-metadata_1617877484576/work
joblib @ file:///tmp/build/80754af9/joblib_1613502643832/work
Keras-Applications @ file:///tmp/build/80754af9/keras-applications_1594366238411/work
Keras-Preprocessing @ file:///tmp/build/80754af9/keras-preprocessing_1612283640596/work
kiwisolver @ file:///C:/ci/kiwisolver_1612282606037/work
Markdown @ file:///C:/ci/markdown_1614364121613/work
matplotlib @ file:///C:/ci/matplotlib-base_1592837548929/work
mkl-fft==1.3.0
mkl-random==1.1.1
mkl-service==2.3.0
multidict @ file:///C:/ci/multidict_1607362065515/work
numpy @ file:///C:/ci/numpy_and_numpy_base_1603466732592/work
nvidia-ml-py3 @ git+https://github.com/deepfakes/nvidia-ml-py3.git@6fc29ac84b32bad877f078cb4a777c1548a00bf6
oauthlib==3.1.0
olefile==0.46
opencv-python==4.5.2.54
opt-einsum @ file:///tmp/build/80754af9/opt_einsum_1621500238896/work
pathlib==1.0.1
Pillow @ file:///C:/ci/pillow_1617386341487/work
protobuf==3.14.0
psutil @ file:///C:/ci/psutil_1612298324802/work
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser @ file:///tmp/build/80754af9/pycparser_1594388511720/work
PyJWT==1.7.1
pyOpenSSL @ file:///tmp/build/80754af9/pyopenssl_1608057966937/work
pyparsing @ file:///home/linux1/recipes/ci/pyparsing_1610983426697/work
pyreadline==2.1
PySocks @ file:///C:/ci/pysocks_1605287845585/work
python-dateutil @ file:///home/ktietz/src/ci/python-dateutil_1611928101742/work
pywin32==227
requests @ file:///tmp/build/80754af9/requests_1608241421344/work
requests-oauthlib==1.3.0
rsa @ file:///tmp/build/80754af9/rsa_1614366226499/work
scikit-learn @ file:///C:/ci/scikit-learn_1622739500535/work
scipy @ file:///C:/ci/scipy_1616703433439/work
sip==4.19.13
six @ file:///tmp/build/80754af9/six_1623709665295/work
tensorboard @ file:///home/builder/ktietz/aggregate/tensorflow_recipes/ci_te/tensorboard_1614593728657/work/tmp_pip_dir
tensorboard-plugin-wit==1.6.0
tensorflow==2.3.0
tensorflow-estimator @ file:///home/builder/ktietz/aggregate/tensorflow_recipes/ci_baze37/tensorflow-estimator_1622026529081/work/tensorflow_estimator-2.5.0-py2.py3-none-any.whl
termcolor==1.1.0
threadpoolctl @ file:///tmp/tmp9twdgx9k/threadpoolctl-2.1.0-py3-none-any.whl
tornado @ file:///C:/ci/tornado_1606942392901/work
tqdm @ file:///tmp/build/80754af9/tqdm_1615925068909/work
typing-extensions @ file:///tmp/build/80754af9/typing_extensions_1611751222202/work
urllib3 @ file:///tmp/build/80754af9/urllib3_1615837158687/work
Werkzeug @ file:///home/ktietz/src/ci/werkzeug_1611932622770/work
win-inet-pton @ file:///C:/ci/win_inet_pton_1605306167264/work
wincertstore==0.2
wrapt==1.12.1
yarl @ file:///C:/ci/yarl_1606940076464/work
zipp @ file:///tmp/build/80754af9/zipp_1615904174917/work

============== Conda Packages ==============
# packages in environment at C:\Users\Administrator\MiniConda3\envs\faceswap:
#
# Name                    Version                   Build  Channel
_tflow_select             2.3.0                       gpu  
absl-py                   0.12.0           py38haa95532_0  
aiohttp                   3.7.4            py38h2bbff1b_1  
astor                     0.8.1            py38haa95532_0  
astunparse                1.6.3                      py_0  
async-timeout             3.0.1            py38haa95532_0  
attrs                     21.2.0             pyhd3eb1b0_0  
blas                      1.0                         mkl  
blinker                   1.4              py38haa95532_0  
brotlipy                  0.7.0           py38h2bbff1b_1003  
ca-certificates           2021.5.25            haa95532_1  
cachetools                4.2.2              pyhd3eb1b0_0  
certifi                   2021.5.30        py38haa95532_0  
cffi                      1.14.5           py38hcd4344a_0  
chardet                   3.0.4           py38haa95532_1003  
click                     8.0.1              pyhd3eb1b0_0  
coverage                  5.5              py38h2bbff1b_2  
cryptography              3.4.7            py38h71e12ea_0  
cudatoolkit               10.1.243             h74a9793_0  
cudnn                     7.6.5                cuda10.1_0  
cycler                    0.10.0                   py38_0  
cython                    0.29.23          py38hd77b12b_0  
fastcluster               1.1.26           py38h251f6bf_2    conda-forge
ffmpeg                    4.3.1                ha925a31_0    conda-forge
ffmpy                     0.2.3                    pypi_0    pypi
freetype                  2.10.4               hd328e21_0  
gast                      0.4.0                      py_0  
git                       2.23.0               h6bb4b03_0  
google-auth               1.31.0             pyhd3eb1b0_0  
google-auth-oauthlib      0.4.4              pyhd3eb1b0_0  
google-pasta              0.2.0                      py_0  
grpcio                    1.36.1           py38hc60d5dd_1  
h5py                      2.10.0           py38h5e291fa_0  
hdf5                      1.10.4               h7ebc959_0  
icc_rt                    2019.0.0             h0cc432a_1  
icu                       58.2                 ha925a31_3  
idna                      2.10               pyhd3eb1b0_0  
imageio                   2.9.0              pyhd3eb1b0_0  
imageio-ffmpeg            0.4.4              pyhd8ed1ab_0    conda-forge
importlib-metadata        3.10.0           py38haa95532_0  
intel-openmp              2021.2.0           haa95532_616  
joblib                    1.0.1              pyhd3eb1b0_0  
jpeg                      9b                   hb83a4c4_2  
keras-applications        1.0.8                      py_1  
keras-preprocessing       1.1.2              pyhd3eb1b0_0  
kiwisolver                1.3.1            py38hd77b12b_0  
libpng                    1.6.37               h2a8f88b_0  
libprotobuf               3.14.0               h23ce68f_0  
libtiff                   4.2.0                hd0e1b90_0  
lz4-c                     1.9.3                h2bbff1b_0  
markdown                  3.3.4            py38haa95532_0  
matplotlib                3.2.2                         0  
matplotlib-base           3.2.2            py38h64f37c6_0  
mkl                       2020.2                      256  
mkl-service               2.3.0            py38h196d8e1_0  
mkl_fft                   1.3.0            py38h46781fe_0  
mkl_random                1.1.1            py38h47e9c7a_0  
multidict                 5.1.0            py38h2bbff1b_2  
numpy                     1.19.2           py38hadc3359_0  
numpy-base                1.19.2           py38ha3acd2a_0  
nvidia-ml-py3             7.352.1                  pypi_0    pypi
oauthlib                  3.1.0                      py_0  
olefile                   0.46                       py_0  
opencv-python             4.5.2.54                 pypi_0    pypi
openssl                   1.1.1k               h2bbff1b_0  
opt_einsum                3.3.0              pyhd3eb1b0_1  
pathlib                   1.0.1                      py_1  
pillow                    8.2.0            py38h4fa10fc_0  
pip                       21.1.2           py38haa95532_0  
protobuf                  3.14.0           py38hd77b12b_1  
psutil                    5.8.0            py38h2bbff1b_1  
pyasn1                    0.4.8                      py_0  
pyasn1-modules            0.2.8                      py_0  
pycparser                 2.20                       py_2  
pyjwt                     1.7.1                    py38_0  
pyopenssl                 20.0.1             pyhd3eb1b0_1  
pyparsing                 2.4.7              pyhd3eb1b0_0  
pyqt                      5.9.2            py38ha925a31_4  
pyreadline                2.1                      py38_1  
pysocks                   1.7.1            py38haa95532_0  
python                    3.8.10               hdbf39b2_7  
python-dateutil           2.8.1              pyhd3eb1b0_0  
python_abi                3.8                      1_cp38    conda-forge
pywin32                   227              py38he774522_1  
qt                        5.9.7            vc14h73c81de_0  
requests                  2.25.1             pyhd3eb1b0_0  
requests-oauthlib         1.3.0                      py_0  
rsa                       4.7.2              pyhd3eb1b0_1  
scikit-learn              0.24.2           py38hf11a4ad_1  
scipy                     1.6.2            py38h14eb087_0  
setuptools                52.0.0           py38haa95532_0  
sip                       4.19.13          py38ha925a31_0  
six                       1.16.0             pyhd3eb1b0_0  
sqlite                    3.35.4               h2bbff1b_0  
tensorboard               2.4.0              pyhc547734_0  
tensorboard-plugin-wit    1.6.0                      py_0  
tensorflow                2.3.0           mkl_py38h1fcfbd6_0  
tensorflow-base           2.3.0           gpu_py38h7339f5a_0  
tensorflow-estimator      2.5.0              pyh7b7c402_0  
tensorflow-gpu            2.3.0                he13fc11_0  
termcolor                 1.1.0            py38haa95532_1  
threadpoolctl             2.1.0              pyh5ca1d4c_0  
tk                        8.6.10               he774522_0  
tornado                   6.1              py38h2bbff1b_0  
tqdm                      4.59.0             pyhd3eb1b0_1  
typing-extensions         3.7.4.3              hd3eb1b0_0  
typing_extensions         3.7.4.3            pyh06a4308_0  
urllib3                   1.26.4             pyhd3eb1b0_0  
vc                        14.2                 h21ff451_1  
vs2015_runtime            14.27.29016          h5e58377_2  
werkzeug                  1.0.1              pyhd3eb1b0_0  
wheel                     0.36.2             pyhd3eb1b0_0  
win_inet_pton             1.1.0            py38haa95532_0  
wincertstore              0.2                      py38_0  
wrapt                     1.12.1           py38he774522_1  
xz                        5.2.5                h62dcd97_0  
yarl                      1.6.3            py38h2bbff1b_0  
zipp                      3.4.1              pyhd3eb1b0_0  
zlib                      1.2.11               h62dcd97_4  
zstd                      1.4.9                h19a0ad4_0  

================= Configs ==================
--------- .faceswap ---------
backend:                  nvidia

--------- convert.ini ---------

[color.color_transfer]
clip:                     True
preserve_paper:           True

[color.manual_balance]
colorspace:               HSV
balance_1:                0.0
balance_2:                0.0
balance_3:                0.0
contrast:                 0.0
brightness:               0.0

[color.match_hist]
threshold:                99.0

[mask.box_blend]
type:                     gaussian
distance:                 11.0
radius:                   5.0
passes:                   1

[mask.mask_blend]
type:                     normalized
kernel_size:              3
passes:                   4
threshold:                4
erosion:                  0.0

[scaling.sharpen]
method:                   none
amount:                   150
radius:                   0.3
threshold:                5.0

[writer.ffmpeg]
container:                mp4
codec:                    libx264
crf:                      23
preset:                   medium
tune:                     none
profile:                  auto
level:                    auto
skip_mux:                 False

[writer.gif]
fps:                      25
loop:                     0
palettesize:              256
subrectangles:            False

[writer.opencv]
format:                   png
draw_transparent:         False
jpg_quality:              75
png_compress_level:       3

[writer.pillow]
format:                   png
draw_transparent:         False
optimize:                 False
gif_interlace:            True
jpg_quality:              75
png_compress_level:       3
tif_compression:          tiff_deflate

--------- extract.ini ---------

[global]
allow_growth:             True

[align.fan]
batch-size:               12

[detect.cv2_dnn]
confidence:               50

[detect.mtcnn]
minsize:                  20
scalefactor:              0.709
batch-size:               8
threshold_1:              0.6
threshold_2:              0.7
threshold_3:              0.7

[detect.s3fd]
confidence:               70
batch-size:               4

[mask.bisenet_fp]
batch-size:               8
include_ears:             False
include_hair:             False
include_glasses:          True

[mask.unet_dfl]
batch-size:               8

[mask.vgg_clear]
batch-size:               6

[mask.vgg_obstructed]
batch-size:               2

--------- gui.ini ---------

[global]
fullscreen:               False
tab:                      extract
options_panel_width:      30
console_panel_height:     20
icon_size:                14
font:                     default
font_size:                9
autosave_last_session:    prompt
timeout:                  120
auto_load_model_stats:    True

--------- train.ini ---------

[global]
centering:                face
coverage:                 68.75
icnr_init:                False
conv_aware_init:          False
optimizer:                adam
learning_rate:            5e-05
epsilon_exponent:         -7
reflect_padding:          False
allow_growth:             False
mixed_precision:          False
nan_protection:           True
convert_batchsize:        16

[global.loss]
loss_function:            ssim
mask_loss_function:       mse
l2_reg_term:              100
eye_multiplier:           3
mouth_multiplier:         2
penalized_mask_loss:      True
mask_type:                extended
mask_blur_kernel:         3
mask_threshold:           4
learn_mask:               True

[model.dfaker]
output_size:              128

[model.dfl_h128]
lowmem:                   False

[model.dfl_sae]
input_size:               128
clipnorm:                 True
architecture:             df
autoencoder_dims:         0
encoder_dims:             42
decoder_dims:             21
multiscale_decoder:       False

[model.dlight]
features:                 best
details:                  good
output_size:              256

[model.original]
lowmem:                   True

[model.phaze_a]
output_size:              128
shared_fc:                None
enable_gblock:            True
split_fc:                 True
split_gblock:             False
split_decoders:           False
enc_architecture:         fs_original
enc_scaling:              40
enc_load_weights:         True
bottleneck_type:          dense
bottleneck_norm:          None
bottleneck_size:          1024
bottleneck_in_encoder:    True
fc_depth:                 1
fc_min_filters:           1024
fc_max_filters:           1024
fc_dimensions:            4
fc_filter_slope:          -0.5
fc_dropout:               0.0
fc_upsampler:             upsample2d
fc_upsamples:             1
fc_upsample_filters:      512
fc_gblock_depth:          3
fc_gblock_min_nodes:      512
fc_gblock_max_nodes:      512
fc_gblock_filter_slope:   -0.5
fc_gblock_dropout:        0.0
dec_upscale_method:       subpixel
dec_norm:                 None
dec_min_filters:          64
dec_max_filters:          512
dec_filter_slope:         -0.45
dec_res_blocks:           1
dec_output_kernel:        5
dec_gaussian:             True
dec_skip_last_residual:   True
freeze_layers:            keras_encoder
load_layers:              encoder
fs_original_depth:        4
fs_original_min_filters:  128
fs_original_max_filters:  1024
mobilenet_width:          1.0
mobilenet_depth:          1
mobilenet_dropout:        0.001

[model.realface]
input_size:               64
output_size:              128
dense_nodes:              1536
complexity_encoder:       128
complexity_decoder:       512

[model.unbalanced]
input_size:               128
lowmem:                   False
clipnorm:                 True
nodes:                    1024
complexity_encoder:       128
complexity_decoder_a:     384
complexity_decoder_b:     512

[model.villain]
lowmem:                   False

[trainer.original]
preview_images:           14
zoom_amount:              5
rotation_range:           10
shift_range:              5
flip_chance:              50
color_lightness:          30
color_ab:                 8
color_clahe_chance:       50
color_clahe_max_size:     4

Re: Model training crashing on AWS

Posted: Sat Jun 26, 2021 9:00 am
by torzdf

This is a memory error and is down to the host machine. There is clearly plenty of memory available, so I do not know why this error would occur. However, it may be related to how AWS sets up their machines.

I do not have much experience with AWS, so I can't really help here. I have moved your post to cloud support in case anyone else can shed some light.
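On Windows, an allocation can fail once the system commit limit (physical RAM plus page file) is reached, even if physical RAM still looks free. Whether that is what happened on this AWS instance is an assumption; the crash report does not confirm it. The sketch below reports RAM and swap/page-file sizes on the host. It uses the third-party `psutil` package, which appears in the environment listing above. The helper names and the 1 GiB low-swap threshold are hypothetical and chosen for illustration only. They are not part of Faceswap.

```python
"""Report host RAM and swap/page-file headroom before a training run.

Assumes the third-party psutil package. The helper names and the low-swap
threshold are illustrative assumptions, not Faceswap features.
"""
from typing import Dict

GIB = 1024 ** 3  # bytes per gibibyte


def summarize_memory(total: int, available: int, swap_total: int) -> Dict[str, float]:
    """Convert raw byte counts into GiB figures rounded to two decimals."""
    return {
        "ram_total_gib": round(total / GIB, 2),
        "ram_available_gib": round(available / GIB, 2),
        "swap_total_gib": round(swap_total / GIB, 2),
    }


def low_swap_warning(swap_total: int, threshold_gib: float = 1.0) -> bool:
    """Return True when swap/page file is absent or smaller than threshold_gib."""
    return swap_total < threshold_gib * GIB


def main() -> None:
    try:
        import psutil  # third-party; may not be installed everywhere
    except ImportError:
        print("psutil is not installed; cannot read host memory figures")
        return

    vm = psutil.virtual_memory()  # physical RAM statistics, in bytes
    sw = psutil.swap_memory()     # swap statistics (page file on Windows), in bytes
    for key, value in summarize_memory(vm.total, vm.available, sw.total).items():
        print(f"{key}: {value}")
    if low_swap_warning(sw.total):
        print("Warning: little or no swap/page file is configured on this host")


if __name__ == "__main__":
    main()
```

Run it from the same `faceswap` conda environment on the instance before starting training. If the reported swap total is zero or very small, enlarging the Windows page file may be worth trying. Treat this as a diagnostic step, not a confirmed fix.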