Training crashes when switching from NVIDIA to AMD



Post by jujuface

I've been training a DFL SAE model on AWS with a Tesla T4 16GB GPU because my measly R9 390X 8GB can't handle anything above a batch size of 8. My plan was to train the model to around 200K iterations on AWS, then copy it to my PC to continue fit training and convert various other clips at a lower batch size.

I've spent a fair amount on AWS to get the model to around 150K iterations so far, and decided to try training on my PC using copies of the AWS snapshots. I expected this to work, but the trainer crashes immediately no matter which snapshot I use. I even checked that all my training settings match the ones used on AWS.

My PC does just fine, however, if I use the same settings and training sets with a model that was never trained on AWS. Am I doing something wrong, or is moving a model from one GPU to another not supported? I also get a crash when trying to convert with the AWS model on my PC. I'm really hoping I won't have to spend more on AWS every time I want to use this model :(

Any help is appreciated.
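
A note for anyone landing here with the same crash: the `ValueError: Unknown layer: Functional` at the end of the log below points at a Keras version mismatch rather than at the GPU itself. `Functional` is the class name that newer TensorFlow-bundled Keras writes into a saved model's config, while the traceback shows the AMD setup loading the file through the standalone `keras` package, which evidently does not recognise that name. The sketch below is one way to check which class names a model file carries. It is not part of faceswap: the function names `find_class_names`, `saved_by_newer_keras` and `read_model_config` are made up for illustration, and `read_model_config` assumes `h5py` is installed and that the file keeps its config in the `model_config` attribute, as Keras HDF5 files normally do.

```python
import json


def find_class_names(config):
    """Collect every 'class_name' string found in a nested Keras model config."""
    names = []
    if isinstance(config, dict):
        class_name = config.get("class_name")
        if isinstance(class_name, str):
            names.append(class_name)
        for value in config.values():
            names.extend(find_class_names(value))
    elif isinstance(config, list):
        for item in config:
            names.extend(find_class_names(item))
    return names


def saved_by_newer_keras(model_config_json):
    """Return True if the config mentions the 'Functional' class name.

    Assumption: older standalone Keras serialises functional models under the
    class name 'Model', so seeing 'Functional' suggests a newer saver.
    """
    return "Functional" in find_class_names(json.loads(model_config_json))


def read_model_config(h5_path):
    """Read the raw model config JSON from a Keras .h5 file (needs h5py)."""
    import h5py  # imported here so the helpers above work without h5py

    with h5py.File(h5_path, "r") as handle:
        # May come back as str or bytes depending on the h5py version;
        # json.loads accepts both.
        return handle.attrs["model_config"]


if __name__ == "__main__":
    # Tiny hand-written config standing in for a real model file.
    sample = json.dumps({
        "class_name": "Functional",
        "config": {"layers": [{"class_name": "Conv2D", "config": {}}]},
    })
    print(find_class_names(json.loads(sample)))
    print(saved_by_newer_keras(sample))
```

If `saved_by_newer_keras(read_model_config(path_to_h5))` comes back `True` for the AWS snapshot and `False` for a model created locally, that would be consistent with the two machines saving in different Keras formats, which is a separate question from which GPU did the training.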


08/12/2021 20:02:03 MainProcess     _training_0                    config          add_item                       DEBUG    Add item: (section: 'model.realface', title: 'input_size', datatype: '<class 'int'>', default: '64', info: 'Resolution (in pixels) of the input image to train on.\nBE AWARE Larger resolution will dramatically increase VRAM requirements.\nHigher resolutions may increase prediction accuracy, but does not effect the resulting output size.\nMust be between 64 and 128 and be divisible by 16.', rounding: '16', min_max: (64, 128), choices: [], gui_radio: False, fixed: True, group: size)
08/12/2021 20:02:03 MainProcess     _training_0                    config          add_item                       DEBUG    Add item: (section: 'model.realface', title: 'output_size', datatype: '<class 'int'>', default: '128', info: 'Output image resolution (in pixels).\nBe aware that larger resolution will increase VRAM requirements.\nNB: Must be between 64 and 256 and be divisible by 16.', rounding: '16', min_max: (64, 256), choices: [], gui_radio: False, fixed: True, group: size)
08/12/2021 20:02:03 MainProcess     _training_0                    config          add_item                       DEBUG    Add item: (section: 'model.realface', title: 'dense_nodes', datatype: '<class 'int'>', default: '1536', info: 'Number of nodes for decoder. Might affect your model's ability to learn in general.\nNote that: Lower values will affect the ability to predict details.', rounding: '64', min_max: (768, 2048), choices: [], gui_radio: False, fixed: True, group: network)
08/12/2021 20:02:03 MainProcess     _training_0                    config          add_item                       DEBUG    Add item: (section: 'model.realface', title: 'complexity_encoder', datatype: '<class 'int'>', default: '128', info: 'Encoder Convolution Layer Complexity. sensible ranges: 128 to 150.', rounding: '4', min_max: (96, 160), choices: [], gui_radio: False, fixed: True, group: network)
08/12/2021 20:02:03 MainProcess     _training_0                    config          add_item                       DEBUG    Add item: (section: 'model.realface', title: 'complexity_decoder', datatype: '<class 'int'>', default: '512', info: 'Decoder Complexity.', rounding: '4', min_max: (512, 544), choices: [], gui_radio: False, fixed: True, group: network)
08/12/2021 20:02:03 MainProcess     _training_0                    config          _load_defaults_from_module     DEBUG    Added defaults: model.realface
08/12/2021 20:02:03 MainProcess     _training_0                    config          _load_defaults_from_module     DEBUG    Adding defaults: (filename: unbalanced_defaults.py, module_path: plugins.train.model, plugin_type: model
08/12/2021 20:02:03 MainProcess     _training_0                    config          _load_defaults_from_module     DEBUG    Importing defaults module: plugins.train.model.unbalanced_defaults
08/12/2021 20:02:03 MainProcess     _training_0                    config          add_section                    DEBUG    Add section: (title: 'model.unbalanced', info: 'An unbalanced model with adjustable input size options.\nThis is an unbalanced model so b>a swaps may not work well\n')
08/12/2021 20:02:03 MainProcess     _training_0                    config          add_item                       DEBUG    Add item: (section: 'model.unbalanced', title: 'input_size', datatype: '<class 'int'>', default: '128', info: 'Resolution (in pixels) of the image to train on.\nBE AWARE Larger resolution will dramatically increaseVRAM requirements.\nMake sure your resolution is divisible by 64 (e.g. 64, 128, 256 etc.).\nNB: Your faceset must be at least 1.6x larger than your required input size.\n(e.g. 160 is the maximum input size for a 256x256 faceset).', rounding: '64', min_max: (64, 512), choices: [], gui_radio: False, fixed: True, group: size)
08/12/2021 20:02:03 MainProcess     _training_0                    config          add_item                       DEBUG    Add item: (section: 'model.unbalanced', title: 'lowmem', datatype: '<class 'bool'>', default: 'False', info: 'Lower memory mode. Set to 'True' if having issues with VRAM useage.\nNB: Models with a changed lowmem mode are not compatible with each other.\nNB: lowmem will override cutom nodes and complexity settings.', rounding: 'None', min_max: None, choices: [], gui_radio: False, fixed: True, group: settings)
08/12/2021 20:02:03 MainProcess     _training_0                    config          add_item                       DEBUG    Add item: (section: 'model.unbalanced', title: 'clipnorm', datatype: '<class 'bool'>', default: 'True', info: 'Controls gradient clipping of the optimizer. Can prevent model corruption at the expense of VRAM.', rounding: 'None', min_max: None, choices: [], gui_radio: False, fixed: True, group: settings)
08/12/2021 20:02:03 MainProcess     _training_0                    config          add_item                       DEBUG    Add item: (section: 'model.unbalanced', title: 'nodes', datatype: '<class 'int'>', default: '1024', info: 'Number of nodes for decoder. Don't change this unless you know what you are doing!', rounding: '64', min_max: (512, 4096), choices: [], gui_radio: False, fixed: True, group: network)
08/12/2021 20:02:03 MainProcess     _training_0                    config          add_item                       DEBUG    Add item: (section: 'model.unbalanced', title: 'complexity_encoder', datatype: '<class 'int'>', default: '128', info: 'Encoder Convolution Layer Complexity. sensible ranges: 128 to 160.', rounding: '16', min_max: (64, 1024), choices: [], gui_radio: False, fixed: True, group: network)
08/12/2021 20:02:03 MainProcess     _training_0                    config          add_item                       DEBUG    Add item: (section: 'model.unbalanced', title: 'complexity_decoder_a', datatype: '<class 'int'>', default: '384', info: 'Decoder A Complexity.', rounding: '16', min_max: (64, 1024), choices: [], gui_radio: False, fixed: True, group: network)
08/12/2021 20:02:03 MainProcess     _training_0                    config          add_item                       DEBUG    Add item: (section: 'model.unbalanced', title: 'complexity_decoder_b', datatype: '<class 'int'>', default: '512', info: 'Decoder B Complexity.', rounding: '16', min_max: (64, 1024), choices: [], gui_radio: False, fixed: True, group: network)
08/12/2021 20:02:03 MainProcess     _training_0                    config          _load_defaults_from_module     DEBUG    Added defaults: model.unbalanced
08/12/2021 20:02:03 MainProcess     _training_0                    config          _load_defaults_from_module     DEBUG    Adding defaults: (filename: villain_defaults.py, module_path: plugins.train.model, plugin_type: model
08/12/2021 20:02:03 MainProcess     _training_0                    config          _load_defaults_from_module     DEBUG    Importing defaults module: plugins.train.model.villain_defaults
08/12/2021 20:02:03 MainProcess     _training_0                    config          add_section                    DEBUG    Add section: (title: 'model.villain', info: 'A Higher resolution version of the Original Model by VillainGuy.\nExtremely VRAM heavy. Don't try to run this if you have a small GPU.\n')
08/12/2021 20:02:03 MainProcess     _training_0                    config          add_item                       DEBUG    Add item: (section: 'model.villain', title: 'lowmem', datatype: '<class 'bool'>', default: 'False', info: 'Lower memory mode. Set to 'True' if having issues with VRAM useage.\nNB: Models with a changed lowmem mode are not compatible with each other.', rounding: 'None', min_max: None, choices: [], gui_radio: False, fixed: True, group: settings)
08/12/2021 20:02:03 MainProcess     _training_0                    config          _load_defaults_from_module     DEBUG    Added defaults: model.villain
08/12/2021 20:02:03 MainProcess     _training_0                    config          _load_defaults_from_module     DEBUG    Adding defaults: (filename: original_defaults.py, module_path: plugins.train.trainer, plugin_type: trainer
08/12/2021 20:02:03 MainProcess     _training_0                    config          _load_defaults_from_module     DEBUG    Importing defaults module: plugins.train.trainer.original_defaults
08/12/2021 20:02:03 MainProcess     _training_0                    config          add_section                    DEBUG    Add section: (title: 'trainer.original', info: 'Original Trainer Options.\nWARNING: The defaults for augmentation will be fine for 99.9% of use cases. Only change them if you absolutely know what you are doing!')
08/12/2021 20:02:03 MainProcess     _training_0                    config          add_item                       DEBUG    Add item: (section: 'trainer.original', title: 'preview_images', datatype: '<class 'int'>', default: '14', info: 'Number of sample faces to display for each side in the preview when training.', rounding: '2', min_max: (2, 16), choices: None, gui_radio: False, fixed: True, group: evaluation)
08/12/2021 20:02:03 MainProcess     _training_0                    config          add_item                       DEBUG    Add item: (section: 'trainer.original', title: 'zoom_amount', datatype: '<class 'int'>', default: '5', info: 'Percentage amount to randomly zoom each training image in and out.', rounding: '1', min_max: (0, 25), choices: None, gui_radio: False, fixed: True, group: image augmentation)
08/12/2021 20:02:03 MainProcess     _training_0                    config          add_item                       DEBUG    Add item: (section: 'trainer.original', title: 'rotation_range', datatype: '<class 'int'>', default: '10', info: 'Percentage amount to randomly rotate each training image.', rounding: '1', min_max: (0, 25), choices: None, gui_radio: False, fixed: True, group: image augmentation)
08/12/2021 20:02:03 MainProcess     _training_0                    config          add_item                       DEBUG    Add item: (section: 'trainer.original', title: 'shift_range', datatype: '<class 'int'>', default: '5', info: 'Percentage amount to randomly shift each training image horizontally and vertically.', rounding: '1', min_max: (0, 25), choices: None, gui_radio: False, fixed: True, group: image augmentation)
08/12/2021 20:02:03 MainProcess     _training_0                    config          add_item                       DEBUG    Add item: (section: 'trainer.original', title: 'flip_chance', datatype: '<class 'int'>', default: '50', info: 'Percentage chance to randomly flip each training image horizontally.\nNB: This is ignored if the 'no-flip' option is enabled', rounding: '1', min_max: (0, 75), choices: None, gui_radio: False, fixed: True, group: image augmentation)
08/12/2021 20:02:03 MainProcess     _training_0                    config          add_item                       DEBUG    Add item: (section: 'trainer.original', title: 'color_lightness', datatype: '<class 'int'>', default: '30', info: 'Percentage amount to randomly alter the lightness of each training image.\nNB: This is ignored if the 'no-flip' option is enabled', rounding: '1', min_max: (0, 75), choices: None, gui_radio: False, fixed: True, group: color augmentation)
08/12/2021 20:02:03 MainProcess     _training_0                    config          add_item                       DEBUG    Add item: (section: 'trainer.original', title: 'color_ab', datatype: '<class 'int'>', default: '8', info: 'Percentage amount to randomly alter the 'a' and 'b' colors of the L*a*b* color space of each training image.\nNB: This is ignored if the 'no-flip' option is enabled', rounding: '1', min_max: (0, 50), choices: None, gui_radio: False, fixed: True, group: color augmentation)
08/12/2021 20:02:03 MainProcess     _training_0                    config          add_item                       DEBUG    Add item: (section: 'trainer.original', title: 'color_clahe_chance', datatype: '<class 'int'>', default: '50', info: 'Percentage chance to perform Contrast Limited Adaptive Histogram Equalization on each training image.\nNB: This is ignored if the 'no-augment-color' option is enabled', rounding: '1', min_max: (0, 75), choices: None, gui_radio: False, fixed: False, group: color augmentation)
08/12/2021 20:02:03 MainProcess     _training_0                    config          add_item                       DEBUG    Add item: (section: 'trainer.original', title: 'color_clahe_max_size', datatype: '<class 'int'>', default: '4', info: 'The grid size dictates how much Contrast Limited Adaptive Histogram Equalization is performed on any training image selected for clahe. Contrast will be applied randomly with a gridsize of 0 up to the maximum. This value is a multiplier calculated from the training image size.\nNB: This is ignored if the 'no-augment-color' option is enabled', rounding: '1', min_max: (1, 8), choices: None, gui_radio: False, fixed: True, group: color augmentation)
08/12/2021 20:02:03 MainProcess     _training_0                    config          _load_defaults_from_module     DEBUG    Added defaults: trainer.original
08/12/2021 20:02:03 MainProcess     _training_0                    config          handle_config                  DEBUG    Handling config: (section: model.dfl_sae, configfile: '[hidden]\faceswap\config\train.ini')
08/12/2021 20:02:03 MainProcess     _training_0                    config          check_exists                   DEBUG    Config file exists: '[hidden]\faceswap\config\train.ini'
08/12/2021 20:02:03 MainProcess     _training_0                    config          load_config                    VERBOSE  Loading config: '[hidden]\faceswap\config\train.ini'
08/12/2021 20:02:03 MainProcess     _training_0                    config          validate_config                DEBUG    Validating config
08/12/2021 20:02:03 MainProcess     _training_0                    config          check_config_change            DEBUG    Default config has not changed
08/12/2021 20:02:03 MainProcess     _training_0                    config          check_config_choices           DEBUG    Checking config choices
08/12/2021 20:02:03 MainProcess     _training_0                    config          _parse_list                    DEBUG    Processed raw option 'keras_encoder' to list ['keras_encoder'] for section 'model.phaze_a', option 'freeze_layers'
08/12/2021 20:02:03 MainProcess     _training_0                    config          _parse_list                    DEBUG    Processed raw option 'encoder' to list ['encoder'] for section 'model.phaze_a', option 'load_layers'
08/12/2021 20:02:03 MainProcess     _training_0                    config          check_config_choices           DEBUG    Checked config choices
08/12/2021 20:02:03 MainProcess     _training_0                    config          validate_config                DEBUG    Validated config
08/12/2021 20:02:03 MainProcess     _training_0                    config          handle_config                  DEBUG    Handled config
08/12/2021 20:02:03 MainProcess     _training_0                    config          __init__                       DEBUG    Initialized: Config
08/12/2021 20:02:03 MainProcess     _training_0                    config          get                            DEBUG    Getting config item: (section: 'global', option: 'learning_rate')
08/12/2021 20:02:03 MainProcess     _training_0                    config          get                            DEBUG    Returning item: (type: <class 'float'>, value: 5e-05)
08/12/2021 20:02:03 MainProcess     _training_0                    config          get                            DEBUG    Getting config item: (section: 'global', option: 'epsilon_exponent')
08/12/2021 20:02:03 MainProcess     _training_0                    config          get                            DEBUG    Returning item: (type: <class 'int'>, value: -7)
08/12/2021 20:02:03 MainProcess     _training_0                    config          get                            DEBUG    Getting config item: (section: 'global', option: 'allow_growth')
08/12/2021 20:02:03 MainProcess     _training_0                    config          get                            DEBUG    Returning item: (type: <class 'bool'>, value: False)
08/12/2021 20:02:03 MainProcess     _training_0                    config          get                            DEBUG    Getting config item: (section: 'global', option: 'nan_protection')
08/12/2021 20:02:03 MainProcess     _training_0                    config          get                            DEBUG    Returning item: (type: <class 'bool'>, value: True)
08/12/2021 20:02:03 MainProcess     _training_0                    config          get                            DEBUG    Getting config item: (section: 'global', option: 'convert_batchsize')
08/12/2021 20:02:03 MainProcess     _training_0                    config          get                            DEBUG    Returning item: (type: <class 'int'>, value: 16)
08/12/2021 20:02:03 MainProcess     _training_0                    config          get                            DEBUG    Getting config item: (section: 'global.loss', option: 'eye_multiplier')
08/12/2021 20:02:03 MainProcess     _training_0                    config          get                            DEBUG    Returning item: (type: <class 'int'>, value: 3)
08/12/2021 20:02:03 MainProcess     _training_0                    config          get                            DEBUG    Getting config item: (section: 'global.loss', option: 'mouth_multiplier')
08/12/2021 20:02:03 MainProcess     _training_0                    config          get                            DEBUG    Returning item: (type: <class 'int'>, value: 2)
08/12/2021 20:02:03 MainProcess     _training_0                    config          get                            DEBUG    Getting config item: (section: 'model.dfl_sae', option: 'clipnorm')
08/12/2021 20:02:03 MainProcess     _training_0                    config          get                            DEBUG    Returning item: (type: <class 'bool'>, value: True)
08/12/2021 20:02:03 MainProcess     _training_0                    config          changeable_items               DEBUG    Alterable for existing models: {'learning_rate': 5e-05, 'epsilon_exponent': -7, 'allow_growth': False, 'nan_protection': True, 'convert_batchsize': 16, 'eye_multiplier': 3, 'mouth_multiplier': 2, 'clipnorm': True}
08/12/2021 20:02:03 MainProcess     _training_0                    _base           __init__                       DEBUG    Initializing State: (model_dir: 'R:\Apps\nSwap 3\Models\SAE A1_snapshot_30000_iters - Copy', model_name: 'dfl_sae', config_changeable_items: '{'learning_rate': 5e-05, 'epsilon_exponent': -7, 'allow_growth': False, 'nan_protection': True, 'convert_batchsize': 16, 'eye_multiplier': 3, 'mouth_multiplier': 2, 'clipnorm': True}', no_logs: False
08/12/2021 20:02:03 MainProcess     _training_0                    serializer      get_serializer                 DEBUG    <lib.serializer._JSONSerializer object at 0x000001D57CFB2850>
08/12/2021 20:02:03 MainProcess     _training_0                    _base           _load                          DEBUG    Loading State
08/12/2021 20:02:03 MainProcess     _training_0                    serializer      load                           DEBUG    filename: R:\Apps\nSwap 3\Models\SAE A1_snapshot_30000_iters - Copy\dfl_sae_state.json
08/12/2021 20:02:03 MainProcess     _training_0                    serializer      load                           DEBUG    stored data type: <class 'bytes'>
08/12/2021 20:02:03 MainProcess     _training_0                    serializer      unmarshal                      DEBUG    data type: <class 'bytes'>
08/12/2021 20:02:03 MainProcess     _training_0                    serializer      unmarshal                      DEBUG    returned data type: <class 'dict'>
08/12/2021 20:02:03 MainProcess     _training_0                    serializer      load                           DEBUG    data type: <class 'dict'>
08/12/2021 20:02:03 MainProcess     _training_0                    _base           _load                          DEBUG    Loaded state: {'name': 'dfl_sae', 'sessions': {'1': {'timestamp': 1628450756.6982133, 'no_logs': False, 'loss_names': ['total', 'face_a', 'face_b'], 'batchsize': 16, 'iterations': 297, 'config': {'learning_rate': 5e-05, 'epsilon_exponent': -7, 'allow_growth': False, 'nan_protection': True, 'convert_batchsize': 16, 'eye_multiplier': 3, 'mouth_multiplier': 2, 'clipnorm': True}}, '2': {'timestamp': 1628451474.4808395, 'no_logs': False, 'loss_names': ['total', 'face_a', 'face_b'], 'batchsize': 16, 'iterations': 1, 'config': {'learning_rate': 5e-05, 'epsilon_exponent': -7, 'allow_growth': False, 'nan_protection': True, 'convert_batchsize': 16, 'eye_multiplier': 3, 'mouth_multiplier': 2, 'clipnorm': True}}, '3': {'timestamp': 1628451547.4369736, 'no_logs': False, 'loss_names': ['total', 'face_a', 'face_b'], 'batchsize': 32, 'iterations': 1669, 'config': {'learning_rate': 5e-05, 'epsilon_exponent': -7, 'allow_growth': False, 'nan_protection': True, 'convert_batchsize': 16, 'eye_multiplier': 3, 'mouth_multiplier': 2, 'clipnorm': True}}, '4': {'timestamp': 1628480266.6708539, 'no_logs': False, 'loss_names': ['total', 'face_a', 'face_b'], 'batchsize': 32, 'iterations': 23106, 'config': {'learning_rate': 5e-05, 'epsilon_exponent': -7, 'allow_growth': False, 'nan_protection': True, 'convert_batchsize': 16, 'eye_multiplier': 3, 'mouth_multiplier': 2, 'clipnorm': True}}, '5': {'timestamp': 1628549852.7428732, 'no_logs': False, 'loss_names': ['total', 'face_a', 'face_b'], 'batchsize': 32, 'iterations': 4750, 'config': {'learning_rate': 5e-05, 'epsilon_exponent': -7, 'allow_growth': False, 'nan_protection': True, 'convert_batchsize': 16, 'eye_multiplier': 3, 'mouth_multiplier': 2, 'clipnorm': True}}}, 'lowest_avg_loss': {'a': 0.039280573606491086, 'b': 0.025626418843865396}, 'iterations': 29823, 'config': {'centering': 'face', 'coverage': 68.75, 'optimizer': 'adam', 'learning_rate': 5e-05, 'epsilon_exponent': -7, 'allow_growth': False, 'mixed_precision': False, 'nan_protection': True, 'convert_batchsize': 16, 'loss_function': 'ssim', 'mask_loss_function': 'mse', 'l2_reg_term': 100, 'eye_multiplier': 3, 'mouth_multiplier': 2, 'penalized_mask_loss': True, 'mask_type': 'extended', 'mask_blur_kernel': 3, 'mask_threshold': 4, 'learn_mask': False, 'input_size': 128, 'clipnorm': True, 'architecture': 'df', 'autoencoder_dims': 0, 'encoder_dims': 42, 'decoder_dims': 21, 'multiscale_decoder': False}}
08/12/2021 20:02:03 MainProcess     _training_0                    _base           _update_legacy_config          DEBUG    Checking for legacy state file update
08/12/2021 20:02:03 MainProcess     _training_0                    _base           _update_legacy_config          DEBUG    Legacy item 'dssim_loss' not in config. Skipping update
08/12/2021 20:02:03 MainProcess     _training_0                    _base           _update_legacy_config          DEBUG    State file updated for legacy config: False
08/12/2021 20:02:03 MainProcess     _training_0                    _base           _replace_config                DEBUG    Replacing config. Old config: {'centering': 'face', 'coverage': 68.75, 'optimizer': 'adam', 'learning_rate': 5e-05, 'epsilon_exponent': -7, 'allow_growth': False, 'mixed_precision': False, 'nan_protection': True, 'convert_batchsize': 16, 'loss_function': 'ssim', 'mask_loss_function': 'mse', 'l2_reg_term': 100, 'eye_multiplier': 3, 'mouth_multiplier': 2, 'penalized_mask_loss': True, 'mask_type': 'extended', 'mask_blur_kernel': 3, 'mask_threshold': 4, 'learn_mask': False, 'input_size': 128, 'clipnorm': True, 'architecture': 'df', 'autoencoder_dims': 0, 'encoder_dims': 42, 'decoder_dims': 21, 'multiscale_decoder': False}
08/12/2021 20:02:03 MainProcess     _training_0                    _base           _replace_config                DEBUG    Replaced config. New config: {'centering': 'face', 'coverage': 68.75, 'optimizer': 'adam', 'learning_rate': 5e-05, 'epsilon_exponent': -7, 'allow_growth': False, 'mixed_precision': False, 'nan_protection': True, 'convert_batchsize': 16, 'loss_function': 'ssim', 'mask_loss_function': 'mse', 'l2_reg_term': 100, 'eye_multiplier': 3, 'mouth_multiplier': 2, 'penalized_mask_loss': True, 'mask_type': 'extended', 'mask_blur_kernel': 3, 'mask_threshold': 4, 'learn_mask': False, 'input_size': 128, 'clipnorm': True, 'architecture': 'df', 'autoencoder_dims': 0, 'encoder_dims': 42, 'decoder_dims': 21, 'multiscale_decoder': False}
08/12/2021 20:02:03 MainProcess     _training_0                    _base           _replace_config                INFO     Using configuration saved in state file
08/12/2021 20:02:03 MainProcess     _training_0                    _base           _new_session_id                DEBUG    6
08/12/2021 20:02:03 MainProcess     _training_0                    _base           _create_new_session            DEBUG    Creating new session. id: 6
08/12/2021 20:02:03 MainProcess     _training_0                    _base           __init__                       DEBUG    Initialized State:
08/12/2021 20:02:03 MainProcess     _training_0                    _base           __init__                       DEBUG    Initializing _Settings: (arguments: Namespace(batch_size=8, colab=False, configfile=None, distributed=False, exclude_gpus=None, freeze_weights=False, func=<bound method ScriptExecutor.execute_script of <lib.cli.launcher.ScriptExecutor object at 0x000001D565AF5EB0>>, input_a='R:\\Apps\\nSwap 3\\FaceA Training Set\\AC All Faces', input_b='R:\\Apps\\nSwap2\\FaceB Training Set\\All FaceB Combined Training Set', iterations=1000000, load_weights=None, logfile=None, loglevel='INFO', model_dir='R:\\Apps\\nSwap 3\\Models\\SAE A1_snapshot_30000_iters - Copy', no_augment_color=False, no_flip=False, no_logs=False, no_warp=False, preview=False, preview_scale=100, redirect_gui=True, save_interval=250, snapshot_interval=15000, summary=False, timelapse_input_a=None, timelapse_input_b=None, timelapse_output=None, trainer='dfl-sae', warp_to_landmarks=False, write_image=False), mixed_precision: False, allow_growth: False, is_predict: False)
08/12/2021 20:02:03 MainProcess     _training_0                    _base           _set_keras_mixed_precision     DEBUG    use_mixed_precision: False, exclude_gpus: False
08/12/2021 20:02:03 MainProcess     _training_0                    _base           _set_keras_mixed_precision     DEBUG    Not enabling 'mixed_precision' (backend: amd, use_mixed_precision: False)
08/12/2021 20:02:03 MainProcess     _training_0                    _base           _get_strategy                  DEBUG    Using strategy: None
08/12/2021 20:02:03 MainProcess     _training_0                    _base           __init__                       DEBUG    Initialized _Settings
08/12/2021 20:02:03 MainProcess     _training_0                    _base           __init__                       DEBUG    Initializing _Loss
08/12/2021 20:02:03 MainProcess     _training_0                    _base           __init__                       DEBUG    Initialized: _Loss
08/12/2021 20:02:03 MainProcess     _training_0                    _base           __init__                       DEBUG    Initialized ModelBase (Model)
08/12/2021 20:02:03 MainProcess     _training_0                    _base           strategy_scope                 DEBUG    Using strategy scope: <contextlib.nullcontext object at 0x000001D57CFEB1C0>
08/12/2021 20:02:03 MainProcess     _training_0                    _base           _load                          DEBUG    Loading model: R:\Apps\nSwap 3\Models\SAE A1_snapshot_30000_iters - Copy\dfl_sae.h5
08/12/2021 20:02:03 MainProcess     _training_0                    multithreading  run                            DEBUG    Error in thread (_training_0): Unknown layer: Functional
08/12/2021 20:02:04 MainProcess     MainThread                     train           _monitor                       DEBUG    Thread error detected
08/12/2021 20:02:04 MainProcess     MainThread                     train           _monitor                       DEBUG    Closed Monitor
08/12/2021 20:02:04 MainProcess     MainThread                     train           _end_thread                    DEBUG    Ending Training thread
08/12/2021 20:02:04 MainProcess     MainThread                     train           _end_thread                    CRITICAL Error caught! Exiting...
08/12/2021 20:02:04 MainProcess     MainThread                     multithreading  join                           DEBUG    Joining Threads: '_training'
08/12/2021 20:02:04 MainProcess     MainThread                     multithreading  join                           DEBUG    Joining Thread: '_training_0'
08/12/2021 20:02:04 MainProcess     MainThread                     multithreading  join                           ERROR    Caught exception in thread: '_training_0'
Traceback (most recent call last):
  File "[hidden]\faceswap\lib\cli\launcher.py", line 182, in execute_script
    process.process()
  File "[hidden]\faceswap\scripts\train.py", line 190, in process
    self._end_thread(thread, err)
  File "[hidden]\faceswap\scripts\train.py", line 230, in _end_thread
    thread.join()
  File "[hidden]\faceswap\lib\multithreading.py", line 121, in join
    raise thread.err[1].with_traceback(thread.err[2])
  File "[hidden]\faceswap\lib\multithreading.py", line 37, in run
    self._target(*self._args, **self._kwargs)
  File "[hidden]\faceswap\scripts\train.py", line 252, in _training
    raise err
  File "[hidden]\faceswap\scripts\train.py", line 240, in _training
    model = self._load_model()
  File "[hidden]\faceswap\scripts\train.py", line 268, in _load_model
    model.build()
  File "[hidden]\faceswap\plugins\train\model\_base.py", line 286, in build
    model = self._io._load()  # pylint:disable=protected-access
  File "[hidden]\faceswap\plugins\train\model\_base.py", line 556, in _load
    model = load_model(self._filename, compile=False)
  File "[hidden]\MiniConda3\envs\faceswap\lib\site-packages\keras\engine\saving.py", line 419, in load_model
    model = _deserialize_model(f, custom_objects, compile)
  File "[hidden]\MiniConda3\envs\faceswap\lib\site-packages\keras\engine\saving.py", line 225, in _deserialize_model
    model = model_from_config(model_config, custom_objects=custom_objects)
  File "[hidden]\MiniConda3\envs\faceswap\lib\site-packages\keras\engine\saving.py", line 458, in model_from_config
    return deserialize(config, custom_objects=custom_objects)
  File "[hidden]\MiniConda3\envs\faceswap\lib\site-packages\keras\layers\__init__.py", line 52, in deserialize
    return deserialize_keras_object(config,
  File "[hidden]\MiniConda3\envs\faceswap\lib\site-packages\keras\utils\generic_utils.py", line 137, in deserialize_keras_object
    raise ValueError('Unknown ' + printable_module_name +
ValueError: Unknown layer: Functional

============ System Information ============
encoding:            cp1252
git_branch:          Not Found
git_commits:         Not Found
gpu_cuda:            No global version found. Check Conda packages for Conda Cuda
gpu_cudnn:           No global version found. Check Conda packages for Conda cuDNN
gpu_devices:         GPU_0: Advanced Micro Devices, Inc. - Hawaii (experimental)
gpu_devices_active:  GPU_0
gpu_driver:          ['3240.6']
gpu_vram:            GPU_0: 8192MB
os_machine:          AMD64
os_platform:         Windows-10-10.0.19042-SP0
os_release:          10
py_command:          [hidden]\faceswap.py train -A R:/Apps/nSwap 3/FaceA Training Set/AC All Faces -B R:/Apps/nSwap2/FaceB Training Set/All FaceB Combined Training Set -m R:/Apps/nSwap 3/Models/SAE A1_snapshot_30000_iters - Copy -t dfl-sae -bs 8 -it 1000000 -s 250 -ss 15000 -ps 100 -L INFO -gui
py_conda_version:    conda 4.10.3
py_implementation:   CPython
py_version:          3.8.10
py_virtual_env:      True
sys_cores:           8
sys_processor:       Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
sys_ram:             Total: 16311MB, Available: 9451MB, Used: 6859MB, Free: 9451MB

=============== Pip Packages ===============
absl-py==0.13.0
astunparse==1.6.3
cachetools==4.2.2
certifi==2021.5.30
cffi==1.14.6
charset-normalizer==2.0.3
cycler==0.10.0
enum34==1.1.10
fastcluster==1.1.26
ffmpy==0.2.3
gast==0.3.3
google-auth==1.33.1
google-auth-oauthlib==0.4.4
google-pasta==0.2.0
grpcio==1.39.0
h5py==2.10.0
idna==3.2
imageio @ file:///tmp/build/80754af9/imageio_1617700267927/work
imageio-ffmpeg @ file:///home/conda/feedstock_root/build_artifacts/imageio-ffmpeg_1621542018480/work
joblib @ file:///tmp/build/80754af9/joblib_1613502643832/work
Keras==2.2.4
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.2
kiwisolver @ file:///C:/ci/kiwisolver_1612282606037/work
Markdown==3.3.4
matplotlib @ file:///C:/ci/matplotlib-base_1592837548929/work
mkl-fft==1.3.0
mkl-random==1.1.1
mkl-service==2.3.0
numpy==1.18.5
nvidia-ml-py3 @ git+https://github.com/deepfakes/nvidia-ml-py3.git@6fc29ac84b32bad877f078cb4a777c1548a00bf6
oauthlib==3.1.1
olefile==0.46
opencv-python==4.5.3.56
opt-einsum==3.3.0
pathlib==1.0.1
Pillow @ file:///C:/ci/pillow_1625663293593/work
plaidml==0.7.0
plaidml-keras==0.7.0
protobuf==3.17.3
psutil @ file:///C:/ci/psutil_1612298324802/work
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.20
pyparsing @ file:///home/linux1/recipes/ci/pyparsing_1610983426697/work
python-dateutil @ file:///tmp/build/80754af9/python-dateutil_1626374649649/work
pywin32==227
PyYAML==5.4.1
requests==2.26.0
requests-oauthlib==1.3.0
rsa==4.7.2
scikit-learn @ file:///C:/ci/scikit-learn_1622739500535/work
scipy @ file:///C:/ci/scipy_1616703433439/work
sip==4.19.13
six @ file:///tmp/build/80754af9/six_1623709665295/work
tensorboard==2.2.2
tensorboard-plugin-wit==1.8.0
tensorflow==2.2.3
tensorflow-estimator==2.2.0
termcolor==1.1.0
threadpoolctl @ file:///tmp/build/80754af9/threadpoolctl_1626115094421/work
tornado @ file:///C:/ci/tornado_1606942392901/work
tqdm @ file:///tmp/build/80754af9/tqdm_1625563689033/work
urllib3==1.26.6
Werkzeug==2.0.1
wincertstore==0.2
wrapt==1.12.1

============== Conda Packages ==============
# packages in environment at [hidden]\MiniConda3\envs\faceswap:
#
# Name                    Version                   Build  Channel
absl-py                   0.13.0                   pypi_0    pypi
astunparse                1.6.3                    pypi_0    pypi
blas                      1.0                         mkl  
ca-certificates           2021.7.5                 haa95532_1
cachetools                4.2.2                    pypi_0    pypi
certifi                   2021.5.30                py38haa95532_0
cffi                      1.14.6                   pypi_0    pypi
charset-normalizer        2.0.3                    pypi_0    pypi
cycler                    0.10.0                   py38_0
enum34                    1.1.10                   pypi_0    pypi
fastcluster               1.1.26                   py38h251f6bf_2    conda-forge
ffmpeg                    4.3.1                    ha925a31_0    conda-forge
ffmpy                     0.2.3                    pypi_0    pypi
freetype                  2.10.4                   hd328e21_0
gast                      0.3.3                    pypi_0    pypi
git                       2.23.0                   h6bb4b03_0
google-auth               1.33.1                   pypi_0    pypi
google-auth-oauthlib      0.4.4                    pypi_0    pypi
google-pasta              0.2.0                    pypi_0    pypi
grpcio                    1.39.0                   pypi_0    pypi
h5py                      2.10.0                   pypi_0    pypi
icc_rt                    2019.0.0                 h0cc432a_1
icu                       58.2                     ha925a31_3
idna                      3.2                      pypi_0    pypi
imageio                   2.9.0                    pyhd3eb1b0_0
imageio-ffmpeg            0.4.4                    pyhd8ed1ab_0    conda-forge
intel-openmp              2021.3.0                 haa95532_3372
joblib                    1.0.1                    pyhd3eb1b0_0
jpeg                      9b                       hb83a4c4_2
keras                     2.2.4                    pypi_0    pypi
keras-applications        1.0.8                    pypi_0    pypi
keras-preprocessing       1.1.2                    pypi_0    pypi
kiwisolver                1.3.1                    py38hd77b12b_0
libpng                    1.6.37                   h2a8f88b_0
libtiff                   4.2.0                    hd0e1b90_0
lz4-c                     1.9.3                    h2bbff1b_0
markdown                  3.3.4                    pypi_0    pypi
matplotlib                3.2.2                    0
matplotlib-base           3.2.2                    py38h64f37c6_0
mkl                       2020.2                   256
mkl-service               2.3.0                    py38h196d8e1_0
mkl_fft                   1.3.0                    py38h46781fe_0
mkl_random                1.1.1                    py38h47e9c7a_0
numpy                     1.18.5                   pypi_0    pypi
nvidia-ml-py3             7.352.1                  pypi_0    pypi
oauthlib                  3.1.1                    pypi_0    pypi
olefile                   0.46                     py_0
opencv-python             4.5.3.56                 pypi_0    pypi
openssl                   1.1.1k                   h2bbff1b_0
opt-einsum                3.3.0                    pypi_0    pypi
pathlib                   1.0.1                    py_1
pillow                    8.3.1                    py38h4fa10fc_0
pip                       21.1.3                   py38haa95532_0
plaidml                   0.7.0                    pypi_0    pypi
plaidml-keras             0.7.0                    pypi_0    pypi
protobuf                  3.17.3                   pypi_0    pypi
psutil                    5.8.0                    py38h2bbff1b_1
pyasn1                    0.4.8                    pypi_0    pypi
pyasn1-modules            0.2.8                    pypi_0    pypi
pycparser                 2.20                     pypi_0    pypi
pyparsing                 2.4.7                    pyhd3eb1b0_0
pyqt                      5.9.2                    py38ha925a31_4
python                    3.8.10                   hdbf39b2_7
python-dateutil           2.8.2                    pyhd3eb1b0_0
python_abi                3.8                      2_cp38    conda-forge
pywin32                   227                      py38he774522_1
pyyaml                    5.4.1                    pypi_0    pypi
qt                        5.9.7                    vc14h73c81de_0
requests                  2.26.0                   pypi_0    pypi
requests-oauthlib         1.3.0                    pypi_0    pypi
rsa                       4.7.2                    pypi_0    pypi
scikit-learn              0.24.2                   py38hf11a4ad_1
scipy                     1.6.2                    py38h14eb087_0
setuptools                52.0.0                   py38haa95532_0
sip                       4.19.13                  py38ha925a31_0
six                       1.16.0                   pyhd3eb1b0_0
sqlite                    3.36.0                   h2bbff1b_0
tensorboard               2.2.2                    pypi_0    pypi
tensorboard-plugin-wit    1.8.0                    pypi_0    pypi
tensorflow                2.2.3                    pypi_0    pypi
tensorflow-estimator      2.2.0                    pypi_0    pypi
termcolor                 1.1.0                    pypi_0    pypi
threadpoolctl             2.2.0                    pyhb85f177_0
tk                        8.6.10                   he774522_0
tornado                   6.1                      py38h2bbff1b_0
tqdm                      4.61.2                   pyhd3eb1b0_1
urllib3                   1.26.6                   pypi_0    pypi
vc                        14.2                     h21ff451_1
vs2015_runtime            14.27.29016              h5e58377_2
werkzeug                  2.0.1                    pypi_0    pypi
wheel                     0.36.2                   pyhd3eb1b0_0
wincertstore              0.2                      py38_0
wrapt                     1.12.1                   pypi_0    pypi
xz                        5.2.5                    h62dcd97_0
zlib                      1.2.11                   h62dcd97_4
zstd                      1.4.9                    h19a0ad4_0

================= Configs ==================
--------- .faceswap ---------
backend: amd

--------- convert.ini ---------

[color.color_transfer]
clip: True
preserve_paper: True

[color.manual_balance]
colorspace: HSV
balance_1: 0.0
balance_2: 0.0
balance_3: 0.0
contrast: 0.0
brightness: 0.0

[color.match_hist]
threshold: 99.0

[mask.box_blend]
type: gaussian
distance: 11.0
radius: 5.0
passes: 1

[mask.mask_blend]
type: gaussian
kernel_size: 3
passes: 4
threshold: 4
erosion: 0.0

[scaling.sharpen]
method: gaussian
amount: 150
radius: 0.3
threshold: 5.0

[writer.ffmpeg]
container: mp4
codec: libx264
crf: 23
preset: medium
tune: none
profile: auto
level: auto
skip_mux: False

[writer.gif]
fps: 25
loop: 0
palettesize: 256
subrectangles: False

[writer.opencv]
format: png
draw_transparent: False
jpg_quality: 75
png_compress_level: 3

[writer.pillow]
format: png
draw_transparent: False
optimize: False
gif_interlace: True
jpg_quality: 75
png_compress_level: 3
tif_compression: tiff_deflate

--------- extract.ini ---------

[global]
allow_growth: False

[align.fan]
batch-size: 12

[detect.cv2_dnn]
confidence: 50

[detect.mtcnn]
minsize: 20
scalefactor: 0.709
batch-size: 8
threshold_1: 0.6
threshold_2: 0.7
threshold_3: 0.7

[detect.s3fd]
confidence: 70
batch-size: 4

[mask.bisenet_fp]
batch-size: 8
include_ears: False
include_hair: False
include_glasses: True

[mask.unet_dfl]
batch-size: 8

[mask.vgg_clear]
batch-size: 6

[mask.vgg_obstructed]
batch-size: 2

--------- gui.ini ---------

[global]
fullscreen: False
tab: extract
options_panel_width: 30
console_panel_height: 20
icon_size: 14
font: default
font_size: 12
autosave_last_session: prompt
timeout: 120
auto_load_model_stats: True

--------- train.ini ---------

[global]
centering: face
coverage: 68.75
icnr_init: False
conv_aware_init: False
optimizer: adam
learning_rate: 5e-05
epsilon_exponent: -7
reflect_padding: False
allow_growth: False
mixed_precision: False
nan_protection: True
convert_batchsize: 16

[global.loss]
loss_function: ssim
mask_loss_function: mse
l2_reg_term: 100
eye_multiplier: 3
mouth_multiplier: 2
penalized_mask_loss: True
mask_type: extended
mask_blur_kernel: 3
mask_threshold: 4
learn_mask: False

[model.dfaker]
output_size: 128

[model.dfl_h128]
lowmem: False

[model.dfl_sae]
input_size: 128
clipnorm: True
architecture: df
autoencoder_dims: 0
encoder_dims: 42
decoder_dims: 21
multiscale_decoder: False

[model.dlight]
features: best
details: good
output_size: 256

[model.original]
lowmem: False

[model.phaze_a]
output_size: 128
shared_fc: None
enable_gblock: True
split_fc: True
split_gblock: False
split_decoders: False
enc_architecture: fs_original
enc_scaling: 40
enc_load_weights: True
bottleneck_type: dense
bottleneck_norm: None
bottleneck_size: 1024
bottleneck_in_encoder: True
fc_depth: 1
fc_min_filters: 1024
fc_max_filters: 1024
fc_dimensions: 4
fc_filter_slope: -0.5
fc_dropout: 0.0
fc_upsampler: upsample2d
fc_upsamples: 1
fc_upsample_filters: 512
fc_gblock_depth: 3
fc_gblock_min_nodes: 512
fc_gblock_max_nodes: 512
fc_gblock_filter_slope: -0.5
fc_gblock_dropout: 0.0
dec_upscale_method: subpixel
dec_norm: None
dec_min_filters: 64
dec_max_filters: 512
dec_filter_slope: -0.45
dec_res_blocks: 1
dec_output_kernel: 5
dec_gaussian: True
dec_skip_last_residual: True
freeze_layers: keras_encoder
load_layers: encoder
fs_original_depth: 4
fs_original_min_filters: 128
fs_original_max_filters: 1024
mobilenet_width: 1.0
mobilenet_depth: 1
mobilenet_dropout: 0.001

[model.realface]
input_size: 64
output_size: 128
dense_nodes: 1536
complexity_encoder: 128
complexity_decoder: 512

[model.unbalanced]
input_size: 128
lowmem: False
clipnorm: True
nodes: 1024
complexity_encoder: 128
complexity_decoder_a: 384
complexity_decoder_b: 512

[model.villain]
lowmem: False

[trainer.original]
preview_images: 14
zoom_amount: 5
rotation_range: 10
shift_range: 5
flip_chance: 50
color_lightness: 30
color_ab: 8
color_clahe_chance: 50
color_clahe_max_size: 4

by torzdf » Fri Aug 13, 2021 10:02 am

Unfortunately, it is not possible to move a model across manufacturers this way. We have to use a totally different backend to train models on AMD machines, and there is literally nothing we can do about this if we want to keep AMD support.
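To see the backend mismatch for yourself, you can inspect the metadata that Keras stores in a saved model file. The sketch below is only an illustration, not part of Faceswap: it assumes the snapshot is a Keras HDF5 file whose root attributes include `keras_version`, `backend` and `model_config`, and the function names are made up for this example. As far as I know, a top-level class of `Functional` is written by newer tf.keras releases, which standalone Keras 2.2.4 (the version in the pip list above) does not recognise, and that would match the `ValueError: Unknown layer: Functional` in the traceback.

```python
import json


def describe_saved_model(attrs):
    """Summarise the metadata found in a mapping of HDF5 root attributes.

    `attrs` can be a plain dict or an h5py attribute manager.
    """
    def _text(value):
        # Depending on the h5py version, string attributes come back as bytes or str.
        return value.decode("utf-8") if isinstance(value, bytes) else value

    config = json.loads(_text(attrs["model_config"]))
    return {
        "keras_version": _text(attrs.get("keras_version")),
        "backend": _text(attrs.get("backend")),
        "top_level_class": config.get("class_name"),
    }


def read_model_attrs(path):
    """Read the root attributes of an HDF5 model file (requires h5py)."""
    import h5py  # imported here so the helpers above work without it

    with h5py.File(path, "r") as handle:
        return dict(handle.attrs)


def loadable_by_standalone_keras(summary):
    """Assumption: old standalone Keras only knows 'Model' and 'Sequential'."""
    return summary["top_level_class"] in ("Model", "Sequential")
```

Calling `describe_saved_model(read_model_attrs(path))` on the model file inside a snapshot folder should show which Keras version and backend wrote it, without trying to load the model.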

There are a couple of potential solutions. I have tested neither, so I cannot confirm that they will work; you will need to test them yourself, and you may still run into failures.

The first is to run the AMD version of Faceswap on the cloud machine. Whilst the Nvidia backend cannot be run on AMD cards (due to the requirement for Nvidia's CUDA), it is possible to run the AMD version of Faceswap on Nvidia cards, as the backend we use for that version is platform agnostic. The AMD version of Faceswap does have a more limited feature set, however.

The other potential solution is to install the Nvidia version of Faceswap on the AMD machine and then replace the installed TensorFlow with a version compiled with ROCm support. As long as the version of TensorFlow used on both machines is the same, this should work. However, compiling TensorFlow is not a trivial task, and is certainly out of scope for the support I can provide. I also do not know how well your GPU is supported by ROCm.

Someone on this forum has done the latter, albeit with an earlier version of Faceswap. You can find the thread here:
viewtopic.php?p=1852
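
For the second option, the requirement that both machines run the same TensorFlow version can be checked by comparing the "Pip Packages" section of a crash report (or `pip freeze` output) from each machine. This is a rough sketch only; the function names and the default package list are assumptions made for illustration, and lines without a `==` pin (such as the `@ file://` entries) are simply skipped.

```python
def parse_pins(freeze_text):
    """Map lower-cased package names to versions from 'name==version' lines."""
    pins = {}
    for line in freeze_text.splitlines():
        name, sep, version = line.strip().partition("==")
        if sep:  # ignore lines that are not pinned with '=='
            pins[name.lower()] = version
    return pins


def version_mismatches(local_text, remote_text,
                       packages=("tensorflow", "keras", "h5py")):
    """Return {package: (local, remote)} for packages whose versions differ."""
    local = parse_pins(local_text)
    remote = parse_pins(remote_text)
    return {
        pkg: (local.get(pkg), remote.get(pkg))
        for pkg in packages
        if local.get(pkg) != remote.get(pkg)
    }


# Local values are taken from the crash report above; the remote values are
# invented purely to demonstrate the comparison.
local_report = "tensorflow==2.2.3\nKeras==2.2.4\nh5py==2.10.0"
remote_report = "tensorflow==2.4.1\nh5py==2.10.0"
print(version_mismatches(local_report, remote_report))
```

An empty result means the listed packages match on both machines; `None` in a pair means the package is not installed (or not pinned) on that side.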

User avatar
torzdf
Posts: 2651
Joined: Fri Jul 12, 2019 12:53 am
Answers: 159
Has thanked: 129 times
Been thanked: 622 times

Re: Training Crashes switching from NVidia to AMD

Post by torzdf »

My word is final

User avatar
jujuface
Posts: 2
Joined: Tue Jul 20, 2021 11:18 pm
Has thanked: 1 time

Re: Training Crashes switching from NVidia to AMD

Post by jujuface »

I was afraid of that, but I also have myself to blame for going this far without running a test. I'm guessing that even if I installed the AMD version of Faceswap on the cloud machine, I'd have to start training over.

That leaves me with installing the Nvidia version of Faceswap on my PC, which would mean installing Linux to swap out TensorFlow. I'm by no means an expert in Python, but I may just take up that task once I have some more spare time. Thank you for your insight!
