
Unable to start training - NaN error

Posted: Mon Jun 14, 2021 7:16 am
by Skoogz

I'm unable to start training, as I get a NaN error every time.

Also, is it normal that "Loading Trainer from Original plugin" takes about 15 minutes?

Code: Select all

06/14/2021 08:00:19 MainProcess     MainThread                     logger          log_setup                      INFO     Log level set to: INFO
06/14/2021 08:00:21 MainProcess     MainThread                     train           _get_images                    INFO     Model A Directory: 'C:\Users\Henri\Desktop\Deepfake\FaceA' (2718 images)
06/14/2021 08:00:21 MainProcess     MainThread                     train           _get_images                    INFO     Model B Directory: 'C:\Users\Henri\Desktop\Deepfake\FaceB' (2628 images)
06/14/2021 08:00:21 MainProcess     MainThread                     train           process                        INFO     Training data directory: C:\Users\Henri\Desktop\Deepfake\Model
06/14/2021 08:00:21 MainProcess     MainThread                     train           _monitor                       INFO     ===================================================
06/14/2021 08:00:21 MainProcess     MainThread                     train           _monitor                       INFO       Starting
06/14/2021 08:00:21 MainProcess     MainThread                     train           _monitor                       INFO       Press 'Stop' to save and quit
06/14/2021 08:00:21 MainProcess     MainThread                     train           _monitor                       INFO     ===================================================
06/14/2021 08:00:22 MainProcess     _training_0                    train           _training                      INFO     Loading data, this may take a while...
06/14/2021 08:00:22 MainProcess     _training_0                    plugin_loader   _import                        INFO     Loading Model from Original plugin...
06/14/2021 08:00:22 MainProcess     _training_0                    _base           _update_changed_config_items   INFO     Config item: 'allow_growth' has been updated from 'True' to 'False'
06/14/2021 08:00:22 MainProcess     _training_0                    _base           _replace_config                INFO     Using configuration saved in state file
06/14/2021 08:05:14 MainProcess     _training_0                    _base           _load                          INFO     Loaded model from disk: 'C:\Users\Henri\Desktop\Deepfake\Model\original.h5'
06/14/2021 08:05:14 MainProcess     _training_0                    plugin_loader   _import                        INFO     Loading Trainer from Original plugin...
06/14/2021 08:05:16 MainProcess     _training_0                    ag_logging      warn                           DEBUG    AutoGraph could not transform <bound method Logger.isEnabledFor of <FaceswapLogger lib.model.losses_tf (DEBUG)>> and will run it as-is.\nPlease report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.\nCause: module 'gast' has no attribute 'Index'\nTo silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
06/14/2021 08:05:16 MainProcess     _training_0                    ag_logging      warn                           DEBUG    AutoGraph could not transform <bound method Logger.findCaller of <FaceswapLogger lib.model.losses_tf (DEBUG)>> and will run it as-is.\nPlease report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.\nCause: module 'gast' has no attribute 'Index'\nTo silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
06/14/2021 08:05:16 MainProcess     _training_0                    ag_logging      warn                           DEBUG    AutoGraph could not transform <bound method Logger.makeRecord of <FaceswapLogger lib.model.losses_tf (DEBUG)>> and will run it as-is.\nPlease report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.\nCause: module 'gast' has no attribute 'Index'\nTo silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
06/14/2021 08:05:17 MainProcess     _training_0                    ag_logging      warn                           DEBUG    AutoGraph could not transform <bound method FaceswapFormatter.format of <lib.logger.FaceswapFormatter object at 0x00000166958A7F70>> and will run it as-is.\nPlease report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.\nCause: module 'gast' has no attribute 'Index'\nTo silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
06/14/2021 08:05:17 MainProcess     _training_0                    ag_logging      warn                           DEBUG    AutoGraph could not transform <bound method LossWrapper._apply_mask of <class 'lib.model.losses_tf.LossWrapper'>> and will run it as-is.\nPlease report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.\nCause: module 'gast' has no attribute 'Index'\nTo silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
06/14/2021 08:05:17 MainProcess     _training_0                    ag_logging      warn                           DEBUG    AutoGraph could not transform <bound method DSSIMObjective.call of <lib.model.losses_tf.DSSIMObjective object at 0x000001669EC32A60>> and will run it as-is.\nPlease report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.\nCause: module 'gast' has no attribute 'Index'\nTo silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
06/14/2021 08:22:08 MainProcess     _training_0                    _base           _collate_and_store_loss        CRITICAL NaN Detected. Loss: [nan, nan]
06/14/2021 08:22:08 MainProcess     MainThread                     train           _end_thread                    CRITICAL Error caught! Exiting...
06/14/2021 08:22:08 MainProcess     MainThread                     multithreading  join                           ERROR    Caught exception in thread: '_training_0'
06/14/2021 08:22:08 MainProcess     MainThread                     launcher        execute_script                 ERROR    A NaN was detected and you have NaN protection enabled. Training has been terminated.

Re: Unable to start training - NaN error

Posted: Mon Jun 14, 2021 8:25 am
by torzdf

Without knowing the specifics of your system setup it's hard to say for certain, but generally, no. A model should start training within a minute or two (depending on the model).

NaNs on the Original model also suggest potential hardware/driver issues.
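
For context, the "NaN protection" message in your log just means the trainer checks each iteration's loss values and stops as soon as any of them has become NaN. A minimal sketch of that kind of check (the idea only, not faceswap's actual implementation):

Code: Select all

import math

def check_loss(loss_values):
    """Stop training if any loss value has become NaN."""
    if any(math.isnan(val) for val in loss_values):
        raise RuntimeError(f"NaN Detected. Loss: {loss_values}")

check_loss([0.0423, 0.0391])                 # healthy iteration, passes
# check_loss([float("nan"), float("nan")])   # would terminate training, as in your log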


Re: Unable to start training - NaN error

Posted: Mon Jun 14, 2021 9:29 am
by Skoogz

I'm rocking an RTX 3060 with the latest NVIDIA drivers (466.77).

See further specs below:

Code: Select all

============ System Information ============
encoding:            cp1252
git_branch:          master
git_commits:         0775245 Bugfix - Manual Tool   - Fix bug when adding new face with "misaligned" filter applied
gpu_cuda:            No global version found. Check Conda packages for Conda Cuda
gpu_cudnn:           No global version found. Check Conda packages for Conda cuDNN
gpu_devices:         GPU_0: NVIDIA GeForce RTX 3060
gpu_devices_active:  GPU_0
gpu_driver:          466.77
gpu_vram:            GPU_0: 12288MB
os_machine:          AMD64
os_platform:         Windows-10-10.0.18363-SP0
os_release:          10
py_command:          C:\Users\Henri\faceswap/faceswap.py gui
py_conda_version:    conda 4.10.1
py_implementation:   CPython
py_version:          3.8.10
py_virtual_env:      True
sys_cores:           8
sys_processor:       Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
sys_ram:             Total: 16314MB, Available: 9111MB, Used: 7202MB, Free: 9111MB
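
Since the system info shows no global CUDA/cuDNN, here's a quick sanity check I can run from inside the faceswap Conda environment (a hypothetical script, not faceswap output) to confirm TensorFlow actually sees the 3060 and which CUDA/cuDNN build it ships with:

Code: Select all

# Hypothetical check script: report the CUDA/cuDNN versions TensorFlow
# was built against and which GPUs it can see.
import tensorflow as tf

build = tf.sysconfig.get_build_info()
print("TensorFlow:", tf.__version__)
print("Built with CUDA:", build.get("cuda_version"), "cuDNN:", build.get("cudnn_version"))
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))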

Re: Unable to start training - NaN error

Posted: Mon Jun 14, 2021 9:34 am
by torzdf

Re: Unable to start training - NaN error

Posted: Mon Jun 14, 2021 9:59 am
by Skoogz

Thank you! I will try the guide.