Unable to start training - NaN error

If training is failing to start and you are not receiving an error message telling you what to do, tell us about it here.


Forum rules

Read the FAQs and search the forum before posting a new topic.

This forum is for reporting errors with the Training process. If you want to get tips, or better understand the Training process, then you should look in the Training Discussion forum.

Please mark any answers that fixed your problems so others can find the solutions.


Unable to start training - NaN error

Post by Skoogz »

I'm unable to start training, as I get a NaN error every time.

Is it normal that "Loading Trainer from Original plugin" takes about 15 minutes?

Code:

06/14/2021 08:00:19 MainProcess     MainThread                     logger          log_setup                      INFO     Log level set to: INFO
06/14/2021 08:00:21 MainProcess     MainThread                     train           _get_images                    INFO     Model A Directory: 'C:\Users\Henri\Desktop\Deepfake\FaceA' (2718 images)
06/14/2021 08:00:21 MainProcess     MainThread                     train           _get_images                    INFO     Model B Directory: 'C:\Users\Henri\Desktop\Deepfake\FaceB' (2628 images)
06/14/2021 08:00:21 MainProcess     MainThread                     train           process                        INFO     Training data directory: C:\Users\Henri\Desktop\Deepfake\Model
06/14/2021 08:00:21 MainProcess     MainThread                     train           _monitor                       INFO     ===================================================
06/14/2021 08:00:21 MainProcess     MainThread                     train           _monitor                       INFO       Starting
06/14/2021 08:00:21 MainProcess     MainThread                     train           _monitor                       INFO       Press 'Stop' to save and quit
06/14/2021 08:00:21 MainProcess     MainThread                     train           _monitor                       INFO     ===================================================
06/14/2021 08:00:22 MainProcess     _training_0                    train           _training                      INFO     Loading data, this may take a while...
06/14/2021 08:00:22 MainProcess     _training_0                    plugin_loader   _import                        INFO     Loading Model from Original plugin...
06/14/2021 08:00:22 MainProcess     _training_0                    _base           _update_changed_config_items   INFO     Config item: 'allow_growth' has been updated from 'True' to 'False'
06/14/2021 08:00:22 MainProcess     _training_0                    _base           _replace_config                INFO     Using configuration saved in state file
06/14/2021 08:05:14 MainProcess     _training_0                    _base           _load                          INFO     Loaded model from disk: 'C:\Users\Henri\Desktop\Deepfake\Model\original.h5'
06/14/2021 08:05:14 MainProcess     _training_0                    plugin_loader   _import                        INFO     Loading Trainer from Original plugin...
06/14/2021 08:05:16 MainProcess     _training_0                    ag_logging      warn                           DEBUG    AutoGraph could not transform <bound method Logger.isEnabledFor of <FaceswapLogger lib.model.losses_tf (DEBUG)>> and will run it as-is.\nPlease report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.\nCause: module 'gast' has no attribute 'Index'\nTo silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
06/14/2021 08:05:16 MainProcess     _training_0                    ag_logging      warn                           DEBUG    AutoGraph could not transform <bound method Logger.findCaller of <FaceswapLogger lib.model.losses_tf (DEBUG)>> and will run it as-is.\nPlease report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.\nCause: module 'gast' has no attribute 'Index'\nTo silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
06/14/2021 08:05:16 MainProcess     _training_0                    ag_logging      warn                           DEBUG    AutoGraph could not transform <bound method Logger.makeRecord of <FaceswapLogger lib.model.losses_tf (DEBUG)>> and will run it as-is.\nPlease report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.\nCause: module 'gast' has no attribute 'Index'\nTo silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
06/14/2021 08:05:17 MainProcess     _training_0                    ag_logging      warn                           DEBUG    AutoGraph could not transform <bound method FaceswapFormatter.format of <lib.logger.FaceswapFormatter object at 0x00000166958A7F70>> and will run it as-is.\nPlease report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.\nCause: module 'gast' has no attribute 'Index'\nTo silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
06/14/2021 08:05:17 MainProcess     _training_0                    ag_logging      warn                           DEBUG    AutoGraph could not transform <bound method LossWrapper._apply_mask of <class 'lib.model.losses_tf.LossWrapper'>> and will run it as-is.\nPlease report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.\nCause: module 'gast' has no attribute 'Index'\nTo silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
06/14/2021 08:05:17 MainProcess     _training_0                    ag_logging      warn                           DEBUG    AutoGraph could not transform <bound method DSSIMObjective.call of <lib.model.losses_tf.DSSIMObjective object at 0x000001669EC32A60>> and will run it as-is.\nPlease report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.\nCause: module 'gast' has no attribute 'Index'\nTo silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
06/14/2021 08:22:08 MainProcess     _training_0                    _base           _collate_and_store_loss        CRITICAL NaN Detected. Loss: [nan, nan]
06/14/2021 08:22:08 MainProcess     MainThread                     train           _end_thread                    CRITICAL Error caught! Exiting...
06/14/2021 08:22:08 MainProcess     MainThread                     multithreading  join                           ERROR    Caught exception in thread: '_training_0'
06/14/2021 08:22:08 MainProcess     MainThread                     launcher        execute_script                 ERROR    A NaN was detected and you have NaN protection enabled. Training has been terminated.
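
For context, the "NaN protection" mentioned in the last two log lines is a check on the loss values returned by each training iteration: if any of them is not a finite number, training is terminated rather than allowed to continue. A minimal sketch of that kind of check (illustrative only, not faceswap's actual implementation) might look like this:

Code:

import math

def check_loss(loss_values):
    """Abort training if any loss value is NaN or infinite."""
    if any(not math.isfinite(value) for value in loss_values):
        raise RuntimeError(f"NaN detected. Loss: {loss_values}")

check_loss([0.0412, 0.0389])            # finite losses pass silently
try:
    check_loss([float("nan"), 0.0389])  # a non-finite loss triggers termination
except RuntimeError as err:
    print(err)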

Re: Unable to start training - NaN error

Post by torzdf »

Without knowing the specifics of your system setup, it's hard to say... but, generally, no. A model should start training within a minute or two (depending on the model).

NaNs on the Original model also suggest potential hardware/driver issues.
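
If you want to rule out the GPU itself, one quick sanity check (just an illustrative sketch, assuming the TensorFlow-with-GPU setup that faceswap uses) is to run a batch of large matrix multiplications on the card and confirm the results stay finite:

Code:

# Illustrative sketch only: repeated large matrix multiplications should never
# produce NaN/Inf values. If they do, that points towards a hardware or driver
# problem rather than your training data or model settings.
import tensorflow as tf

with tf.device("/GPU:0"):
    for i in range(100):
        a = tf.random.normal((2048, 2048))
        b = tf.random.normal((2048, 2048))
        result = tf.linalg.matmul(a, b)
        if not bool(tf.reduce_all(tf.math.is_finite(result))):
            print(f"Non-finite values on iteration {i} - suspect hardware/driver")
            break
    else:
        print("All results finite - basic GPU maths looks healthy")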

My word is final


Re: Unable to start training - NaN error

Post by Skoogz »

I'm rocking an RTX 3060 with the latest NVIDIA drivers (466.77).

See further specs below:

Code:

============ System Information ============
encoding:            cp1252
git_branch:          master
git_commits:         0775245 Bugfix - Manual Tool   - Fix bug when adding new face with "misaligned" filter applied
gpu_cuda:            No global version found. Check Conda packages for Conda Cuda
gpu_cudnn:           No global version found. Check Conda packages for Conda cuDNN
gpu_devices:         GPU_0: NVIDIA GeForce RTX 3060
gpu_devices_active:  GPU_0
gpu_driver:          466.77
gpu_vram:            GPU_0: 12288MB
os_machine:          AMD64
os_platform:         Windows-10-10.0.18363-SP0
os_release:          10
py_command:          C:\Users\Henri\faceswap/faceswap.py gui
py_conda_version:    conda 4.10.1
py_implementation:   CPython
py_version:          3.8.10
py_virtual_env:      True
sys_cores:           8
sys_processor:       Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
sys_ram:             Total: 16314MB, Available: 9111MB, Used: 7202MB, Free: 9111MB

Re: Unable to start training - NaN error

Post by torzdf »

My word is final


Re: Unable to start training - NaN error

Post by Skoogz »

Thank you! I will try the guide.
