Training freezes after 1100 iterations

Post by **kimjamess** » Sun Nov 20, 2022 3:13 pm

Hi,

I am trying to train a Phaze-A model for 30k iterations, but training freezes after around 1100 iterations.
At that point it simply stops: "phaze_a_state.json" is no longer updated, and there are no messages, no log entries, and no crash.
I also checked CPU/GPU memory utilization during training, and found that utilization drops to nearly 0 around the time training reaches 1100 iterations (maybe too obvious?).

What could be causing this?
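One way to pin down when the freeze begins is to watch the modification time of the state file from a second terminal. The following is only a minimal sketch using the Python standard library: the function names, the timeout, and the polling interval are hypothetical, and it assumes the state file is rewritten at every save, as the description above suggests.

```python
import os
import time
from typing import Optional


def seconds_since_update(path: str, now: Optional[float] = None) -> float:
    """Return how many seconds have passed since `path` was last modified."""
    current = time.time() if now is None else now
    return current - os.path.getmtime(path)


def is_stalled(path: str, timeout_s: float, now: Optional[float] = None) -> bool:
    """Return True when `path` has not been modified within `timeout_s` seconds."""
    return seconds_since_update(path, now) > timeout_s


def watch(path: str, timeout_s: float, poll_s: float = 30.0) -> None:
    """Poll until the file goes stale, then report when the stall was noticed."""
    while not is_stalled(path, timeout_s):
        time.sleep(poll_s)
    print(
        f"{path} has not changed for more than {timeout_s:.0f}s "
        f"(noticed at {time.strftime('%H:%M:%S')})"
    )
```

Calling `watch()` with the path to "phaze_a_state.json" and a timeout comfortably longer than the gap between two saves would print a timestamp that can be compared against GPU utilization graphs or instance metrics.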

Environment: AWS SageMaker, EC2 instance "ml.p3.2xlarge" (GPU: Tesla V100, GPU memory: 16GB), CUDA version: 11.2, TensorFlow version: 2.6
(By the way, it works completely fine on an NVIDIA GeForce RTX 2070 with CUDA 11.7.)

Parameters: batch size=6, save_interval=100, snapshot_interval=1000

faceswap.log:
11/18/2022 15:56:58 MainProcess MainThread logger log_setup INFO Log level set to: INFO
11/18/2022 15:57:00 MainProcess MainThread train _get_images INFO Model A Directory: '<path>' (152 images)
11/18/2022 15:57:00 MainProcess MainThread train _get_images INFO Model B Directory: '<path>' (152 images)
11/18/2022 15:57:00 MainProcess MainThread train _validate_image_counts WARNING At least one of your input folders contains fewer than 250 images. Results are likely to be poor.
11/18/2022 15:57:00 MainProcess MainThread train _validate_image_counts WARNING You need to provide a significant number of images to successfully train a Neural Network. Aim for between 500 - 5000 images per side.
11/18/2022 15:57:00 MainProcess MainThread train process INFO Training data directory: <path>
11/18/2022 15:57:00 MainProcess MainThread train _monitor INFO ===================================================
11/18/2022 15:57:00 MainProcess MainThread train _monitor INFO Starting
11/18/2022 15:57:00 MainProcess MainThread train _monitor INFO Press 'ENTER' to save and quit
11/18/2022 15:57:00 MainProcess MainThread train _monitor INFO Press 'S' to save model weights immediately
11/18/2022 15:57:00 MainProcess MainThread train _monitor INFO ===================================================
11/18/2022 15:57:01 MainProcess _training_0 train _training INFO Loading data, this may take a while...
11/18/2022 15:57:01 MainProcess _training_0 plugin_loader _import INFO Loading Model from Phaze_A plugin...
11/18/2022 15:57:01 MainProcess _training_0 _base _load INFO No existing state file found. Generating.
11/18/2022 15:57:01 MainProcess _training_0 _base _set_tf_settings INFO Setting allow growth for GPU: PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
11/18/2022 15:57:19 MainProcess _training_0 _base load INFO Loading weights for layer 'encoder'
11/18/2022 15:57:19 MainProcess _training_0 _base load INFO Loading weights for layer 'fc_gblock'
11/18/2022 15:57:19 MainProcess _training_0 _base load INFO Loading weights for layer 'g_block_both'
11/18/2022 15:57:20 MainProcess _training_0 _base load INFO Loading weights for layer 'decoder_a'
11/18/2022 15:57:20 MainProcess _training_0 _base load INFO Loading weights for layer 'decoder_b'
11/18/2022 15:57:20 MainProcess _training_0 _base load WARNING Skipping layer decoder_both as not in current_model.
11/18/2022 15:57:20 MainProcess _training_0 plugin_loader _import INFO Loading Trainer from Original plugin...
11/18/2022 15:58:04 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.10713, face_b: 0.13440
11/18/2022 15:59:21 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.02849, face_b: 0.03427
11/18/2022 16:00:37 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.02030, face_b: 0.02534
11/18/2022 16:01:53 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.01805, face_b: 0.02388
11/18/2022 16:03:09 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.01723, face_b: 0.02271
11/18/2022 16:04:26 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.01580, face_b: 0.02199
11/18/2022 16:05:47 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.01605, face_b: 0.02144
11/18/2022 16:07:08 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.01602, face_b: 0.02103
11/18/2022 16:08:24 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.01537, face_b: 0.02096
11/18/2022 16:09:41 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.01491, face_b: 0.01986
11/18/2022 16:10:57 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.01496, face_b: 0.01975
11/18/2022 16:10:59 MainProcess _training_0 backup_restore snapshot_models INFO Saved snapshot (1000 iterations)
11/18/2022 16:12:20 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.01510, face_b: 0.01953

(By the way, training stops after around 370 iterations with save_interval=1. Does the value of the "save_interval" parameter matter?)

Any suggestions and tips will be appreciated.
Thank you in advance for your time and concern!

Post by **bryanlyon** » Sun Nov 20, 2022 6:37 pm

Hmmm, can't be sure, but it looks like a storage access problem. ml.p3.2xlarge doesn't have local storage, and its storage is synchronized over the network. Not sure why the hangup is happening though. Try a much higher save interval, not a lower one. That setting controls how many iterations pass between saves of your model.
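A rough way to test the storage hypothesis is to time repeated writes into the folder where the model is saved and look for writes that stall. The sketch below uses only the Python standard library; `time_writes`, the file size, and the repeat count are hypothetical choices for illustration, not anything Faceswap itself provides.

```python
import os
import tempfile
import time
from typing import List


def time_writes(directory: str, size_mb: int = 64, repeats: int = 5) -> List[float]:
    """Write a `size_mb` MiB file into `directory` `repeats` times.

    Returns the wall-clock seconds each write took, including the fsync.
    """
    payload = os.urandom(1024 * 1024)  # 1 MiB of incompressible data
    durations = []
    for _attempt in range(repeats):
        fd, path = tempfile.mkstemp(dir=directory, suffix=".probe")
        start = time.perf_counter()
        try:
            with os.fdopen(fd, "wb") as handle:
                for _chunk in range(size_mb):
                    handle.write(payload)
                handle.flush()
                os.fsync(handle.fileno())  # force the data out to the volume
            durations.append(time.perf_counter() - start)
        finally:
            os.remove(path)  # clean up the probe file either way
    return durations
```

Running `time_writes()` against the model folder while training is active, and again while it is idle, would show whether individual writes occasionally take far longer than the rest. Consistently fast writes would point away from storage as the cause.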

Post by **kimjamess** » Mon Nov 21, 2022 5:05 pm

Thank you for the reply. I'm now trying your suggestion. However, I don't quite understand the "storage access problem" you mentioned. Do you think there is not enough storage?

I tried a larger volume (by adjusting the volume_size parameter) and a higher save interval (5000), but neither worked. Do you have any other suggestions?
