
Training freezes after 1100 iterations

Posted: Sun Nov 20, 2022 3:13 pm
by kimjamess

Hi,

I am trying to train a Phaze-A model for 30k iterations, but it freezes after around 1,100 iterations.
After that point the training simply stops, with no further updates to "phaze_a_state.json", no messages, no logs, and no crash.
I also checked CPU/GPU memory utilization during training and found that utilization drops to nearly 0 around the time the training reaches ~1,100 iterations (perhaps that is to be expected?).

1. What could be causing this?
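
For reference, the utilization check above can be reproduced with a small poller along these lines (just a sketch; it assumes nvidia-smi is on the PATH and simply logs a timestamped reading every second, which makes it easy to pin down exactly when the GPU goes idle):

    import subprocess
    import time
    from datetime import datetime

    # Poll nvidia-smi once per second and log GPU utilization and memory use,
    # so the exact moment the training process stalls can be pinned down.
    QUERY = ["nvidia-smi",
             "--query-gpu=utilization.gpu,memory.used",
             "--format=csv,noheader,nounits"]

    while True:
        stats = subprocess.run(QUERY, capture_output=True, text=True).stdout.strip()
        print(f"{datetime.now().isoformat(timespec='seconds')}  {stats}", flush=True)
        time.sleep(1)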

Environment: AWS SageMaker, instance type "ml.p3.2xlarge" (GPU: Tesla V100, 16 GB GPU memory), CUDA 11.2, TensorFlow 2.6
(By the way, the same setup works completely fine on an NVIDIA GeForce RTX 2070 with CUDA 11.7.)

Parameters: batch size=6, save_interval=100, snapshot_interval=1000
faceswap.log:
11/18/2022 15:56:58 MainProcess MainThread logger log_setup INFO Log level set to: INFO
11/18/2022 15:57:00 MainProcess MainThread train _get_images INFO Model A Directory: '<path>' (152 images)
11/18/2022 15:57:00 MainProcess MainThread train _get_images INFO Model B Directory: '<path>' (152 images)
11/18/2022 15:57:00 MainProcess MainThread train _validate_image_counts WARNING At least one of your input folders contains fewer than 250 images. Results are likely to be poor.
11/18/2022 15:57:00 MainProcess MainThread train _validate_image_counts WARNING You need to provide a significant number of images to successfully train a Neural Network. Aim for between 500 - 5000 images per side.
11/18/2022 15:57:00 MainProcess MainThread train process INFO Training data directory: <path>
11/18/2022 15:57:00 MainProcess MainThread train _monitor INFO ===================================================
11/18/2022 15:57:00 MainProcess MainThread train _monitor INFO Starting
11/18/2022 15:57:00 MainProcess MainThread train _monitor INFO Press 'ENTER' to save and quit
11/18/2022 15:57:00 MainProcess MainThread train _monitor INFO Press 'S' to save model weights immediately
11/18/2022 15:57:00 MainProcess MainThread train _monitor INFO ===================================================
11/18/2022 15:57:01 MainProcess _training_0 train _training INFO Loading data, this may take a while...
11/18/2022 15:57:01 MainProcess _training_0 plugin_loader _import INFO Loading Model from Phaze_A plugin...
11/18/2022 15:57:01 MainProcess _training_0 _base _load INFO No existing state file found. Generating.
11/18/2022 15:57:01 MainProcess _training_0 _base _set_tf_settings INFO Setting allow growth for GPU: PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
11/18/2022 15:57:19 MainProcess _training_0 _base load INFO Loading weights for layer 'encoder'
11/18/2022 15:57:19 MainProcess _training_0 _base load INFO Loading weights for layer 'fc_gblock'
11/18/2022 15:57:19 MainProcess _training_0 _base load INFO Loading weights for layer 'g_block_both'
11/18/2022 15:57:20 MainProcess _training_0 _base load INFO Loading weights for layer 'decoder_a'
11/18/2022 15:57:20 MainProcess _training_0 _base load INFO Loading weights for layer 'decoder_b'
11/18/2022 15:57:20 MainProcess _training_0 _base load WARNING Skipping layer decoder_both as not in current_model.
11/18/2022 15:57:20 MainProcess _training_0 plugin_loader _import INFO Loading Trainer from Original plugin...
11/18/2022 15:58:04 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.10713, face_b: 0.13440
11/18/2022 15:59:21 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.02849, face_b: 0.03427
11/18/2022 16:00:37 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.02030, face_b: 0.02534
11/18/2022 16:01:53 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.01805, face_b: 0.02388
11/18/2022 16:03:09 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.01723, face_b: 0.02271
11/18/2022 16:04:26 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.01580, face_b: 0.02199
11/18/2022 16:05:47 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.01605, face_b: 0.02144
11/18/2022 16:07:08 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.01602, face_b: 0.02103
11/18/2022 16:08:24 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.01537, face_b: 0.02096
11/18/2022 16:09:41 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.01491, face_b: 0.01986
11/18/2022 16:10:57 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.01496, face_b: 0.01975
11/18/2022 16:10:59 MainProcess _training_0 backup_restore snapshot_models INFO Saved snapshot (1000 iterations)
11/18/2022 16:12:20 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.01510, face_b: 0.01953

(By the way, the training stops after ~370 iterations with save_interval=1.

2. Does the value of the "save_interval" parameter matter?)

Any suggestions or tips would be appreciated.
Thank you in advance for your time and concern!


Re: Training freezes after 1100 iterations

Posted: Sun Nov 20, 2022 6:37 pm
by bryanlyon

Hmmm, I can't be sure, but it looks like a storage access problem. ml.p3.2xlarge doesn't have local storage; its volumes are network-backed, so every model save has to go over the network. I'm not sure why the hang-up is happening, though. Try a much higher save interval, not a lower one: that setting controls how many iterations pass between saving out your models.
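
If you want to rule storage in or out, a quick test along these lines will show whether writes to that volume stall or crawl (just a sketch; MODEL_DIR and SIZE_MB are placeholders, so point them at your actual model directory and the rough size of your saves):

    import os
    import time

    # Write a dummy file roughly the size of one model save into the model
    # directory and time it, to see whether the network-backed volume stalls.
    MODEL_DIR = "/path/to/model/dir"  # placeholder: use your real model directory
    SIZE_MB = 200                     # placeholder: roughly the size of your saves

    payload = os.urandom(1024 * 1024)
    target = os.path.join(MODEL_DIR, "io_test.bin")

    start = time.time()
    with open(target, "wb") as handle:
        for _ in range(SIZE_MB):
            handle.write(payload)
        handle.flush()
        os.fsync(handle.fileno())
    elapsed = time.time() - start

    print(f"Wrote {SIZE_MB} MB in {elapsed:.2f}s ({SIZE_MB / elapsed:.1f} MB/s)")
    os.remove(target)

If repeated runs show a healthy, steady write rate, the save path probably isn't the culprit.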


Re: Training freezes after 1100 iterations

Posted: Mon Nov 21, 2022 5:05 pm
by kimjamess

Thank you for the reply. I'm trying your suggestions now, but I don't quite understand the "storage access problem" you mentioned. Do you think the storage it is using is not big enough?

  • I tried a bigger storage volume (by adjusting the volume_size parameter) and a higher save interval (5000), but neither worked. Do you have any other suggestions?
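
In case raw capacity is the question, a quick check like this shows how much space the volume actually has free (just a sketch; the path is a placeholder for wherever the model and training data live):

    import shutil

    # Placeholder path: point this at the volume holding the model / training data.
    PATH = "/path/to/training/volume"

    usage = shutil.disk_usage(PATH)
    gib = 1024 ** 3
    print(f"total: {usage.total / gib:.1f} GiB, "
          f"used: {usage.used / gib:.1f} GiB, "
          f"free: {usage.free / gib:.1f} GiB")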