Training freezes after 1100 iterations


kimjamess
Posts: 2
Joined: Sun Nov 20, 2022 2:24 pm

Training freezes after 1100 iterations

Post by kimjamess »

Hi,

I am trying to train a Phaze-A model for 30k iterations, but it freezes after around 1,100 iterations: training stops (no further updates to "phaze_a_state.json") with no messages, no logs, and no crash. I also checked CPU/GPU utilization during training and saw it drop to nearly zero right around the ~1,100-iteration mark (maybe that is just stating the obvious?); a sketch of the kind of check I mean is below.

  1. What could be the possible causes of this?
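The monitoring loop is roughly this; the state-file path and the 10-second interval are placeholders, not my exact setup:

    import os
    import subprocess
    import time

    STATE_FILE = "/path/to/model_dir/phaze_a_state.json"  # placeholder: your model directory
    SAMPLE_EVERY = 10  # seconds between samples; arbitrary

    while True:
        # GPU utilization and memory from nvidia-smi
        gpu = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used", "--format=csv,noheader"],
            capture_output=True, text=True,
        ).stdout.strip()
        # The state file should be rewritten on each model save, so a frozen
        # modification time lines up with the point where training stops
        state_mtime = time.strftime("%H:%M:%S", time.localtime(os.path.getmtime(STATE_FILE)))
        print(f"{time.strftime('%H:%M:%S')}  gpu: {gpu}  state file last written: {state_mtime}")
        time.sleep(SAMPLE_EVERY)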

Environment: AWS SageMaker, EC2 instance "ml.p3.2xlarge" (GPU: Tesla V100, GPU memory: 16 GB), CUDA version: 11.2, TensorFlow version: 2.6
(By the way, the same setup works completely fine on an NVIDIA GeForce RTX 2070 with CUDA 11.7.)

Parameters: batch size=6, save_interval=100, snapshot_interval=1000
faceswap.log:
11/18/2022 15:56:58 MainProcess MainThread logger log_setup INFO Log level set to: INFO
11/18/2022 15:57:00 MainProcess MainThread train _get_images INFO Model A Directory: '<path>' (152 images)
11/18/2022 15:57:00 MainProcess MainThread train _get_images INFO Model B Directory: '<path>' (152 images)
11/18/2022 15:57:00 MainProcess MainThread train _validate_image_counts WARNING At least one of your input folders contains fewer than 250 images. Results are likely to be poor.
11/18/2022 15:57:00 MainProcess MainThread train _validate_image_counts WARNING You need to provide a significant number of images to successfully train a Neural Network. Aim for between 500 - 5000 images per side.
11/18/2022 15:57:00 MainProcess MainThread train process INFO Training data directory: <path>
11/18/2022 15:57:00 MainProcess MainThread train _monitor INFO ===================================================
11/18/2022 15:57:00 MainProcess MainThread train _monitor INFO Starting
11/18/2022 15:57:00 MainProcess MainThread train _monitor INFO Press 'ENTER' to save and quit
11/18/2022 15:57:00 MainProcess MainThread train _monitor INFO Press 'S' to save model weights immediately
11/18/2022 15:57:00 MainProcess MainThread train _monitor INFO ===================================================
11/18/2022 15:57:01 MainProcess _training_0 train _training INFO Loading data, this may take a while...
11/18/2022 15:57:01 MainProcess _training_0 plugin_loader _import INFO Loading Model from Phaze_A plugin...
11/18/2022 15:57:01 MainProcess _training_0 _base _load INFO No existing state file found. Generating.
11/18/2022 15:57:01 MainProcess _training_0 _base _set_tf_settings INFO Setting allow growth for GPU: PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
11/18/2022 15:57:19 MainProcess _training_0 _base load INFO Loading weights for layer 'encoder'
11/18/2022 15:57:19 MainProcess _training_0 _base load INFO Loading weights for layer 'fc_gblock'
11/18/2022 15:57:19 MainProcess _training_0 _base load INFO Loading weights for layer 'g_block_both'
11/18/2022 15:57:20 MainProcess _training_0 _base load INFO Loading weights for layer 'decoder_a'
11/18/2022 15:57:20 MainProcess _training_0 _base load INFO Loading weights for layer 'decoder_b'
11/18/2022 15:57:20 MainProcess _training_0 _base load WARNING Skipping layer decoder_both as not in current_model.
11/18/2022 15:57:20 MainProcess _training_0 plugin_loader _import INFO Loading Trainer from Original plugin...
11/18/2022 15:58:04 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.10713, face_b: 0.13440
11/18/2022 15:59:21 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.02849, face_b: 0.03427
11/18/2022 16:00:37 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.02030, face_b: 0.02534
11/18/2022 16:01:53 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.01805, face_b: 0.02388
11/18/2022 16:03:09 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.01723, face_b: 0.02271
11/18/2022 16:04:26 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.01580, face_b: 0.02199
11/18/2022 16:05:47 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.01605, face_b: 0.02144
11/18/2022 16:07:08 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.01602, face_b: 0.02103
11/18/2022 16:08:24 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.01537, face_b: 0.02096
11/18/2022 16:09:41 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.01491, face_b: 0.01986
11/18/2022 16:10:57 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.01496, face_b: 0.01975
11/18/2022 16:10:59 MainProcess _training_0 backup_restore snapshot_models INFO Saved snapshot (1000 iterations)
11/18/2022 16:12:20 MainProcess _training_0 _base _save INFO [Saved models] - Average loss since last save: face_a: 0.01510, face_b: 0.01953

(By the way, the training stops after roughly 370 iterations when save_interval=1.)

  2. Does the value of the "save_interval" parameter matter?

Any suggestions and tips will be appreciated.
Thank you in advance for your time and concern!

bryanlyon
Site Admin
Posts: 793
Joined: Fri Jul 12, 2019 12:49 am
Answers: 44
Location: San Francisco
Has thanked: 4 times
Been thanked: 218 times

Re: Training freezes after 1100 iterations

Post by bryanlyon »

Hmmm, I can't be sure, but it looks like a storage access problem. ml.p3.2xlarge doesn't have local storage; everything is synchronized over the network. I'm not sure why the hang-up is happening, though. Try a much higher save interval, not a lower one. That setting controls how many iterations pass between writing your models out to disk.
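If you want to sanity-check the storage theory, a rough test like the sketch below will show whether writes to the model folder start to stall; the path and payload size are placeholders, not tuned to your model:

    import os
    import time

    MODEL_DIR = "/path/to/model_dir"          # placeholder: wherever faceswap saves the model
    PAYLOAD = os.urandom(200 * 1024 * 1024)   # ~200 MB dummy payload; size is a guess

    for i in range(20):
        target = os.path.join(MODEL_DIR, f"io_test_{i}.bin")
        start = time.time()
        with open(target, "wb") as handle:
            handle.write(PAYLOAD)
            handle.flush()
            os.fsync(handle.fileno())  # force the write through to the backing volume
        print(f"write {i}: {time.time() - start:.2f}s")
        os.remove(target)

If the per-write times climb or one of them hangs, the problem is the volume, not faceswap.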

kimjamess
Posts: 2
Joined: Sun Nov 20, 2022 2:24 pm

Re: Training freezes after 1100 iterations

Post by kimjamess »

Thank you for the reply. I am trying your suggestions now, but I don't quite understand the "storage access problem" you mentioned. Do you think the storage it is using is not large enough?

  • I tried a bigger storage volume by adjusting the volume_size parameter (roughly as in the sketch below) and a much higher save interval (5000), but neither worked. Do you have any other suggestions?
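For reference, by volume_size I mean the storage size passed when launching the job; this is only a rough sketch assuming the SageMaker Python SDK Estimator, with everything except volume_size as a placeholder:

    import sagemaker
    from sagemaker.estimator import Estimator

    # Placeholders: the image, role and S3 paths are specific to my setup
    estimator = Estimator(
        image_uri="<my-faceswap-training-image>",
        role=sagemaker.get_execution_role(),
        instance_count=1,
        instance_type="ml.p3.2xlarge",
        volume_size=100,  # size in GB of the EBS volume attached to the training instance
    )
    estimator.fit({"training": "s3://<bucket>/<training-data-prefix>"})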