Transfer learning (load weights) fails with same model

acaint
Posts: 9
Joined: Sat Nov 06, 2021 3:04 pm
Answers: 1
Has thanked: 2 times

Post by acaint »

Hello,

After training a model for a long time on faces A to B, I wanted to see whether transfer learning could give C to B a head start.
I loaded the A to B project and changed only the input from A to C and the model directory.

Then I set Load Weights to point to the A to B model directory.

However, training could not start because of this error:
11/08/2021 12:00:42 CRITICAL Error caught! Exiting...
11/08/2021 12:00:42 ERROR Caught exception in thread: '_training_0'
11/08/2021 12:00:42 ERROR You are attempting to load weights from a 'dfl_sae_df' model into a 'dfl_sae' model. This is not supported.

I would like to point out that both model configurations are exactly the same, with DF as the architecture in both.

Any instructions? Thanks in advance.

torzdf
Posts: 2649
Joined: Fri Jul 12, 2019 12:53 am
Answers: 159
Has thanked: 128 times
Been thanked: 623 times

Post by torzdf »

Possibly an oversight on my part regarding the dfl-sae model. Could you zip up and share your model so I can analyze it? You can send me the link via private message.

My word is final

Post by acaint »

Hi, thanks for the quick reply. The model file is pretty large (an h5 file over 2 GB) and does not zip much smaller.

I'll see if I can reproduce the issue with a smaller config.

Here is the dfl_sae_state.json; I am not sure whether it is helpful.

{
  "name": "dfl_sae",
  "sessions": {
    "1": {
      "timestamp": 1636302930.190047,
      "no_logs": false,
      "loss_names": [
        "total",
        "face_a_0",
        "face_a_1",
        "face_a_2",
        "mask_a",
        "face_b_0",
        "face_b_1",
        "face_b_2",
        "mask_b"
      ],
      "batchsize": 16,
      "iterations": 9501,
      "config": {
        "learning_rate": 5e-05,
        "epsilon_exponent": -7,
        "allow_growth": true,
        "nan_protection": true,
        "convert_batchsize": 4,
        "eye_multiplier": 3,
        "mouth_multiplier": 2,
        "clipnorm": true
      }
    },
    "2": {
      "timestamp": 1636313397.1599169,
      "no_logs": false,
      "loss_names": [
        "total",
        "face_a_0",
        "face_a_1",
        "face_a_2",
        "mask_a",
        "face_b_0",
        "face_b_1",
        "face_b_2",
        "mask_b"
      ],
      "batchsize": 16,
      "iterations": 35532,
      "config": {
        "learning_rate": 5e-05,
        "epsilon_exponent": -7,
        "allow_growth": true,
        "nan_protection": true,
        "convert_batchsize": 4,
        "eye_multiplier": 3,
        "mouth_multiplier": 2,
        "clipnorm": true
      }
    }
  },
  "lowest_avg_loss": {
    "a": 0.06255769795097876,
    "b": 0.06906740904378239
  },
  "iterations": 45033,
  "config": {
    "centering": "face",
    "coverage": 90.0,
    "optimizer": "adam",
    "learning_rate": 5e-05,
    "epsilon_exponent": -7,
    "allow_growth": true,
    "mixed_precision": true,
    "nan_protection": true,
    "convert_batchsize": 16,
    "loss_function": "ssim",
    "mask_loss_function": "mse",
    "l2_reg_term": 100,
    "eye_multiplier": 3,
    "mouth_multiplier": 2,
    "penalized_mask_loss": true,
    "mask_type": "bisenet-fp_face",
    "mask_blur_kernel": 3,
    "mask_threshold": 4,
    "learn_mask": true,
    "input_size": 256,
    "clipnorm": true,
    "architecture": "df",
    "autoencoder_dims": 640,
    "encoder_dims": 64,
    "decoder_dims": 32,
    "multiscale_decoder": true
  }
}
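
When comparing two models before pointing Load Weights at one of them, a state file like the one above can be summarized with a few lines of Python. This is only a minimal sketch: it relies solely on the keys visible in the posted JSON (`name`, `sessions`, `iterations`, `config.architecture`), the function names are hypothetical, and it is not part of faceswap itself.

```python
import json
from pathlib import Path


def load_state(state_path):
    """Read a *_state.json file into a dict (path is supplied by the caller)."""
    return json.loads(Path(state_path).read_text(encoding="utf-8"))


def summarize_state(state):
    """Pick out the fields that identify the model and its training progress.

    Only keys that appear in the state file posted above are used.
    """
    sessions = state.get("sessions", {})
    return {
        "name": state.get("name"),
        "architecture": state.get("config", {}).get("architecture"),
        "iterations": state.get("iterations"),
        # Sum of the per-session counts, to cross-check the top-level total.
        "session_iterations": sum(
            session.get("iterations", 0) for session in sessions.values()
        ),
    }


def same_name_and_architecture(state_a, state_b):
    """True if both state dicts report the same model name and architecture."""
    a, b = summarize_state(state_a), summarize_state(state_b)
    return (a["name"], a["architecture"]) == (b["name"], b["architecture"])
```

For the file above, this reports the name `dfl_sae`, the architecture `df`, and session iterations of 9501 + 35532 = 45033, which matches the top-level `iterations` value.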

Post by torzdf »

This should be fixed in the latest update. Please try updating.

Post by torzdf »

I had to roll back this fix. Will look for a different way. See here for reference:

viewtopic.php?f=6&t=1805

Post by acaint »

Hi,

yes, thank you for the quick action.
I was able to get it working by starting training with the transfer learning. It then crashed to 1 iteration as described in the other thread. But when I made copies of the generated 1-iteration model and its JSON and renamed them to the old filenames, it could proceed again.

However, the graph and analysis pages froze, but I use TensorBoard to get around that.

In any case, the attempt is very much appreciated. I got the transfer learning working, and the model is now training with some file name shenanigans.

2021-11-09 21:01 2 477 014 592 dfl_sae.h5 <-- same file as dfl_sae_df.h5 but renamed
2021-11-10 20:31 2 477 014 592 dfl_sae_df.h5
2021-11-10 19:22 2 477 014 592 dfl_sae_df.h5.bk
2021-11-10 20:31 2 170 dfl_sae_df_state.json
2021-11-10 19:22 2 169 dfl_sae_df_state.json.bk
2021-11-09 21:01 1 564 dfl_sae_state.json <-- same file as dfl_sae_df_state.json but renamed

It seems to need them only at startup; it does not write to them...
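
The copy-and-rename step in this post could be scripted roughly as follows. This is a hedged sketch of that temporary workaround only: the file names are taken from the listing above, the function name and the model directory argument are hypothetical, and nothing here is faceswap code. Since the issue is reported as fixed later in this thread, updating is the better route where possible.

```python
import shutil
from pathlib import Path


def copy_under_old_names(model_dir, new_stem="dfl_sae_df", old_stem="dfl_sae"):
    """Copy the model and state files so they also exist under the old names.

    Existing files under the old names are left untouched, and the originals
    are never moved or modified.
    """
    model_dir = Path(model_dir)
    pairs = [
        (model_dir / f"{new_stem}.h5", model_dir / f"{old_stem}.h5"),
        (model_dir / f"{new_stem}_state.json", model_dir / f"{old_stem}_state.json"),
    ]

    # Check every source first so a missing file does not leave a half-done copy.
    for src, _ in pairs:
        if not src.exists():
            raise FileNotFoundError(src)

    copied = []
    for src, dst in pairs:
        if dst.exists():
            continue  # do not overwrite a file that is already there
        shutil.copy2(src, dst)  # copy2 also preserves timestamps where it can
        copied.append(dst)
    return copied
```

Making a backup of the model directory before running anything like this is advisable, since the h5 file in the listing is over 2 GB and the copy doubles the disk space used.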

Post by torzdf »

Ok, this should now be fixed in a way which doesn't break backwards compatibility.

Post by acaint »

Thank you, it is working brilliantly! =)
