Loss can't go down, spent over 48 hrs?

If training is failing to start, and you are not receiving an error message telling you what to do, tell us about it here


Forum rules

Read the FAQs and search the forum before posting a new topic.

This forum is for reporting errors with the Training process. If you want to get tips, or better understand the Training process, then you should look in the Training Discussion forum.

Please mark any answers that fixed your problems so others can find the solutions.

adam_macchiato
Posts: 16
Joined: Tue Jul 26, 2022 5:26 am
Has thanked: 4 times

Loss can't go down, spent over 48 hrs?

Post by adam_macchiato »

Hi, I am using Phaze-A for training with the StoJo preset. It has been training for 4 days, but the loss has been stuck around 0.017-0.016 for the last 48 hours and still won't go down.

Here are my settings:

RTX 3080
Phaze-A StoJo preset (efficientnet v2_s)
Learning rate: 4.5e, EE: -5
Mixed Precision and NaN protection on
Face A and B: both over 10K PNGs
Iterations: 260K so far

But my other PC with a 3090 (efficientnet v2_l), same settings, is already at loss 0.014 after only 67K iterations. So what should I do?

One more question: the RTX 3080 can't use efficientnet v2_l because it runs out of memory, even with the batch size lowered to 2 and the learning rate at 3.5e. How can I make it work? The 3090's results are much better than the 3080's.

thank you.

torzdf
Posts: 2665
Joined: Fri Jul 12, 2019 12:53 am
Answers: 159
Has thanked: 131 times
Been thanked: 625 times

Re: Loss can't go down, spent over 48 hrs?

Post by torzdf »

adam_macchiato wrote: Tue Jul 26, 2022 5:37 am

Hi, I am using Phaze-A for training with the StoJo preset. It has been training for 4 days, but the loss has been stuck around 0.017-0.016 for the last 48 hours and still won't go down.

Here are my settings:

RTX 3080
Phaze-A StoJo preset (efficientnet v2_s)
Learning rate: 4.5e, EE: -5
Mixed Precision and NaN protection on
Face A and B: both over 10K PNGs
Iterations: 260K so far

But my other PC with a 3090 (efficientnet v2_l), same settings, is already at loss 0.014 after only 67K iterations. So what should I do?

Don't compare loss values between different model settings. They are not directly comparable; the raw numbers are meaningless. All that matters is that they are going down. You cannot normally tell this from the Graph tab, as it is zoomed out too far, but the graph pop-out in the Analysis page will let you zoom in on the last 10,000 or so iterations and look at the rolling average. See here: viewtopic.php?t=146#monitor
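
If you want to sanity-check the trend outside the GUI, here is a minimal sketch of the idea: smooth the raw loss with a rolling average and compare the current value against a few thousand iterations ago. This is not part of faceswap itself; the "loss_a.csv" file name and the "loss" column are placeholders for however you export your session stats.

# Minimal sketch (not part of faceswap): smooth the loss with a rolling
# average to see whether it is still trending down. Assumes you have
# exported per-iteration loss values to a CSV -- "loss_a.csv" and the
# "loss" column are placeholders for whatever your export contains.
import csv

WINDOW = 500  # number of iterations to average over

losses = []
with open("loss_a.csv", newline="") as handle:
    for row in csv.DictReader(handle):
        losses.append(float(row["loss"]))

if len(losses) < WINDOW:
    raise SystemExit(f"Need at least {WINDOW} loss values to smooth")

# Rolling average: entry i covers the WINDOW iterations ending at i.
rolling = [sum(losses[i - WINDOW:i]) / WINDOW
           for i in range(WINDOW, len(losses) + 1)]

# Compare the smoothed loss now against ~10k iterations ago.
lookback = min(10_000, len(rolling) - 1)
print(f"rolling avg {lookback} iterations ago: {rolling[-1 - lookback]:.5f}")
print(f"rolling avg now:                       {rolling[-1]:.5f}")

If the second number is still (even slightly) lower than the first, the model is still learning and you just need patience.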

One more question: the RTX 3080 can't use efficientnet v2_l because it runs out of memory, even with the batch size lowered to 2 and the learning rate at 3.5e. How can I make it work? The 3090's results are much better than the 3080's.

Learning rate does not impact VRAM usage. However, efficientnet v2_l is a big model, so I am not surprised it runs out of VRAM. It also has an input size of 448px, which is probably far larger than you require. Try using v2_s and lowering the encoder scaling.
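
For a rough sense of why the input size matters so much: activation memory in a convolutional encoder grows roughly with the square of the input resolution, so dropping the input size (or the encoder scaling) frees VRAM quickly. A back-of-envelope sketch, purely illustrative; the smaller sizes below are example values, not the exact defaults of any preset.

# Back-of-envelope illustration only, not an exact VRAM calculation:
# activation memory in a convolutional encoder grows roughly with
# input height x width, so a 448px input is far heavier than a smaller one.
def relative_activation_cost(input_px: int, reference_px: int = 448) -> float:
    """Activation footprint of an input_px input relative to reference_px."""
    return (input_px ** 2) / (reference_px ** 2)

# Example sizes, not the defaults of any particular preset.
for size in (448, 384, 320, 256):
    print(f"{size}px input ~ {relative_activation_cost(size):.2f}x "
          f"the activations of a 448px input")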

My word is final
