Erm... Loss A: nan, Loss B: nan - what's going on?

If training is failing to start and you are not receiving an error message telling you what to do, tell us about it here.


Forum rules

Read the FAQs and search the forum before posting a new topic.

This forum is for reporting errors with the Training process. If you want to get tips, or better understand the Training process, then you should look in the Training Discussion forum.

Please mark any answers that fixed your problems so others can find the solutions.


Erm... Loss A: nan, Loss B: nan - what's going on?

Post by yking90 »

I have been training for over 18 hours. It started out like this and it's still the same... what could cause this?

[Screenshot: training output showing Loss A: nan, Loss B: nan]
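For anyone wondering what a nan loss means in practice: once the loss turns nan the model state is garbage, so there is no point letting a run like this continue for hours. Below is a minimal, hypothetical sketch of failing fast using plain tf.keras (not the trainer's actual code; the model and data are dummies just to make it runnable):

    import numpy as np
    import tensorflow as tf

    # Dummy data just to make the example self-contained.
    x_train = np.random.rand(256, 4).astype("float32")
    y_train = np.random.rand(256, 1).astype("float32")

    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")

    # TerminateOnNaN aborts the fit loop the moment the loss becomes nan,
    # so a bad run fails fast instead of burning hours.
    model.fit(x_train, y_train, epochs=10,
              callbacks=[tf.keras.callbacks.TerminateOnNaN()])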


Re: Erm... Loss A: nan, Loss B: nan - what's going on?

Post by yking90 »

Damn! I think I figured it out...

I was using distributed training across a 3090 and a 2080...

Restarted it by excluding the 3090 and it's working.

FYI for anyone else who runs into this :)
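The idea behind "excluding the 3090" is simply to hide that card from TensorFlow before any graph is built. The trainer has its own options for choosing which GPUs to use; the sketch below only illustrates the underlying TensorFlow calls (the device order is an assumption and depends on your system):

    import tensorflow as tf

    # List every physical GPU CUDA can see; the order is system-dependent,
    # so check which index is the card you want to drop.
    gpus = tf.config.list_physical_devices("GPU")
    print(gpus)

    # Keep everything except the first card: that GPU becomes invisible to
    # this process, so no graph is ever placed on it. This must run before
    # any op touches the GPUs.
    if len(gpus) > 1:
        tf.config.set_visible_devices(gpus[1:], "GPU")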


Re: Erm... Loss A: nan, Loss B: nan - what's going on?

Post by bryanlyon »

Actually, see app.php/faqpage#f3r6. It's a somewhat common occurrence, though using two different generations of GPU may make it more likely.


Re: Erm... Loss A: nan, Loss B: nan - what's going on?

Post by yking90 »

I'm assuming that once we get to a point where 30xx cards are supported, using the power of both GPUs would actually be beneficial?


Re: Erm... Loss A: nan, Loss B: nan - what's going on?

Post by bryanlyon »

You can already use multiple GPUs; that's what the distributed feature is for. Mixing generations (30xx and 20xx), however, will probably never be worthwhile.
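A rough sketch of what a distributed setup does under the hood, assuming plain TensorFlow's MirroredStrategy rather than the trainer's actual implementation: the same model is replicated onto every listed device and the gradients are synchronised each step.

    import numpy as np
    import tensorflow as tf

    # Replicate the same graph onto two (same-generation) cards and
    # synchronise the gradients every step.
    strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])

    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
        model.compile(optimizer="adam", loss="mse")

    # Dummy data just to make the example runnable.
    x = np.random.rand(256, 4).astype("float32")
    y = np.random.rand(256, 1).astype("float32")

    # Each step runs on both replicas and waits for both to finish, which
    # is why the pair can only move at the speed of its slowest member.
    model.fit(x, y, epochs=2, batch_size=32)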


Re: Erm... Loss A: nan, Loss B: nan - what's going on?

Post by yking90 »

Could you explain why that is?


Re: Erm... Loss A: nan, Loss B: nan - what's going on?

Post by bryanlyon »

Because different GPU generations work differently under the covers. Mixing GPUs causes separate graphs to be created for each card. First, the slowest GPU sets the pace, so the faster GPU spends its time waiting on the slower one. Second, if the graphs diverge at any point, you get a failure like the one you saw. There is really no way to avoid that, other than running on a single GPU generation.
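If you want that divergence to surface immediately rather than as a silent nan loss, TensorFlow's numeric checking can flag the first op that produces an inf or nan. This is a generic debugging sketch, not something the trainer enables for you:

    import tensorflow as tf

    # Instrument every op so the first one that outputs an inf or nan
    # raises an error naming itself, instead of the nan quietly
    # propagating all the way into the loss value.
    tf.debugging.enable_check_numerics()

    x = tf.constant([1.0, 0.0])
    y = tf.math.log(x)  # log(0) = -inf -> raises InvalidArgumentError here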
