
Erm... Loss A: nan, Loss B: nan - what's going on?

Posted: Tue Jan 26, 2021 4:35 am
by yking90

Have been training for over 18 hours - it started out like this and it's still the same... what could cause this?

[Screenshot: training console output showing Loss A: nan, Loss B: nan]


Re: Erm... Loss A: nan, Loss B: nan - what's going on?

Posted: Tue Jan 26, 2021 7:03 am
by yking90

Damn! I think I figured it out...

I was using distributed training across a 3090 and a 2080....

Restarted it with the 3090 excluded and it's working.

FYI for anyone else who may run into it :)
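
For reference, this is roughly how you can hide one card from TensorFlow before training starts (just a sketch - the GPU index is an assumption, check nvidia-smi for your own ordering):

```python
import os

# Hide GPU 0 (the 3090 in my case - the index is an assumption, check nvidia-smi)
# before TensorFlow initialises, so only the 2080 is visible to training.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import tensorflow as tf
print(tf.config.list_physical_devices("GPU"))  # should now report a single GPU
```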


Re: Erm... Loss A: nan, Loss B: nan - what's going on?

Posted: Tue Jan 26, 2021 6:33 pm
by bryanlyon

Actually, this is covered in the FAQ (app.php/faqpage#f3r6). It's a somewhat common occurrence, though using two different generations of GPU may make it more common.


Re: Erm... Loss A: nan, Loss B: nan - what's going on?

Posted: Wed Jan 27, 2021 6:58 pm
by yking90

I'm assuming that once we get to the point where 30xx cards are supported, using the power of both GPUs would actually be beneficial?


Re: Erm... Loss A: nan, Loss B: nan - what's going on?

Posted: Wed Jan 27, 2021 7:22 pm
by bryanlyon

You can already use multiple GPUs. That's what the distributed feature is for. Mixing generations (30xx and 20xx) will probably never be worthwhile, however.
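
On the framework side it's essentially TensorFlow's mirrored strategy; a minimal sketch with a placeholder model (not faceswap's actual networks) looks like this:

```python
import numpy as np
import tensorflow as tf

# Mirror the model across all visible GPUs; each replica runs the same graph
# and gradients are averaged across the cards every step.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Placeholder model standing in for the real encoder/decoder networks.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(128,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Dummy data just to show the call; the global batch is split across replicas.
x = np.random.rand(256, 128).astype("float32")
y = np.random.rand(256, 1).astype("float32")
model.fit(x, y, batch_size=64, epochs=1, verbose=0)
```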


Re: Erm... Loss A: nan, Loss B: nan - what's going on?

Posted: Wed Jan 27, 2021 8:23 pm
by yking90

Could you explain why?


Re: Erm... Loss A: nan, Loss B: nan - what's going on?

Posted: Thu Jan 28, 2021 3:42 pm
by bryanlyon

Because each GPU generation works differently under the covers. Mixing GPUs causes multiple graphs to be created. First, the slowest GPU sets the pace, so the faster GPU is always waiting on the slower one. Second, if the graphs happen to diverge at any point, you get a failure like the one you saw. There is really no way to avoid that (other than running on a single GPU generation).
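
If you want training to bail out as soon as the loss goes NaN instead of grinding on for 18 hours, plain Keras has a callback for that (a generic sketch, not what faceswap does internally):

```python
import tensorflow as tf

# Stops training the moment the reported loss is NaN or inf, so a diverged
# graph fails fast instead of running for hours.
nan_guard = tf.keras.callbacks.TerminateOnNaN()

# model.fit(train_x, train_y, epochs=10, callbacks=[nan_guard])
# (model / train_x / train_y are placeholders for your own setup)
```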