Erm... Loss A: nan, Loss B: nan - what's going on?
Have been training for over 18 hours - it started out like this and it's still the same... what could cause this?
Re: Erm... Loss A: nan, Loss B: nan - what's going on?
Damn! I think I figured it out...
I was using Distributed training across a 3090 and a 2080...
Restarted it with the 3090 excluded and it's working.
FYI for anyone else who runs into this.
- bryanlyon
- Site Admin
- Posts: 793
- Joined: Fri Jul 12, 2019 12:49 am
- Location: San Francisco
- Has thanked: 4 times
- Been thanked: 218 times
- Contact:
Re: Erm... Loss A: nan, Loss B: nan - what's going on?
Actually, see app.php/faqpage#f3r6. NaN losses are a somewhat common occurrence, though using two different generations of GPU may make them more likely.
Re: Erm... Loss A: nan, Loss B: nan - what's going on?
I'm assuming that once we get to the point where 3x cards are supported, using the combined power of two GPUs would actually be beneficial?
- bryanlyon
- Site Admin
Re: Erm... Loss A: nan, Loss B: nan - what's going on?
You can already do multiple GPU. That's what the distributed feature is for. Mixing generations (30xx and 20xx) will probably never be worthwhile however.
- bryanlyon
- Site Admin
Re: Erm... Loss A: nan, Loss B: nan - what's going on?
Because different GPU generations work differently under the covers. Mixing GPUs causes multiple graphs to be created. First, the slowest GPU sets the pace, so the faster GPU is always waiting on the slower one. Second, if the graphs happen to diverge at any point, you get a failure like the one you saw. There is really no way to avoid that, except by running on a single GPU generation.
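One practical takeaway from this thread: once a loss goes NaN it never recovers, so the 18 hours of training above were wasted after the very first bad step. A minimal sketch of a NaN guard for a training loop, in plain Python; `train_step` here is a hypothetical stand-in for whatever your trainer's step function is, not part of any real faceswap API:

```python
import math


def is_bad(loss):
    """Return True if a loss value is NaN or infinite."""
    return math.isnan(loss) or math.isinf(loss)


def train(steps, train_step):
    """Run training, aborting as soon as either loss goes NaN/inf.

    train_step is a hypothetical callable returning (loss_a, loss_b)
    as plain floats.
    """
    for step in range(steps):
        loss_a, loss_b = train_step()
        if is_bad(loss_a) or is_bad(loss_b):
            # Bail out immediately: continuing to train on NaN weights
            # cannot recover, so fail loudly instead of running for hours.
            raise RuntimeError(
                f"NaN/inf loss at step {step}: "
                f"Loss A={loss_a}, Loss B={loss_b}. "
                "Stop and check your GPU setup instead of training on."
            )
    return "ok"
```

Catching the condition at the step where it first appears would have turned an 18-hour wasted run into an immediate, debuggable error.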