Distributed with Dual 2060 supers



Re: Distributed with Dual 2060 supers

Post by bryanlyon »

I can say that yes, FS can EASILY saturate 4 GB/sec if you're using distributed. It has to sync the entire gradient set every batch, as well as the actual model. Basically, assume that one entire card's GPU RAM gets sent to the other card every single batch. At 27 EGs/sec with a batch size of 16 you're syncing across the cards about 1.7 times per second (27 ÷ 16), which makes perfect sense to me.
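As a rough back-of-the-envelope sketch (the parameter count below is an illustrative assumption, not a measured Faceswap figure), the gradient traffic alone works out something like this:

Code: Select all

# Rough estimate of GPU-to-GPU traffic for mirrored/distributed training.
# The parameter count is an assumption for illustration only.
params = 150e6           # assumed model parameter count
bytes_per_param = 4      # float32 gradients
egs_per_sec = 27         # examples/sec reported by the trainer
batch_size = 16          # total batch split across both GPUs

syncs_per_sec = egs_per_sec / batch_size       # ~1.7 gradient syncs per second
bytes_per_sync = params * bytes_per_param      # ~0.6 GB of gradients per sync
gb_per_sec = syncs_per_sec * bytes_per_sync / 1e9

print(f"{syncs_per_sec:.2f} syncs/sec, ~{gb_per_sec:.2f} GB/sec of gradient traffic")

Add the model weights and optimizer state on top of that and the bus gets saturated very quickly.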

There are really only two possibilities: drivers and hardware. Hardware seems the most likely. Can you tell me EXACTLY what model of motherboard you have? If it's not a motherboard designed for multiple GPUs, it's quite possible that the second GPU is running off the southbridge, which could SEVERELY impact operation. You MIGHT be able to use an NVLink bridge to improve the card-to-card communication, but I see conflicting information on whether the 2060 Super supports NVLink.
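One quick way to check what PCIe link each card has actually negotiated (a small helper sketch, not part of Faceswap; it just shells out to nvidia-smi):

Code: Select all

# Print the current and maximum PCIe generation/width for each GPU.
# Requires the NVIDIA driver (nvidia-smi must be on the PATH).
import subprocess

query = ("name,pcie.link.gen.current,pcie.link.width.current,"
         "pcie.link.gen.max,pcie.link.width.max")
result = subprocess.run(
    ["nvidia-smi", f"--query-gpu={query}", "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
# A card reporting Gen 2 or a narrow width (e.g. x8 or x4) will choke
# the gradient syncs between the GPUs.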

Drivers can be VERY finicky when using low-end cards in multi-GPU systems. Several users can tell you how they've struggled to use multiple GPUs while one GPU would work fine. I would recommend trying the driver included in the direct download of CUDA, as that tends to be the most stable for me, BUT others have reported the opposite, and it requires removing any existing drivers, which could cause other issues.


Re: Distributed with Dual 2060 supers

Post by dheinz70 »

The fact that it drops down to 8x8 tells me it's probably mostly hardware. I just thought it was weird that running two cards is almost exactly half as productive.

[Attachment: Screenshot from 2020-10-19 18-49-07.png]

Re: Distributed with Dual 2060 supers

Post by bryanlyon »

Yes, this motherboard only has PCI-E Gen 2. That is a HUGE bottleneck and is almost certainly the cause of your problems. I'd say that unless you can get NVLink working, you're only going to get reasonable speeds from one GPU. If you're willing to spend some money, you might be able to get NVLink working, which would move the GPU-to-GPU communication over to that bridge, but it may or may not work on your cards.

I think it's important to note that your CPU is more than powerful enough and you might even be able to run 2 separate trainings at once (one on each card).
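If you go the two-separate-sessions route, a minimal sketch of what that could look like is below. The faceswap.py arguments and directory names are illustrative placeholders, so adapt them to your own paths and trainer settings; the key idea is pinning each process to one card with CUDA_VISIBLE_DEVICES.

Code: Select all

# Launch two independent Faceswap training sessions, one per GPU.
# Paths and arguments are placeholders for illustration.
import os
import subprocess

jobs = [
    {"gpu": "0", "model_dir": "model_a"},   # hypothetical model directories
    {"gpu": "1", "model_dir": "model_b"},
]

procs = []
for job in jobs:
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = job["gpu"]   # each process only sees one card
    procs.append(subprocess.Popen(
        ["python", "faceswap.py", "train",
         "-A", "faces_a", "-B", "faces_b", "-m", job["model_dir"]],
        env=env,
    ))

for p in procs:
    p.wait()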

But yeah, that motherboard is definitely the cause of your issues.


Re: Distributed with Dual 2060 supers

Post by abigflea »

Oh! That's almost my old computer. I had an 8300 and a 990FX mainboard.
Yeah, FS really didn't care for any multi-GPU stuff.

I was able to do two separate training sessions just fine. Had 16 GB of RAM.

:o I dunno what I'm doing :shock:
2x RTX 3090 : RTX 3080 : RTX 2060 : 2x RTX 2080 Super : Ghetto 1060


Re: Distributed with Dual 2060 supers

Post by sainivedant41 »

dheinz70 wrote: Sat Oct 17, 2020 7:31 pm

Still having the issue. I checked the specs, and my motherboard has 2 PCI Express 2.0 x16 slots. If I train on either GPU alone it screams. If I use distributed, the performance is awful. Training on one GPU is twice as fast as training on two. So...

A batch of 8 on Villain on a single GPU is giving me 27.6 EGs/sec.

A batch of 16 on distributed (I assume it trains 8 on each) gives me 25.1 EGs/sec. Shouldn't it be roughly double the EGs of a single card, minus some overhead?

It takes about 10 minutes to start on distributed, 2 minutes to start on a single GPU.

Damn, I am having a similar kind of issue. I have searched all over the internet and posted on a number of threads on different forums, but no solution seems to work. I am really frustrated. Can anyone here help me resolve this issue? I am very tired of it now.


Re: Distributed with Dual 2060 supers

Post by bryanlyon »

sainivedant41 wrote: Sat Dec 26, 2020 1:39 pm

Damn, I am having a similar kind of issue... can anyone here help me resolve this issue?

It's likely the same problem as the previous person in this thread. Multi-GPU training requires a massive amount of communication between the GPUs, so its speed depends heavily on your hardware. If you want us to try to diagnose it, start with your speeds and as much information as possible about your hardware configuration so we can see what might be the problem.


Re: Distributed with Dual 2060 supers

Post by dheinz70 »

My 2 cents on using two GPUs is that you really, really need a high-end motherboard. I upgraded to a Ryzen 7 and an X570 mobo and dual GPUs are still flaky. I'm using Linux, and I get maybe a 20% increase in EGs/sec using dual 2060 Supers over just using one. SO... two GPUs with a batch of 16 only run about 20% more EGs/sec than a single card running a batch of 8. In my book that really isn't enough to justify buying a second card.
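To put that in perspective, here is the scaling math spelled out (relative numbers only, taken from the rough figures above, not formal benchmarks):

Code: Select all

# Relative throughput of the dual-GPU setup described above.
single_gpu = 1.0        # one card, batch of 8 (baseline)
dual_gpu = 1.2          # two cards, batch of 16, ~20% faster

ideal = 2.0             # perfect scaling would double throughput
efficiency = dual_gpu / ideal
print(f"Scaling efficiency: {efficiency:.0%} of ideal")   # -> 60%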

My suggestion to anyone who doesn't have a VERY high-end computer is to spend your money on ONE card, especially if this is just a hobby. YMMV.
