Distributed with Dual 2060 supers



Re: Distributed with Dual 2060 supers

Post by djandg »

15 to 25 minutes? I would say no, that's not normal. A few minutes at most.


Re: Distributed with Dual 2060 supers

Post by dheinz70 »

After further testing, it looks like all my problems come from the DFL-SAE model. It will only train on GPU1; training on GPU0 or distributed fails.

Training Villain distributed with a batch of 16 right now.

Thanks for all the help.


Re: Distributed with Dual 2060 supers

Post by abigflea »

OK, I really don't know what's going on.
I just upgraded; I now have 2x 2070 and 2x 1070.
The 2070s running distributed are no problem.

FYI, I had a somewhat unrelated driver issue that snowballed when I tried to fix it (purge and reinstall).
Long story short, fresh install of Ubuntu.

They train like a charm on the NVIDIA 450 drivers.

Those "failed to allocate X.XG" messages are just it trying to allocate chunks of memory.
Leaving Allow Growth on won't hurt anything. It's an issue with the NVIDIA drivers (and maybe TF), and as Bryan and torzdf have said, there is 'no rhyme or reason for the requirement'; after 8 months I 100% agree.
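If you're curious what that option maps to under the hood, my understanding is it's essentially TensorFlow's per-GPU memory-growth setting. A minimal sketch of the idea (not Faceswap's actual code):

Code:

import tensorflow as tf

# Ask TF to grab VRAM in small chunks as needed instead of reserving it all up front.
# Has to run before the GPUs are initialised.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)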

I'll fool around with the other models and see if I can recreate your crashes.

:o I dunno what I'm doing :shock:
2X RTX 3090 : RTX 3080 : RTX: 2060 : 2x RTX 2080 Super : Ghetto 1060


Re: Distributed with Dual 2060 supers

Post by dheinz70 »

Due to snapping the SATA connector off my HDD, I'm also on a fresh install of Ubuntu 20.04 with the 450 driver.

My only problem is the DFL-SAE model (Allow Growth is checked everywhere it is an option).

All other models seem to work fine.


Re: Distributed with Dual 2060 supers

Post by abigflea »

Revisiting this post, I was wondering if you had it sorted?

I have noticed that the setup time on distributed takes longer if you are using slower PCIe slots,
for example, if one card is running at x16 and the other at x4.

In my case, I had rearranged my hardware configuration so that one card was at PCIe x8 (typical) and the other was at PCIe x4, and I noticed a big slowdown.
Switching back so the 2x 2070 were at PCIe x8 gives a very reasonable 2-minute delay before training begins.



Re: Distributed with Dual 2060 supers

Post by dheinz70 »

Still having the issue. I checked the specs, and my motherboard has 2 PCI Express 2.0 x16 slots. If I train on either GPU alone it screams; if I use distributed, the performance is awful. Training on one GPU is twice as fast as training on two. So...

A batch of 8 on Villain on a single GPU is giving me 27.6 EGs/sec.

A batch of 16 on distributed (I assume it trains 8 on each) gives me 25.1 EGs/sec. Shouldn't it be roughly double the EGs of a single card, minus some overhead?

It takes about 10 minutes to start on distributed, 2 minutes to start on a single GPU.
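For what it's worth, as far as I can tell the distributed option sits on top of TensorFlow's MirroredStrategy, which splits the global batch across the GPUs and all-reduces the gradients every step. A rough, hypothetical sketch of that idea (not Faceswap's actual code):

Code:

import tensorflow as tf

# One replica per visible GPU; gradients are all-reduced across them each step.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)   # e.g. 2 with two cards

global_batch = 16                                            # the batch size set in the GUI
per_replica = global_batch // strategy.num_replicas_in_sync  # 8 images per card

with strategy.scope():
    # Toy model, just to show where the model has to be built under the strategy.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer="adam", loss="mse")

If that's right, the per-step gradient sync over PCIe would explain why a slow link drags both cards down.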


Re: Distributed with Dual 2060 supers

Post by abigflea »

I would suspect it should be higher.
I usually get about 160% of a single card's EGs/sec on Linux.

OK, now I'm really curious.
Look in GPU-Z (Windows) or nvtop (Linux) and see what speeds the cards are actually communicating at over their respective PCIe slots during training.

It may say PCIe x16 Gen 2, or PCIe x4 Gen 3, something like that. The generation it communicates at can go up and down; I think it's some power-saving feature.

I may be chasing a wild goose, but I'm curious.
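If you'd rather script it than eyeball nvtop, something like this should report the link state per card. A sketch using the pynvml package (assuming you have it installed, e.g. pip install pynvml; nvidia-smi can report the same fields):

Code:

import pynvml

# Print the current vs. maximum PCIe link generation and width for each GPU.
pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    cur_gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(handle)
    max_gen = pynvml.nvmlDeviceGetMaxPcieLinkGeneration(handle)
    cur_w = pynvml.nvmlDeviceGetCurrPcieLinkWidth(handle)
    max_w = pynvml.nvmlDeviceGetMaxPcieLinkWidth(handle)
    print(f"GPU {i} {name}: x{cur_w} Gen {cur_gen} (max x{max_w} Gen {max_gen})")
pynvml.nvmlShutdown()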



Re: Distributed with Dual 2060 supers

Post by dheinz70 »

Hmmm, you might be on to something. This is showing only PCIe x8 when they are both in use. The specs on my motherboard show 2x x16...

[Attachment: Screenshot from 2020-10-17 16-22-51.png]

Now with just GPU1 doing the work....

[Attachment: Screenshot from 2020-10-17 16-27-45.png]

Still x8.


Re: Distributed with Dual 2060 supers

Post by abigflea »

x8 doesn't surprise me, but only Gen 2 does.

Mine all run at PCIe x8 Gen 3 (or, if I put in just a single card, the top slot will run x16).

You need an extra-fancy motherboard to support Gen 3 at PCIe x16 on all slots.



Re: Distributed with Dual 2060 supers

Post by dheinz70 »

The MSI site says 2x x16. Other sites show 1x x16 and 1x x8, which would explain the drop down to x8. Well, I'm going to take out one of the cards and see if a single card runs at x16.

-edit-
Yep, one card shows x16.

[Attachment: Screenshot from 2020-10-17 17-54-10.png]

Time to start saving up for a Ryzen.....


Re: Distributed with Dual 2060 supers

Post by abigflea »

If it's that, then yeah, get yourself a nice B550 or X570 chipset that screams on PCIe.
I'm going to test this real quick; I can force mine to x4.



Re: Distributed with Dual 2060 supers

Post by dheinz70 »

PCIe 2.0 x8 should be somewhere near 4 GB/sec. Does Faceswap really use that much?
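Quick back-of-the-envelope check on that number, just my own arithmetic:

Code:

# PCIe 2.0 runs at 5 GT/s per lane with 8b/10b encoding, so ~500 MB/s usable per lane.
lanes = 8
per_lane_mb_s = 5000 * (8 / 10) / 8            # 5 GT/s -> 4000 Mbit/s -> 500 MB/s
print(lanes * per_lane_mb_s / 1000, "GB/s")    # ~4.0 GB/s for an x8 link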


Re: Distributed with Dual 2060 supers

Post by abigflea »

OK, this was for some reason a painful test. It doesn't help much with the startup question, but here it is.

Question: are you increasing your batch size when using distributed? It should allow a roughly 85% higher batch, and EGs/sec goes up. I'm grasping at straws with this one.

Test results: Villain model, 2x 2070, batch=26, 950 iterations per test.

Code:

PCIe lanes @ Gen         Startup delay       EGs/sec
______________________________________________________
x8/x8 @ Gen 3                131 sec           60.4
x4/x4 @ Gen 3                145 sec           51.8
x4/x4 @ Gen 2                144 sec           42.0
x4/x4 @ Gen 1                144 sec           29.7

So Gen speed or lane count didn't seem to impact the distributed startup time, but it sure does slow down training.
Training will be limited by the slowest card.
Of course, different motherboard chipsets/models/settings/voodoo/temperature throttling will change the above numbers.



Re: Distributed with Dual 2060 supers

Post by dheinz70 »

Yes, I do lower it to 80% of what one card can handle. It isn't just a startup issue. Depending on the model, it can take 5-10 minutes to start, and it just runs really slowly once it starts. In terms of EGs/s, I'm getting better performance from one GPU doing half as many at a time.

I'm watching nvtop with the single 2060 on Villain at a batch of 10, and I see RX/TX in the 200 MB/s range while getting 32 EGs/s.

Will test a batch of 16 on both......


Re: Distributed with Dual 2060 supers

Post by dheinz70 »

Did a couple thousand iterations to test.

[Attachment: Screenshot from 2020-10-17 21-54-05.png]

A single GPU with a batch of 10 versus dual with a batch of 16: I'm getting better EGs/sec from the single GPU. It took 4 minutes to start on distributed, 1.5 minutes to start on a single GPU.

I watched nvtop and I never saw the RX/TX get pegged at the theoretical 4 GB/sec limit of a PCIe 2.0 x8 link. I did see a few instances where it approached 4 GB/sec, but it never stayed there for more than a blink of an eye.

[Attachment: Screenshot from 2020-10-17 21-52-37.png]

I don't think I'm overloading the FSB or Northbridge. Any ideas?
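In case anyone wants to log the RX/TX numbers outside of nvtop, this is roughly how I'd do it. A sketch using the pynvml package (assuming it's installed; NVML reports the values in KB/s):

Code:

import time
import pynvml

# Poll PCIe RX/TX throughput for GPU 0 once a second, similar to nvtop's RX/TX column.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
for _ in range(10):
    tx = pynvml.nvmlDeviceGetPcieThroughput(handle, pynvml.NVML_PCIE_UTIL_TX_BYTES)
    rx = pynvml.nvmlDeviceGetPcieThroughput(handle, pynvml.NVML_PCIE_UTIL_RX_BYTES)
    print(f"TX {tx / 1024:.1f} MB/s   RX {rx / 1024:.1f} MB/s")
    time.sleep(1)
pynvml.nvmlShutdown()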


Re: Distributed with Dual 2060 supers

Post by abigflea »

With your setup, a 4-minute startup vs my 2.3-minute startup sounds reasonable. I have an X470 chipset.

Mine doesn't hold RX/TX at the max rate non-stop either; each time nvtop takes a sample it jumps around.
Most of mine looked like this, though.

[Attachment: Screenshot from 2020-10-18 05-20-33.png]

I feel the need to be careful about what I'm willing to say.
If I personally had two computers with absolutely identical software setups, I might conclude that distributed training speed is being hindered by the hardware.

When I tried 2x 1070 distributed with one card on a PCIe x1 slot, it took FOREVER to start training, and training was then very slow (5 EGs/sec). If I used a single card on the PCIe x1 slot it wasn't terrible; in fact, it was faster than distributed.

A positive thing I can mention: if you're training at, let's say, 30 EGs/s at batch 10 vs 30 EGs/s at batch 20, the higher-batch model should learn better; fewer iterations and less time to get to the same quality.

You seem to be using Faceswap fine. Do you feel there are any other possible software considerations we could be missing before reaching a hardware conclusion?

Do you think your cards are throttling due to heat? Mine don't go over 71°C.



Re: Distributed with Dual 2060 supers

Post by dheinz70 »

I added Coolbits to my xorg.conf so I could control the fans on the cards. I cranked them all to 100% and the cards ran at 55°C. Same slow results on distributed. The NVIDIA control panel lists 93°C as the slowdown temp. A batch of 12 gave me 23 EGs/s.
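For anyone wanting to do the same, the relevant bit of my xorg.conf looks roughly like this (your Device section's Identifier will differ; setting Coolbits to 4 enables manual GPU fan control in nvidia-settings):

Code:

Section "Device"
    Identifier "Device0"          # whatever your existing Device section is called
    Driver     "nvidia"
    Option     "Coolbits" "4"     # 4 = allow manual fan control via nvidia-settings
EndSection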

[Attachment: Screenshot from 2020-10-18 12-41-24.png]

I'm going to test small batches: I'll run a batch of 2 distributed and a batch of 2 on a single GPU for a while.

If it is a hardware bottleneck, such a small batch should stay well underneath the limits of my machine.


Re: Distributed with Dual 2060 supers

Post by dheinz70 »

2,000 iterations with a batch of 2 each on single and distributed. I doubt it was ever enough data to clog the pipes. My feeling is there is some hardware limitation, but I suspect there is something else going on too.

[Attachment: Screenshot from 2020-10-18 13-28-38.png]

Re: Distributed with Dual 2060 supers

Post by dheinz70 »

Another test with the same results. 2000 iterations on Original, single batch 150, distributed batch 300. Distributed is almost exactly half as efficient.

[Attachment: Screenshot from 2020-10-19 15-49-51.png]

Watching nvtop, the stats never pegged and held there; they averaged about 70% of what the GPUs could handle, so it is unlikely anything was overloaded.

[Attachment: Screenshot from 2020-10-19 14-54-17.png]

Re: Distributed with Dual 2060 supers

Post by torzdf »

I can only assume that this is hardware related, as this is not a widely reported issue.

My word is final

Locked