15 to 25 minutes? I would say no, that's not normal. A few minutes at most.
Distributed with Dual 2060 supers
Re: Distributed with Dual 2060 supers
After further testing, it looks like all my problems come from the DFL-SAE model. It will only train on GPU1; training on GPU0 or distributed fails.
Training Villain with a batch of 16 right now distributed.
Thanks for all the help.
Re: Distributed with Dual 2060 supers
Ok, I really don't know what's going on.
I just upgraded; I now have 2x 2070 and 2x 1070.
The 2070s running distributed are no problem.
FYI, I had a somewhat unrelated driver issue that snowballed when I tried to fix it (purge and reinstall).
Long story short, fresh install of Ubuntu.
They train like a charm. Nvidia 450 drivers.
Those "failed to allocate X.XG" messages are just the framework trying to allocate chunks of memory.
Leaving Allow Growth on won't hurt anything. It's an issue with the NVIDIA drivers (and maybe TF), and as Bryan and torzdf have said, there is 'no rhyme or reason for the requirement'; after 8 months I 100% agree.
I'll fool around with the other models and see if I can recreate your crashes.
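For context, faceswap's "Allow growth" option roughly corresponds to TensorFlow's per-GPU memory-growth setting. A minimal sketch of what that does under the hood, assuming TF 2.x (the exact plumbing inside faceswap may differ):

```python
# Minimal sketch: what "Allow growth" roughly corresponds to in TensorFlow 2.x.
import tensorflow as tf

for gpu in tf.config.list_physical_devices("GPU"):
    # Grab VRAM incrementally instead of reserving the whole card up front;
    # this is the behavior that sidesteps those "failed to allocate X.XG" messages.
    tf.config.experimental.set_memory_growth(gpu, True)
```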
I dunno what I'm doing
2x RTX 3090 : RTX 3080 : RTX 2060 : 2x RTX 2080 Super : Ghetto 1060
Re: Distributed with Dual 2060 supers
Due to snapping off the SATA connector on my HDD, I'm on a fresh install of Ubuntu 20.04 with the 450 driver as well.
My only problem is the DFL-SAE model. (Allow growth checked everywhere it is an option)
All other models seem to work fine.
Re: Distributed with Dual 2060 supers
Revisiting this post, I was wondering if you had it sorted?
I have noticed the setup time on distributed takes longer if you are using slower PCIe slots.
For example, if one card is running at x16 and the other is running at x4.
In my case, I had rearranged my hardware configuration, and one card was at PCIe x8 (typical) and the other was at PCIe x4. I noticed a big slowdown.
Switching back so that the 2x 2070 were both at PCIe x8 gave a very reasonable 2 min delay before training begins.
Re: Distributed with Dual 2060 supers
Still having the issue. I checked the specs, and my MB has 2 PCI Express 2.0 x16 slots. If I train on either GPU alone, it screams. If I use distributed, the performance is awful: training on one GPU is twice as fast as training on two. So...
A batch of 8 on Villain on a single GPU gives me 27.6 EGs/sec.
A batch of 16 distributed (I assume it trains 8 on each) gives me 25.1 EGs/sec. Shouldn't it be roughly double the EGs of a single GPU, minus some overhead?
It takes about 10 mins to start distributed, 2 mins to start on a single GPU.
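As a sanity check on those numbers, a quick back-of-the-envelope calculation (assuming the distributed batch really is split 8 per GPU):

```python
# Rough scaling check using the numbers quoted above.
single_egs = 27.6       # EGs/sec, batch 8 on one GPU
distributed_egs = 25.1  # EGs/sec, batch 16 across two GPUs

ideal_egs = 2 * single_egs                 # perfect scaling would double throughput
efficiency = distributed_egs / ideal_egs   # fraction of ideal actually achieved

print(f"ideal: {ideal_egs:.1f} EGs/sec, actual: {distributed_egs} "
      f"({efficiency:.0%} scaling efficiency)")
```

Under 50% efficiency means distributed is strictly worse than one card; healthy two-card setups are reported later in this thread at around 160% of a single GPU.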
Re: Distributed with Dual 2060 supers
I would suspect it should be higher.
I usually get about 160% of a single GPU's EGs on Linux.
Ok, now I'm really curious.
Look in GPU-Z (Windows) or nvtop (Linux) and see what speeds they are actually communicating at over their respective PCIe slots during training.
It may say PCIe x16 Gen 2, or PCIe x4 Gen 3, something like that. The generation it communicates at can go up and down; I think it's some power-saving feature.
I may be chasing a wild goose, but I'm curious.
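If you don't want to keep nvtop open, the same link information can be pulled from the command line. A sketch assuming the standard NVIDIA driver's nvidia-smi tool is installed:

```shell
# Show the PCIe generation/width each GPU is *currently* linked at vs. its maximum.
# A current value below the max usually means power saving or a slower physical slot.
nvidia-smi --query-gpu=name,pcie.link.gen.current,pcie.link.width.current,pcie.link.gen.max,pcie.link.width.max --format=csv
```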
Re: Distributed with Dual 2060 supers
Hmm, you might be on to something. This is showing only PCIe x8 when they are both in use. The specs on my MB show 2x x16...
Now with just GPU1 doing the work....
Still 8x
Re: Distributed with Dual 2060 supers
x8 doesn't surprise me, but only Gen 2 does.
Mine all run at PCIe x8 Gen 3. (Or if I put in just a single card, then the top one will run at x16.)
You need an extra-fancy MB to support Gen 3 at PCIe x16 on all slots.
Re: Distributed with Dual 2060 supers
The MSI site says 2x x16. Other sites show 1x x16 and 1x x8, which would explain the drop down to x8. Well, I'm going to take out one of the cards and see if the single runs at x16.
-edit-
Yep, one card shows 16x.
Time to start saving up for a Ryzen.....
Re: Distributed with Dual 2060 supers
If it's that, then yeah, get yourself a nice B550 or X570 chipset that screams on PCIe.
I'm going to test this real quick; I can force mine to x4.
Re: Distributed with Dual 2060 supers
PCIe 2.0 x8 should be somewhere near 4 GB/sec. Does faceswap really use that much?
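The 4 GB/s ballpark checks out, though the unit is gigabytes, not gigabits. A quick sketch of where the figure comes from, using the standard PCIe 2.0 signaling numbers (5 GT/s per lane with 8b/10b encoding):

```python
# Back-of-the-envelope usable bandwidth for a PCIe 2.0 x8 link.
transfers_per_sec = 5e9          # PCIe 2.0: 5 GT/s per lane
# 8b/10b encoding: 8 data bits carried per 10 bits on the wire.
bytes_per_lane = transfers_per_sec * 8 / 10 / 8   # -> 500 MB/s usable per lane

lanes = 8
total_gb_per_sec = bytes_per_lane * lanes / 1e9

print(f"PCIe 2.0 x{lanes}: {total_gb_per_sec:.1f} GB/s per direction")  # 4.0 GB/s
```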
Re: Distributed with Dual 2060 supers
Ok, this was for some reason a painful test. It doesn't help with the startup question so much, but:
Question: are you increasing your batch size when using distributed? It should allow a roughly 85% higher batch, and EGs/sec goes up. I'm grasping at straws with this one.
Test results: Villain model, 2x 2070, batch=26, 950 iterations per test.
PCIe lanes @ Gen    Startup delay    EGs/sec
--------------------------------------------
8x8 @ Gen 3         131 sec          60.4
4x4 @ Gen 3         145 sec          51.8
4x4 @ Gen 2         144 sec          42.0
4x4 @ Gen 1         144 sec          29.7
So Gen speed and lane count didn't seem to impact distributed startup time, but they sure do slow down training.
Training will be limited by the slowest card.
Sure, different MB chipsets/models/settings/voodoo/temperature throttling will change the above numbers.
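Pairing those measured EGs/sec with the approximate per-direction link bandwidth of each config makes the trend easy to see. The per-lane figures below are the commonly quoted usable rates (my assumption, not measured here): Gen 1 ~250 MB/s, Gen 2 ~500 MB/s, Gen 3 ~985 MB/s per lane.

```python
# Approximate usable PCIe bandwidth per lane, per direction (MB/s).
PER_LANE_MB = {1: 250, 2: 500, 3: 985}

tests = [  # (lanes per card, PCIe gen, measured EGs/sec from the table above)
    (8, 3, 60.4),
    (4, 3, 51.8),
    (4, 2, 42.0),
    (4, 1, 29.7),
]

for lanes, gen, egs in tests:
    bw = lanes * PER_LANE_MB[gen] / 1000  # GB/s per card
    print(f"x{lanes} Gen {gen}: ~{bw:.1f} GB/s -> {egs} EGs/sec")
```

Throughput falls monotonically with link bandwidth, but not proportionally, which fits the idea that GPU compute still does most of the work and the bus only adds overhead.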
Re: Distributed with Dual 2060 supers
Yes, I do lower to 80% of what one card can handle. It isn't just a startup issue: depending on the model, it can take 5-10 mins to start, and it just runs really slowly once it starts. In terms of EGs/sec, I'm getting better performance from one GPU doing half as many at a time.
I'm watching nvtop with the single 2060, Villain batch of 10, and I see RX/TX in the 200 MB/s range, getting 32 EGs/sec.
Will test a batch of 16 on both...
Re: Distributed with Dual 2060 supers
Did a couple thousand iterations to test.
A single GPU with a batch of 10, dual with a batch of 16: I'm getting better EGs/sec from the single GPU. It took 4 mins to start distributed, 1.5 mins on a single GPU.
I watched nvtop and never saw the RX/TX pegged at the theoretical 4 GB/sec limit of PCIe 2.0 x8. I did see a few instances where it approached 4 GB/sec, but it never stayed there for more than a blink of an eye.
I don't think I'm overloading the FSB or northbridge. Any ideas?
Re: Distributed with Dual 2060 supers
With your setup, a 4 min startup vs my 2.3 min startup sounds reasonable. I have an X470 chipset.
RX/TX doesn't sit at the max rate nonstop every time nvtop displays a sample; sure, it jumps around.
Although most of mine looked like this.
I need to be careful about what I'm willing to claim.
If I personally had two computers with absolutely identical software setups, I might conclude that training speed on distributed is being hindered by hardware.
When I tried 2x 1070 distributed with one on a PCIe x1 slot, it took forever to start training, and training was then very slow (5 EGs/sec). If I used a single card on the PCIe x1 slot... it wasn't terrible. In fact, faster than distributed.
A positive thing I can mention: if you're training at, say, 30 EGs/sec with batch 10 vs 30 EGs/sec with batch 20, the higher-batch model should learn better: fewer iterations/less time to get to the same quality.
You seem to be using faceswap fine. Do you feel there are any other possible software considerations we could be missing before reaching a hardware conclusion?
Do you think your cards are throttling due to heat? Mine don't go over 71°C.
Re: Distributed with Dual 2060 supers
I added Coolbits to my xorg.conf so I could control the fans on the cards. I cranked them all to 100% and the cards ran at 55°C. Same slow results on distributed. The NVIDIA control panel lists 93°C as the slowdown temp. A batch of 12 gave me 23 EGs/sec.
I'm going to test small batches. If it is a hardware bottleneck, I'll run a batch of 2 distributed and a batch of 2 single for a while.
If it is the hardware, such a small batch should stay well underneath the limits of my machine.
Re: Distributed with Dual 2060 supers
2000 iterations with a batch of 2 each, on single and distributed. I doubt it was ever enough data to clog the pipes. My feeling is there is some hardware limitation, but I suspect there is something else going on too.
Re: Distributed with Dual 2060 supers
Another test with the same results. 2000 iterations on Original: single batch 150, distributed batch 300. Distributed is almost exactly half as efficient.
Watching nvtop, the stats never pegged and held; they averaged about 70% of what the GPUs could handle. So it is unlikely anything was overloaded.