Search found 43 matches

by dheinz70
Sun Oct 18, 2020 1:44 am
Forum: Training Support
Topic: Distributed with Dual 2060 supers
Replies: 46
Views: 16383

Re: Distributed with Dual 2060 supers

PCIe 2.0 x8 should be somewhere near 4 GB/sec. Does Faceswap use that much?
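For reference, a rough back-of-the-envelope sketch of where that 4 GB/s figure comes from, using the published PCIe 2.0 numbers (5 GT/s per lane, 8b/10b encoding):

    # Rough PCIe 2.0 bandwidth estimate from the published spec numbers.
    GT_PER_SEC_PER_LANE = 5.0      # PCIe 2.0 raw rate: 5 gigatransfers/s (= 5 Gbit/s) per lane
    ENCODING_EFFICIENCY = 8 / 10   # 8b/10b encoding: 8 data bits per 10 bits on the wire
    LANES = 8                      # x8 link

    gb_per_sec = GT_PER_SEC_PER_LANE * ENCODING_EFFICIENCY / 8 * LANES  # /8: bits -> bytes
    print(f"PCIe 2.0 x{LANES} ~= {gb_per_sec:.1f} GB/s per direction")  # -> 4.0 GB/s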

by dheinz70
Sat Oct 17, 2020 10:42 pm
Forum: Training Support
Topic: Distributed with Dual 2060 supers
Replies: 46
Views: 16383

Re: Distributed with Dual 2060 supers

The MSI site says 2x x16. Other sites show 1x x16 and 1x x8, which would explain the drop down to x8. Well, gonna take out one of the cards and see if a single card runs at x16.

-edit-
Yep, one card shows 16x.

Attachment: Screenshot from 2020-10-17 17-54-10.png

Time to start saving up for a Ryzen.....

by dheinz70
Sat Oct 17, 2020 9:30 pm
Forum: Training Support
Topic: Distributed with Dual 2060 supers
Replies: 46
Views: 16383

Re: Distributed with Dual 2060 supers

Hmmm, you might be on to something. This is showing only PCIe x8 when both cards are in use. The specs on my motherboard show 2x x16...

Attachment: Screenshot from 2020-10-17 16-22-51.png

Now with just GPU1 doing the work....

Attachment: Screenshot from 2020-10-17 16-27-45.png

Still 8x
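(For anyone hitting the same thing: a quick way to check the negotiated link width without screenshots is to query nvidia-smi. A minimal sketch, assuming nvidia-smi is on the PATH; note that some boards down-train the link at idle, so check while the GPUs are under load:)

    # Query the current PCIe generation and link width of each GPU via nvidia-smi.
    import subprocess

    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current",
         "--format=csv"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)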

by dheinz70
Sat Oct 17, 2020 7:31 pm
Forum: Training Support
Topic: Distributed with Dual 2060 supers
Replies: 46
Views: 16383

Re: Distributed with Dual 2060 supers

Still having the issue. Checked the specs, and my motherboard has 2 PCI Express 2.0 x16 slots. If I train on either GPU alone it screams. If I use distributed, the performance is awful. Training on 1 GPU is twice as fast as training on two. So... a batch of 8 on Villain on a single GPU is giving me 27.6 EGs/sec...
by dheinz70
Mon Oct 12, 2020 10:44 pm
Forum: Training Discussion
Topic: Training Speed on Multi-GPU
Replies: 1
Views: 1400

Training Speed on Multi-GPU

Also I noticed this the other day.

Distributed with a batch of 14, and only gpu1 with a batch of 7.

Shouldn't the distributed batch of 14 have roughly 2x the EG/s of the single gpu with a batch of 7?

Attachment: Screenshot from 2020-10-12 17-39-26.png
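(Rough expectation math, as a sanity check: EGs/sec is effectively batch size times iterations per second, so the distributed batch of 14 only reaches ~2x the single-GPU number if each GPU keeps the iteration rate it had on its own; cross-GPU sync overhead eats into that. The numbers below are purely hypothetical, for illustration only:)

    # Hypothetical figures, not measurements.
    single_batch, single_iters_per_sec = 7, 4.0        # one GPU on its own
    egs_single = single_batch * single_iters_per_sec   # 28 EGs/sec

    dist_batch, dist_iters_per_sec = 14, 4.0           # ideal: iteration rate unchanged
    egs_dist_ideal = dist_batch * dist_iters_per_sec   # 56 EGs/sec, the hoped-for 2x

    # In practice all-reduce/sync lowers dist_iters_per_sec, so the real figure lands below 2x.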
by dheinz70
Mon Oct 12, 2020 2:08 am
Forum: Training Support
Topic: Session Stats no longer appearing after a few hours of training
Replies: 15
Views: 11688

Log and graph weirdness

The Analysis tab shows more iterations than the status bar.

Also, the graph crashes or doesn't respond if you change smoothing and hit the refresh button.

Attachment: Screenshot from 2020-10-11 20-20-01.png
by dheinz70
Tue Oct 06, 2020 10:33 pm
Forum: Training Support
Topic: Distributed with Dual 2060 supers
Replies: 46
Views: 16383

Re: Distributed with Dual 2060 supers

Due to my snapping off the SATA connector on my HDD, I'm on a fresh install of Ubuntu 20.04, also with the 450 driver.

My only problem is the DFL-SAE model. (Allow growth checked everywhere it is an option)

All other models seem to work fine.
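(For context, as far as I understand it, Faceswap's "allow growth" option maps onto TensorFlow's per-GPU memory growth setting, which stops TF from reserving all VRAM up front. Roughly the equivalent in plain TF 2.x, as a sketch rather than Faceswap's actual code:)

    # Let the GPU memory pool grow on demand instead of grabbing it all at start-up.
    import tensorflow as tf

    for gpu in tf.config.list_physical_devices("GPU"):
        tf.config.experimental.set_memory_growth(gpu, True)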

by dheinz70
Tue Oct 06, 2020 9:39 pm
Forum: Training Support
Topic: Distributed with Dual 2060 supers
Replies: 46
Views: 16383

Re: Distributed with Dual 2060 supers

After further testing it looks like all my problems come from the DFL-SAE model. It will only train on GPU1. Training on GPU0 or Distributed fails.

Training Villain with a batch of 16 right now distributed.

Thanks for all the help.

by dheinz70
Sat Oct 03, 2020 4:21 am
Forum: Training Support
Topic: Distributed with Dual 2060 supers
Replies: 46
Views: 16383

Re: Distributed with Dual 2060 supers

I notice this A LOT in the verbose logs as it starts to train.

failed to allocate 3.24G (3477464576 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

Without MIXED PRECISION I see it a lot.

With it checked, just one or two instances. I'm beginning to think this is where the issue lies.
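(For anyone curious why mixed precision helps here: it computes most ops in float16 while keeping float32 master weights, roughly halving activation memory. In recent TF 2.x Keras terms it is roughly the call below; Faceswap exposes it through its own option rather than this call directly, as far as I know:)

    # Enable mixed precision globally: compute in float16, keep float32 variables.
    import tensorflow as tf
    from tensorflow.keras import mixed_precision

    mixed_precision.set_global_policy("mixed_float16")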

by dheinz70
Fri Oct 02, 2020 11:42 pm
Forum: Training Support
Topic: Distributed with Dual 2060 supers
Replies: 46
Views: 16383

Re: Distributed with Dual 2060 supers

Distributed Dlight works too.

Question: Is it normal for it to take 15-25 mins to start training on distributed? I've got a pretty beefy system, just wondering what's normal?

by dheinz70
Fri Oct 02, 2020 9:21 pm
Forum: Training Support
Topic: Distributed with Dual 2060 supers
Replies: 46
Views: 16383

Re: Distributed with Dual 2060 supers

Distributed, DFL-SAE batch 2, FAILS.

Distributed, Villain batch 4, works.

Could it be an issue with SAE and distributed?

by dheinz70
Fri Oct 02, 2020 8:09 pm
Forum: Training Support
Topic: Distributed with Dual 2060 supers
Replies: 46
Views: 16383

Re: Distributed with Dual 2060 supers

Redid alignments.

Distributed, Original, Batch 128 - worked!

Distributed, DFL-H128, batch 32 - worked!

I'll test more, but you might have figured it out. THANKS

by dheinz70
Fri Oct 02, 2020 6:44 pm
Forum: Training Support
Topic: Distributed with Dual 2060 supers
Replies: 46
Views: 16383

Re: Distributed with Dual 2060 supers

Selecting ONLY GPU0, I can get lightweight (batch 32), DF128 (10) and original working.

SAE with a batch of 1 fails.

Distributed - DFL-128 with batch of 8 running. This is weird.

I'll try the alignment thing next.

by dheinz70
Fri Oct 02, 2020 6:17 pm
Forum: Training Support
Topic: Distributed with Dual 2060 supers
Replies: 46
Views: 16383

Re: Distributed with Dual 2060 supers

Allow growth selected. Tried with batches of 2 and 1; still fails.

Attached is the system info.

by dheinz70
Fri Oct 02, 2020 4:27 pm
Forum: Training Support
Topic: Distributed with Dual 2060 supers
Replies: 46
Views: 16383

Re: Distributed with Dual 2060 supers

My DisplayPort-to-DVI adapter came this morning. I plugged monitor1 into gpu0, and Gnome came up using both screens as it did with my old single card. My guess is the Nvidia driver just doesn't want to have display0 plugged into gpu0 and display1 plugged into gpu1. It wants them both plugged into gp...
by dheinz70
Fri Oct 02, 2020 7:59 am
Forum: Training Support
Topic: Distributed with Dual 2060 supers
Replies: 46
Views: 16383

Re: Distributed with Dual 2060 supers

For any Linux users: I opened a ticket with Nvidia. Not a Faceswap issue; it is clearly an Nvidia driver problem.

https://forums.developer.nvidia.com/t/2 ... -04/156103

by dheinz70
Fri Oct 02, 2020 5:47 am
Forum: Training Support
Topic: Distributed with Dual 2060 supers
Replies: 46
Views: 16383

Re: Distributed with Dual 2060 supers

Definitely starting to look like it is a problem with X on Linux and probably not a FaceSwap issue. It doesn't like anything to do with GPU0 no matter how I have the cards installed (1 or 2). Found some others having similar problems; any help appreciated.
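(One quick way to see which GPU X has grabbed for the display is to query nvidia-smi's display flags. A small sketch, assuming nvidia-smi is on the PATH:)

    # List each GPU and whether a display is initialised/active on it,
    # to confirm which card is actually driving X.
    import subprocess

    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,name,display_mode,display_active",
         "--format=csv"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)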

by dheinz70
Fri Oct 02, 2020 4:01 am
Forum: Training Support
Topic: Distributed with Dual 2060 supers
Replies: 46
Views: 16383

Re: Distributed with Dual 2060 supers

The verbose log is too big to attach. It keeps giving

2020-10-01 22:41:27.315577: I tensorflow/stream_executor/cuda/cuda_driver.cc:763] failed to allocate 5.06G (5437426176 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-10-

by dheinz70
Fri Oct 02, 2020 3:27 am
Forum: Training Support
Topic: Distributed with Dual 2060 supers
Replies: 46
Views: 16383

Re: Distributed with Dual 2060 supers

I've been doing some playing with this. It seems that any attempt to use GPU0 fails. It keeps spitting out OOM errors. When I just select GPU1 it works like it is supposed to. I'm using Ubuntu 20.04 and I think it might be related to how my 2nd monitor (connected to the GPU1) won't work p...
by dheinz70
Thu Oct 01, 2020 3:54 am
Forum: Training Support
Topic: Distributed with Dual 2060 supers
Replies: 46
Views: 16383

Re: Distributed with Dual 2060 supers

Followup if I may. Why does it take so long to start? It stays at this point for quite a long time...

09/30/2020 22:31:50 INFO batch_all_reduce: 102 all-reduces with algorithm = nccl, num_packs = 1
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/j...
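(Those log lines appear to come from TensorFlow's MirroredStrategy building its NCCL all-reduce graph before the first iteration, which is a one-off setup cost. A minimal sketch of the kind of setup that produces them, not Faceswap's actual code:)

    # MirroredStrategy replicates the model across GPUs and combines gradients
    # with NCCL all-reduce; the reduce/broadcast graph is built before training starts.
    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy(
        cross_device_ops=tf.distribute.NcclAllReduce()
    )
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
        model.compile(optimizer="adam", loss="mse")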