Some benchmarking numbers

Replicon · Post by **Replicon** » Sat Apr 17, 2021 5:08 am

I experimented with a variety of GPUs, and thought I'd share my numbers.

I basically just ran 10K iterations of "original" with batch size 16. I figured it's representative enough for entry level stuff, but I'm sure things change once you get into bigger/fancier models.

Anyway, at least for these settings, it kind of looks like the performance differences between the GPUs are small compared to the cost differences of using them. Sure, a P100 performs better, but a T4 is so much cheaper that the savings outweigh the performance benefit.

Also, looks like having multiple GPUs doesn't really help at all... which isn't too surprising, since iterations are sequential, and likely aren't very distributable.

Anyway, just for funzies, here's the numbers.

I'd be really curious about other people's experiences with other configurations... and whether A100 is all that.

sp13 · Post by **sp13** » Sat Apr 17, 2021 5:09 pm

Interesting. I was wondering how the P100 would perform compared to the T4.
So far my experiments have been with a T4 on n1-standard-2. It doesn't seem to hit any CPU or system RAM limits, and the smaller machine could save like 10 cents an hour.

sp13 · Post by **sp13** » Mon Apr 19, 2021 7:48 pm

I'm curious about the poor V100 numbers you were seeing.

I haven't run any specific benchmarks, but yesterday I was training a realface model on an n1-standard-2 + T4. I was getting around 78 EG/s (batchsize = 32, mixed precision enabled) with CPU utilization at about 69%.

Today I wanted to redo the model with different coverage, but I couldn't get a T4. So I am currently running on an n1-standard-2 + V100, same settings (except coverage %), same data, and getting about 126 EG/s. CPU is at about 96% as the 2 little cores try to keep up.

UPDATE: I moved the V100 to an n1-standard-4, which got CPU usage down below 80% and training up to 195 EG/s. So the extra cores are worth it in this case.
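For anyone who wants to compare their own runs the same way, here is a quick sketch of the ratios between the three EG/s figures above. The numbers are the ones reported in this post; the `speedup` helper and the dictionary keys are just illustrative names, not anything from faceswap itself.

```python
# Throughput figures (EG/s) as reported in this post.
# The keys are illustrative labels, not faceswap or GCP identifiers.
throughput = {
    "t4_n1_standard_2": 78.0,
    "v100_n1_standard_2": 126.0,
    "v100_n1_standard_4": 195.0,
}


def speedup(new_rate: float, base_rate: float) -> float:
    """Return how many times faster new_rate is than base_rate."""
    if base_rate <= 0:
        raise ValueError("base_rate must be positive")
    return new_rate / base_rate


base = throughput["t4_n1_standard_2"]
for name, rate in throughput.items():
    print(f"{name}: {rate:.0f} EG/s, {speedup(rate, base):.2f}x the T4 run")

# Effect of the extra CPU cores on the V100 alone.
cpu_gain = speedup(throughput["v100_n1_standard_4"],
                   throughput["v100_n1_standard_2"])
print(f"V100, 4 cores vs 2 cores: {cpu_gain:.2f}x")
```

Going by these figures, the CPU-starved V100 is only about 1.6 times the T4, and it takes the extra cores to reach about 2.5 times.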

So the V100 does give a real performance increase, but not enough to be worth the much greater cost.

(BTW, this model will train at 11 EG/s on my GTX 1060 with batchsize = 3)

Replicon · Post by **Replicon** » Sat May 01, 2021 3:31 pm

Thanks for the tip, will have to try with an n1-standard-2. I use preemptible hardware, so the savings would only add up to 2 cents per hour, but it's free to change my script's config.

sp13 · Post by **sp13** » Sat May 01, 2021 9:20 pm

Yeah, I always use preempt when I can. The particular day that I ran the V100 I couldn't get a T4 even at non-preempt price. I haven't had a problem since.

I stopped my realface experiments because they didn't play nice with my little 6 GB GTX 1060. But villain with mixed precision is very nice on a T4. I get a training rate 2.5 times higher than my 1060 due to larger batch size and tensor cores. I will test it on a V100 if I end up with some extra credits.

I also did a little experiment just to see how bad conversion (not training) is on CPU only. Not surprisingly, it's bad. Using a T4 was 25 times faster than an n2d-highcpu-8, at about twice the price.

Oh, also, if you don't get an external IP address it will still let you tunnel in for SSH, and you can save like 2 tenths of a cent per hour.

Replicon · Post by **Replicon** » Sun May 02, 2021 8:24 pm

Haha Money$$$$$

I should experiment with Villain, it seems like it's the state of the art right now. I wonder if you get better bang for your buck for running it overnight, vs with "original" as I've been doing. In the comparison thread, even lightweight is looking pretty good haha. Maybe I can stage a T4 race between the two. It's probably a more fair comparison to keep "time spent running" the same, rather than "number of iterations", since time is the real commodity.

sp13 · Post by **sp13** » Mon May 03, 2021 8:04 pm

I did have some spare credits so I've been playing with a V100 some more. Using something similar to Villain I get about 2.5 times the performance of the T4, which is exactly what my Realface experiments showed.

The V100 is a hungry beast. On an n1-standard-4, faceswap averaged about 333% CPU so not maxing out but some threads looked like they could use help. I'm currently trying a custom config with 6 cores and 12 GB RAM and get about an extra 5% performance.

Still not worth it at 5 to 6 times the cost of a T4 but it is fun.
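To put a rough number on "not worth it", here is a small sketch that turns a price ratio and a speedup into a relative cost per training example. It only uses the ratios mentioned in this thread (about 2.5 times the throughput at 5 to 6 times the cost); the function name is made up for illustration, and it assumes cost scales linearly with run time.

```python
def relative_cost_per_example(price_ratio: float, speedup: float) -> float:
    """Cost per training example on the faster GPU, relative to the baseline.

    price_ratio: hourly price of the faster GPU divided by the baseline price.
    speedup: throughput of the faster GPU divided by the baseline throughput.
    Assumes cost is simply hourly price times run time.
    """
    if price_ratio <= 0 or speedup <= 0:
        raise ValueError("price_ratio and speedup must be positive")
    return price_ratio / speedup


# V100 vs T4, using the ratios quoted in this thread.
for price_ratio in (5.0, 6.0):
    ratio = relative_cost_per_example(price_ratio, speedup=2.5)
    print(f"{price_ratio:.0f}x the price, 2.5x the speed: "
          f"{ratio:.1f}x the cost per example")
```

With those inputs, each example trained on the V100 costs roughly 2 to 2.4 times what it costs on the T4, so the V100 only makes sense if wall-clock time matters more than credits.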
