I experimented with a variety of GPUs, and thought I'd share my numbers.
I basically just ran 10K iterations of "original" with batch size 16. I figured that's representative enough for entry-level stuff, but I'm sure things change once you get into bigger/fancier models.
Anyway, at least for these settings, it looks like the performance differences between the GPUs are small compared to the cost differences of using them. Sure, a P100 is faster, but a T4 is so much cheaper that the savings outweigh the P100's speed advantage.
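If you want to run the same cost-vs-speed comparison on your own setup, here's a rough sketch of the arithmetic. The function names and the two GPU entries are made up for illustration, and the rates and timings are placeholders, not my measurements or anyone's real prices. Plug in your own hourly rate and the wall-clock time for your run.

```python
# Placeholder inputs only: these are NOT measured numbers or real prices.
# Substitute your own hourly rate and wall-clock time for the benchmark run.

def cost_per_run(hourly_usd: float, run_seconds: float) -> float:
    """Dollar cost of one benchmark run at a given hourly rate."""
    return hourly_usd * run_seconds / 3600.0


def cheaper_option(runs: dict) -> str:
    """Return the GPU label with the lowest cost per run.

    `runs` maps a GPU label to a (hourly_usd, run_seconds) tuple.
    """
    return min(runs, key=lambda name: cost_per_run(*runs[name]))


if __name__ == "__main__":
    # Hypothetical example: one GPU is faster, the other is cheaper per hour.
    hypothetical = {
        "gpu_fast": (1.50, 3000.0),   # pricier, finishes sooner
        "gpu_cheap": (0.40, 4500.0),  # cheaper, takes 50% longer
    }
    for name, (rate, secs) in hypothetical.items():
        print(f"{name}: ${cost_per_run(rate, secs):.2f} per run")
    print("cheapest:", cheaper_option(hypothetical))
```

The point is just that a slower card can still win on cost per run whenever its price gap is bigger than its speed gap.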
Also, it looks like having multiple GPUs doesn't really help at all... which isn't too surprising, since iterations run sequentially and likely don't distribute well.
Anyway, just for funzies, here are the numbers.
I'd be really curious about other people's experiences with other configurations... and whether A100 is all that.