For some reason my training runs super slow on a V100 even though it flies on a Titan RTX. I'm using the same settings on both GPUs. When I check nvidia-smi, the script uses 12+ GB of VRAM on the Titan RTX, but always 305 MB on the V100. Am I doing something wrong? As far as I know, the V100 should be faster than just about any other GPU.
Training runs painfully slow on V100
Solved.
Both machines had CUDA 10.1 (among other versions), but the RTX had cuDNN 7.5.1 (compatible with CUDA 10.1), while the V100 only had cuDNN 7.1.4 (incompatible with CUDA 10.1). I was using TF 1.14.0. Downgrading to TF 1.12.3, which works with CUDA 9.2 (installed on both machines, and compatible with cuDNN 7.1.4), fixed everything.