Training runs painfully slow on V100

If training is failing to start and you are not receiving an error message telling you what to do, tell us about it here.


Forum rules

Read the FAQs and search the forum before posting a new topic.

This forum is for reporting errors with the Training process. If you want to get tips, or better understand the Training process, then you should look in the Training Discussion forum.

Please mark any answers that fixed your problems so others can find the solutions.

koko191
Posts: 3
Joined: Tue Aug 27, 2019 4:17 pm
Answers: 1

Training runs painfully slow on V100

Post by koko191 »

For some reason my training runs super slow on the V100, even though it runs like a mad lad on a Titan RTX. I'm using the same settings on both GPUs. When I check nvidia-smi, the script uses 12+ GB of VRAM on the Titan RTX, but only ever around 305 MB on the V100. Am I doing something wrong? From what I know, the V100 should be at least as fast as the Titan RTX, not dramatically slower.
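
The ~305 MB footprint usually means TensorFlow never actually initialized the GPU and the ops are landing on the CPU. A quick way to check is to list the devices TensorFlow registered at startup; this is a minimal TF 1.x sketch, nothing in it beyond the standard API calls:

Code:

# Minimal TF 1.x sketch: confirm TensorFlow actually registered the GPU.
# If the V100 is missing from this list, TF has silently fallen back to the CPU,
# which would match the tiny ~305 MB nvidia-smi footprint.
import tensorflow as tf
from tensorflow.python.client import device_lib

for dev in device_lib.list_local_devices():
    print(dev.name, dev.device_type, dev.physical_device_desc)

print("GPU available:", tf.test.is_gpu_available(cuda_only=True))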

koko191
Posts: 3
Joined: Tue Aug 27, 2019 4:17 pm
Answers: 1

Re: Training runs painfully slow on V100

Post by koko191 »

Solved.

Both machines had CUDA 10.1 installed (among other versions), but the RTX machine had cuDNN 7.5.1 (compatible with CUDA 10.1) while the V100 machine only had cuDNN 7.1.4, which is not compatible with CUDA 10.1. I was running TF 1.14.0. Downgrading to TF 1.12.3 fixed it: that release works with CUDA 9.2, which both machines had, and CUDA 9.2 is compatible with cuDNN 7.1.4. Everything works now.
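
For anyone else hitting this: a quick way to confirm the fix, i.e. that TensorFlow really places work on the GPU instead of silently falling back to the CPU when cuDNN fails to load, is to force a cuDNN-backed op onto the GPU with soft placement disabled. This is a minimal TF 1.x sketch; the '/gpu:0' device string and the tensor shapes are just example values:

Code:

# Minimal TF 1.x sketch: fail loudly if the GPU/cuDNN setup is broken,
# instead of silently training on the CPU.
import tensorflow as tf

config = tf.ConfigProto(log_device_placement=True,   # print where each op runs
                        allow_soft_placement=False)   # refuse CPU fallback

with tf.device('/gpu:0'):
    x = tf.random_normal([1, 64, 64, 3])
    w = tf.random_normal([3, 3, 3, 8])
    y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')  # conv2d on GPU goes through cuDNN

with tf.Session(config=config) as sess:
    print(sess.run(y).shape)  # raises an error here if the op cannot be placed on the GPU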
