
No GPU detected (on subsequent commands - first command is fine)

Posted: Mon Jul 05, 2021 2:13 pm
by Replicon

I was running a slightly elaborate thing on a cloud instance, and it looked like the GPU disappeared between commands.

Training a model ran smoothly for a few hours, but once it had saved the model and I ran convert, I got the "no gpu detected" error:

Code:

An unhandled exception occured loading pynvml. Original error: RM has detected an NVML/RM version mismatch.
No GPU detected. Switching to CPU mode

This is on a non-preemptible instance, so I assume the GPU isn't getting pulled out from under me.

I've seen this error before on the very first command I run, and I assumed it was a GCE bug where the GPU wasn't properly attached at setup. But lately I've been seeing it happen between commands (e.g. a successful train, then convert falls back to CPU).

I've never seen it crash (or slow down significantly) midway through a round of training, which leads me to believe it's a software problem, though I'm not sure which software. Is something (e.g. a system update) deleting or modifying the NVIDIA drivers, resulting in a driver version mismatch? If so, maybe I just need to refresh my image and/or configure it to disable all system updates.

Has anyone else encountered this?


Re: No GPU detected (on subsequent commands - first command is fine)

Posted: Mon Jul 05, 2021 2:43 pm
by Replicon

Aha, running 'nvidia-smi' also results in: 'Failed to initialize NVML: Driver/library version mismatch'
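
For anyone wanting to confirm the same thing, here is a rough Python sketch that compares the version of the kernel module currently loaded with the version of the NVML library on disk. The paths are assumptions about a typical Ubuntu x86_64 driver install (/proc/driver/nvidia/version and /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.*) and may differ on other images; the function names are made up for illustration.

```python
import glob
import re
from typing import List, Optional


def kernel_module_version(proc_text: str) -> Optional[str]:
    """Return the dotted version found on the 'NVRM version:' line, if any."""
    match = re.search(r"NVRM version:.*?\s(\d+(?:\.\d+)+)\b", proc_text)
    return match.group(1) if match else None


def library_versions(paths: List[str]) -> List[str]:
    """Collect version suffixes from names like libnvidia-ml.so.<version>.

    Unversioned names and the plain '.so.1' symlink are ignored, because
    they do not carry the driver version in the file name.
    """
    versions = set()
    for path in paths:
        match = re.search(r"libnvidia-ml\.so\.(\d+(?:\.\d+)+)$", path)
        if match:
            versions.add(match.group(1))
    return sorted(versions)


if __name__ == "__main__":
    try:
        # Assumed location; only present while the nvidia module is loaded.
        with open("/proc/driver/nvidia/version") as handle:
            loaded = kernel_module_version(handle.read())
    except OSError:
        loaded = None

    # Assumed library directory for a 64-bit Ubuntu install.
    installed = library_versions(
        glob.glob("/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.*")
    )

    print("loaded kernel module:", loaded)
    print("NVML library on disk:", installed)
    if loaded and installed and loaded not in installed:
        print("versions differ, which would fit the mismatch error")
```

If the two versions differ, the library was probably replaced on disk while the older kernel module stayed loaded.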

And I can see nvidia updates in /var/log/apt/history.log

Sure enough, bringing up a fresh instance and running 'sudo apt-get dist-upgrade' puts it into the bad state.

Unfortunately, even rebooting doesn't get it to recover. After the reboot, nvidia-smi fails with a different message: "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running."

Could this be a product of how the driver is installed in the provided image?