No GPU detected (on subsequent commands - first command is fine)

Post by Replicon »

I was running a slightly elaborate thing on a cloud instance, and it looked like the GPU disappeared between commands.

Training a model ran smoothly for a few hours, but when the script saved the model and ran convert, I got the "No GPU detected" error:

Code:

An unhandled exception occured loading pynvml. Original error: RM has detected an NVML/RM version mismatch.
No GPU detected. Switching to CPU mode

This is on a non-preemptible instance, so I assume the GPU isn't getting pulled out from under me.

I've seen this happen before, where I get the error as soon as I run my first command, and I assumed it was a GCE bug where the GPU wasn't properly attached at setup... but lately I've been seeing it happen between commands (e.g. a successful train, then convert falls back to CPU).

I've never seen it crash (or slow down significantly) midway through a round of training, which leads me to believe it's a software problem, though I'm not sure which software. Is something (e.g. a system update) deleting or modifying the NVIDIA drivers, resulting in a driver version mismatch? Maybe that's the problem and I just need to refresh my image and/or configure it to disable automatic system updates.
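
If it does come down to system updates, I'm guessing something along these lines would pin the driver packages so apt leaves them alone (package names vary by image, and I haven't verified this on GCE, so treat it as a sketch):

Code:

# List the NVIDIA packages actually installed on this image (names vary)
dpkg -l | grep -Ei '^ii +(nvidia|libnvidia)'

# Pin whatever is installed so dist-upgrade / unattended-upgrades can't replace it
sudo apt-mark hold $(dpkg-query -W -f='${Package}\n' 'nvidia*' 'libnvidia*' 2>/dev/null)

# And/or stop background updates entirely
sudo systemctl stop unattended-upgrades
sudo systemctl disable unattended-upgrades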

Has anyone else encountered this?

Re: No GPU detected (on subsequent commands - first command is fine)

Post by Replicon »

Aha, running 'nvidia-smi' also results in: 'Failed to initialize NVML: Driver/library version mismatch'

And I can see NVIDIA driver updates in /var/log/apt/history.log.
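
In case anyone wants to check their own instance, this is roughly what I looked at (older transactions get rotated into history.log.*.gz):

Code:

# Show recent apt transactions that touched the NVIDIA packages
grep -i -B2 nvidia /var/log/apt/history.log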

Sure enough, bringing up a fresh instance and running 'sudo apt-get dist-upgrade' puts it into the bad state.
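
If you want to see it coming before committing, a dry run (apt's standard simulate flag, nothing GCE-specific) shows which NVIDIA packages the upgrade would touch:

Code:

# Simulate the upgrade and check whether it wants to replace the driver
apt-get -s dist-upgrade | grep -i nvidia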

Unfortunately, even rebooting doesn't seem to get it to recover, and nvidia-smi fails with "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running."
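
For what it's worth, these are the kinds of things I'd poke at to see what state the driver is left in (assuming a dkms/apt-based install; images that used NVIDIA's runfile installer won't show anything in dkms):

Code:

# Is a kernel module available/loaded for the running kernel?
uname -r
lsmod | grep nvidia
dkms status 2>/dev/null

# Try loading the module by hand and check the kernel log for the real error
sudo modprobe nvidia
sudo dmesg | grep -i nvidia | tail -n 20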

Could this be a product of how the driver is installed in the provided image?
