[SOLVED] Nvidia driver stopped working out of nowhere on Ubuntu server 22.04

Koma52@lemmy.world to Linux@lemmy.ml – 13 points

Hello everyone,

Today the NVIDIA driver on my server stopped working out of nowhere. It was working yesterday and today it isn't. I didn't change anything yesterday or today.

My Plex container stopped working because of a problem with the NVIDIA card I use for transcoding. It's a GTX 1650.

I tried running nvidia-smi and it said Failed to initialize NVML: Driver/library version mismatch. Then I tried upgrading my system, since it had been months since my last upgrade and I thought it might help. It didn't. I also tried rebooting, because some sources said that solves the issue, but it persisted.

It's driver reinstall time. I purged the driver with apt purge nvidia*, then installed it with ubuntu-drivers install --gpgpu nvidia:525-server. After a reboot, nvidia-smi gives the error NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

lsmod | grep nvidia shows nothing and /proc/driver/nvidia/version doesn't exist. I tried starting nvidia-persistenced with systemctl, but it gives this error:

Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 113 has read and write permissions for those files.

/dev/nvidia* doesn't exist.
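
For anyone landing here with the same symptoms, the three checks above can be bundled into one small script. This is only a sketch: the function name nvidia_status is made up for illustration, and it only reads from /proc and /dev, so it does not need root.

```shell
#!/bin/sh
# Hypothetical helper: report the three things checked above
# (kernel module, driver version file, device nodes).
nvidia_status() {
    # /proc/modules is the file lsmod reads; each line starts with a module name.
    if grep -q '^nvidia ' /proc/modules 2>/dev/null; then
        echo "module: loaded"
    else
        echo "module: not loaded"
    fi

    # This file only exists while the nvidia kernel module is loaded.
    if [ -r /proc/driver/nvidia/version ]; then
        echo "version file: present"
    else
        echo "version file: missing"
    fi

    # ls fails when the glob matches nothing, which is the broken state described above.
    if ls /dev/nvidia* >/dev/null 2>&1; then
        echo "device nodes: present"
    else
        echo "device nodes: missing"
    fi
}

nvidia_status
```

In the state described in this post, all three lines would report the negative case.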

I'm very much a beginner when it comes to NVIDIA on Linux. It was a pain to set up initially and I was hoping it would never break, but unfortunately here I am. I don't really know which logs to show you or which commands to run to troubleshoot, so every tip is appreciated, and I will provide logs and anything else if needed.

System info:

  • Ubuntu Server 22.04
  • kernel: 5.15.0-76-generic
  • theoretically installed nvidia driver: nvidia-driver-525-server

Solution

I was using the ubuntu-drivers utility to install the driver, but it turns out it isn't that reliable. After installing with the manual method from https://help.ubuntu.com/community/NvidiaDriversInstallation using the command apt install linux-modules-nvidia-${DRIVER_BRANCH}${SERVER}-${LINUX_FLAVOUR}, it's working again.
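
To show how the variables in that command expand, here is a sketch filled in with the values from the system info above (525, the server branch, and a generic kernel). The KERNEL_RELEASE variable is an illustration only and is not part of the wiki's command; on a live system it would come from uname -r. The script prints the resulting apt command instead of running it, so the package name can be checked first.

```shell
#!/bin/sh
# Values assumed from the system info in this post; adjust for your machine.
KERNEL_RELEASE="5.15.0-76-generic"    # normally: KERNEL_RELEASE=$(uname -r)
DRIVER_BRANCH=525
SERVER=-server                         # leave empty for the non-server driver branch
LINUX_FLAVOUR=${KERNEL_RELEASE##*-}    # strips everything up to the last "-"

PKG="linux-modules-nvidia-${DRIVER_BRANCH}${SERVER}-${LINUX_FLAVOUR}"

# Print the command rather than run it.
echo "sudo apt install ${PKG}"
# prints: sudo apt install linux-modules-nvidia-525-server-generic
```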

Shows up in lspci. Booting a live OS would be a little bit tricky because it's in a wall mounted rack but I will try that if nothing else works. Thank you.

So it sees the hardware, but the kernel module isn't being loaded. I'd guess if you tried to load it with modprobe, it would complain about some version mismatch.
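
One way to test that guess before touching anything is to check whether any NVIDIA module files exist for the kernel that is currently running. This is a rough sketch: it assumes modules live under /lib/modules/<kernel release>, which is the usual location on Ubuntu, and the variable names are made up for illustration.

```shell
#!/bin/sh
KVER=$(uname -r)
echo "running kernel: ${KVER}"

# Look for NVIDIA module files built for the running kernel
# (covers both prebuilt and DKMS-built modules).
MODS=$(find "/lib/modules/${KVER}" -name 'nvidia*.ko*' 2>/dev/null || true)

if [ -n "$MODS" ]; then
    echo "nvidia modules found for this kernel"
else
    echo "no nvidia modules found for this kernel"
fi

# To try loading the module and read the kernel's complaint, if any (needs root):
#   sudo modprobe nvidia
#   sudo dmesg | tail
```

If no module files are found, modprobe has nothing to load for this kernel, which would match the empty lsmod output.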

So, I'd do the uninstall and reinstall processes on this page: https://help.ubuntu.com/community/NvidiaDriversInstallation

I was using the ubuntu-drivers utility that this page also mentions, but it turns out it doesn't work very well. I have now installed the driver with the manual method from this page using apt install linux-modules-nvidia-${DRIVER_BRANCH}${SERVER}-${LINUX_FLAVOUR} and it's working. Thank you for the suggestion!