"Driver/library version mismatch" error when running nvidia-smi

개발허재 2023. 11. 7. 19:35
root@DS-DEV-002:/home# nvidia-smi
Failed to initialize NVML: Driver/library version mismatch

When I logged in to the GPU server for the first time in a week and ran nvidia-smi, I got the error above.
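This error generally means that the NVIDIA kernel module still loaded in memory and the user-space libraries on disk belong to different driver versions. A minimal sketch for confirming that, assuming /proc/driver/nvidia/version has its usual "NVRM version: ..." first line. The helper names loaded_module_version, upstream_version, and versions_match are made up for illustration.

```shell
# Read the text of /proc/driver/nvidia/version on stdin and print the first
# dotted numeric field on the NVRM line, which is assumed to be the version
# of the kernel module that is currently loaded.
loaded_module_version() {
  awk '/^NVRM version:/ {
         for (i = 1; i <= NF; i++)
           if ($i ~ /^[0-9]+\.[0-9]+(\.[0-9]+)*$/) { print $i; exit }
       }'
}

# Strip the Debian/Ubuntu revision from a package version,
# e.g. 525.147.05-0ubuntu0.20.04.1 -> 525.147.05
upstream_version() {
  printf '%s\n' "${1%%-*}"
}

# Report whether two version strings agree.
versions_match() {
  if [ "$1" = "$2" ]; then
    echo "match"
  else
    echo "mismatch: $1 vs $2"
  fi
}

# Example on an affected server (not executed here; the package name
# libnvidia-compute-525 is an assumption that fits this machine):
#   versions_match \
#     "$(loaded_module_version < /proc/driver/nvidia/version)" \
#     "$(upstream_version "$(dpkg-query -W -f='${Version}' libnvidia-compute-525)")"
```

If the two versions differ, either the old module has to be unloaded so the new one can load, or the machine has to be rebooted.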

 

Looking at /var/log/apt/history.log, I found that the NVIDIA packages had been upgraded automatically by unattended-upgrade without my knowledge, as shown below.

Start-Date: 2023-11-04  06:23:09
Commandline: /usr/bin/unattended-upgrade
Upgrade: libnvidia-compute-525:amd64 (525.125.06-0ubuntu1, 525.147.05-0ubuntu0.20.04.1), libnvidia-encode-525:amd64 (525.125.06-0ubuntu1, 525.147.05-0ubuntu0.20.04.1), nvidia-kernel-common-525:amd64 (525.125.06-0ubuntu1, 525.147.05-0ubuntu0.20.04.1), xserver-xorg-video-nvidia-525:amd64 (525.125.06-0ubuntu1, 525.147.05-0ubuntu0.20.04.1), libnvidia-gl-525:amd64 (525.125.06-0ubuntu1, 525.147.05-0ubuntu0.20.04.1), libnvidia-fbc1-525:amd64 (525.125.06-0ubuntu1, 525.147.05-0ubuntu0.20.04.1), libnvidia-decode-525:amd64 (525.125.06-0ubuntu1, 525.147.05-0ubuntu0.20.04.1), libnvidia-cfg1-525:amd64 (525.125.06-0ubuntu1, 525.147.05-0ubuntu0.20.04.1), nvidia-utils-525:amd64 (525.125.06-0ubuntu1, 525.147.05-0ubuntu0.20.04.1), nvidia-dkms-525:amd64 (525.125.06-0ubuntu1, 525.147.05-0ubuntu0.20.04.1), nvidia-compute-utils-525:amd64 (525.125.06-0ubuntu1, 525.147.05-0ubuntu0.20.04.1), nvidia-driver-525:amd64 (525.125.06-0ubuntu1, 525.147.05-0ubuntu0.20.04.1), libnvidia-extra-525:amd64 (525.125.06-0ubuntu1, 525.147.05-0ubuntu0.20.04.1), nvidia-kernel-source-525:amd64 (525.125.06-0ubuntu1, 525.147.05-0ubuntu0.20.04.1)
Error: Sub-process /usr/bin/dpkg returned an error code (1)
End-Date: 2023-11-04  06:23:40
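To spot entries like this without reading the whole log, a small filter can help. The sketch below assumes the default history.log layout, with Start-Date, Commandline, and Upgrade lines in each entry. The function name nvidia_upgrade_dates is hypothetical.

```shell
# Read an apt history log on stdin and print the Start-Date of every run
# that was launched by unattended-upgrade and whose Upgrade line mentions
# nvidia.
nvidia_upgrade_dates() {
  awk '
    /^Start-Date:/  { date = $2 " " $3; auto = 0 }
    /^Commandline:/ { auto = ($0 ~ /unattended-upgrade/) }
    /^Upgrade:/     { if (auto && $0 ~ /nvidia/) print date }
  '
}

# Example (not executed here):
#   nvidia_upgrade_dates < /var/log/apt/history.log
```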

 

So I am sharing the steps I pieced together from several blogs to fix it.

(Switch to root before running these commands!)

 

1. Use sudo lsof /dev/nvidia* to list the processes that are holding the NVIDIA devices open, then kill all of them.

root@DS-DEV-002:/home# sudo lsof /dev/nvidia*
COMMAND      PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
Xorg        1264 root  mem    CHR   195,0           466 /dev/nvidia0
Xorg        1264 root  mem    CHR 195,255           465 /dev/nvidiactl
Xorg        1264 root    3u   CHR 195,255      0t0  465 /dev/nvidiactl
Xorg        1264 root    8u   CHR 195,254      0t0  469 /dev/nvidia-modeset
Xorg        1264 root   13u   CHR 195,255      0t0  465 /dev/nvidiactl
Xorg        1264 root   17u   CHR   195,0      0t0  466 /dev/nvidia0
Xorg        1264 root   18u   CHR   195,0      0t0  466 /dev/nvidia0
Xorg        1264 root   19u   CHR   195,0      0t0  466 /dev/nvidia0
Xorg        1264 root   23u   CHR   195,0      0t0  466 /dev/nvidia0
Xorg        1264 root   24u   CHR   195,0      0t0  466 /dev/nvidia0
Xorg        1264 root   26u   CHR   195,0      0t0  466 /dev/nvidia0
Xorg        1264 root   27u   CHR   195,0      0t0  466 /dev/nvidia0
Xorg        1264 root   28u   CHR   195,0      0t0  466 /dev/nvidia0
Xorg        1264 root   38u   CHR   195,0      0t0  466 /dev/nvidia0
Xorg        1264 root   42u   CHR   195,0      0t0  466 /dev/nvidia0
nvidia-de  16004 root    3u   CHR 195,255      0t0  465 /dev/nvidiactl
nvidia-de  16004 root    5u   CHR   195,0      0t0  466 /dev/nvidia0
nvidia-de  16004 root   10u   CHR   195,0      0t0  466 /dev/nvidia0
nvidia-de  16004 root   11u   CHR   195,0      0t0  466 /dev/nvidia0
nvidia-de  16004 root   13u   CHR   195,0      0t0  466 /dev/nvidia0
kubelet   981602 root   39u   CHR 195,255      0t0  465 /dev/nvidiactl
kubelet   981602 root   40u   CHR   195,0      0t0  466 /dev/nvidia0
kubelet   981602 root   43u   CHR   195,0      0t0  466 /dev/nvidia0
kubelet   981602 root   44u   CHR   195,0      0t0  466 /dev/nvidia0


root@DS-DEV-002:/home# kill -9 1264 16004 981602
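Collecting the PIDs by hand gets tedious when the lsof listing is long. A small sketch, assuming the standard lsof column layout with the PID in the second column. pids_from_lsof is a hypothetical helper name.

```shell
# Read `lsof` output on stdin, skip the header line, and print each distinct
# PID (second column) once, space-separated, in first-seen order.
pids_from_lsof() {
  awk 'NR > 1 && !seen[$2]++ { printf "%s%s", sep, $2; sep = " " }
       END { print "" }'
}

# Example (not executed here):
#   kill -9 $(lsof /dev/nvidia* | pids_from_lsof)
# `lsof -t /dev/nvidia*` prints bare PIDs directly and would also work.
```

Services supervised by systemd (kubelet in the listing above is a likely candidate) may be restarted automatically and reopen the device, so it may be necessary to stop the service rather than only kill the process.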

 

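2. Unload the NVIDIA kernel modules. If lsmod shows nothing afterwards, the modules were removed successfully. (Everything up to this point is the procedure for getting the NVIDIA-related processes running again without rebooting the system.)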

root@DS-DEV-002:/home# sudo rmmod nvidia_drm
root@DS-DEV-002:/home# sudo rmmod nvidia_modeset
root@DS-DEV-002:/home# sudo rmmod nvidia_uvm
root@DS-DEV-002:/home# sudo rmmod nvidia

root@DS-DEV-002:/home# lsmod | grep nvidia
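The order of the rmmod calls matters: rmmod refuses to unload a module that another loaded module still depends on, so nvidia itself has to go last. A sketch that derives the unload list from lsmod output, assuming only the four module names used above. nvidia_unload_order is a hypothetical helper name.

```shell
# Read `lsmod` output on stdin and print the NVIDIA modules that are loaded,
# dependents first and the core `nvidia` module last.
nvidia_unload_order() {
  loaded=$(awk '{ print $1 }')
  for m in nvidia_drm nvidia_modeset nvidia_uvm nvidia; do
    # -x matches the whole line, so `nvidia` does not match `nvidia_uvm`.
    printf '%s\n' "$loaded" | grep -qx "$m" && echo "$m"
  done
  return 0
}

# Example (not executed here):
#   for m in $(lsmod | nvidia_unload_order); do rmmod "$m"; done
```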

 

3. If running nvidia-smi again still fails with an NVIDIA driver error like the one below, continue with the following steps.

root@DS-DEV-002:/home# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

 

4. Add the PPA that provides graphics driver updates.

root@DS-DEV-002:/home# add-apt-repository ppa:graphics-drivers/ppa --yes

 

5. Check which nvidia-driver package is currently installed.

root@DS-DEV-002:/home# apt --installed list | grep nvidia-driver

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

nvidia-driver-525/focal-updates,focal-security,focal,now 525.147.05-0ubuntu0.20.04.1 amd64 [installed,upgradable to: 525.147.05-0ubuntu1

 

6. Remove that version of the NVIDIA driver.

root@DS-DEV-002:/home# apt-get remove nvidia-driver-525
Reading package lists... Done
Building dependency tree
Reading state information... Done
You might want to run 'apt --fix-broken install' to correct these.
The following packages have unmet dependencies:
 cuda-drivers-525 : Depends: nvidia-driver-525 (>= 525.125.06) but it is not going to be installed
 nvidia-dkms-525 : Depends: nvidia-kernel-common-525 (>= 525.147.05) but 525.125.06-0ubuntu1 is to be installed
E: Unmet dependencies. Try 'apt --fix-broken install' with no packages (or specify a solution).

 

7. If you get an Unmet dependencies error like the one above, run the suggested "apt --fix-broken install" command.

root@DS-DEV-002:/home# apt --fix-broken install
Reading package lists... Done
Building dependency tree
Reading state information... Done
Correcting dependencies... Done
The following additional packages will be installed:
  libnvidia-cfg1-525 libnvidia-common-525 libnvidia-compute-525 libnvidia-decode-525 libnvidia-encode-525 libnvidia-extra-525 libnvidia-fbc1-525
  libnvidia-gl-525 nvidia-compute-utils-525 nvidia-dkms-525 nvidia-driver-525 nvidia-kernel-common-525 nvidia-kernel-source-525 nvidia-utils-525
  xserver-xorg-video-nvidia-525
Recommended packages:
  libnvidia-compute-525:i386 libnvidia-decode-525:i386 libnvidia-encode-525:i386 libnvidia-fbc1-525:i386 libnvidia-gl-525:i386
The following packages will be upgraded:
  libnvidia-cfg1-525 libnvidia-common-525 libnvidia-compute-525 libnvidia-decode-525 libnvidia-encode-525 libnvidia-extra-525 libnvidia-fbc1-525
  libnvidia-gl-525 nvidia-compute-utils-525 nvidia-dkms-525 nvidia-driver-525 nvidia-kernel-common-525 nvidia-kernel-source-525 nvidia-utils-525
  xserver-xorg-video-nvidia-525
15 upgraded, 0 newly installed, 0 to remove and 119 not upgraded.
13 not fully installed or removed.

nvidia-uvm.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/5.4.0-166-generic/updates/dkms/

depmod...

DKMS: install completed.
Setting up libnvidia-encode-525:amd64 (525.147.05-0ubuntu1) ...
Setting up nvidia-driver-525 (525.147.05-0ubuntu1) ...
Processing triggers for man-db (2.9.1-1) ...
Processing triggers for dbus (1.12.16-2ubuntu2.3) ...
Processing triggers for libc-bin (2.31-0ubuntu9.7) ...
Processing triggers for initramfs-tools (0.136ubuntu6.6) ...
update-initramfs: Generating /boot/initrd.img-5.4.0-166-generic
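The output above mentions 13 packages that were not fully installed or removed. To confirm that nothing is left in that state afterwards, one option is to scan the dpkg -l listing for NVIDIA packages whose status column is anything other than ii. This sketch assumes the usual dpkg -l layout, with the status flags in the first column and the package name in the second. dpkg_problem_packages is a hypothetical helper name.

```shell
# Read `dpkg -l` output on stdin and print "<status> <package>" for every
# NVIDIA package whose status flags are not "ii" (cleanly installed).
# Header lines are skipped because their first field is not a 2-3 letter
# status code.
dpkg_problem_packages() {
  awk '$1 ~ /^[a-z][A-Za-z][A-Z]?$/ && $1 != "ii" && $2 ~ /nvidia/ {
         print $1, $2
       }'
}

# Example (not executed here):
#   dpkg -l | dpkg_problem_packages
```

An empty result suggests the dependency repair completed. Entries marked rc are packages that were removed but still have configuration files, which is usually harmless.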

 

8. Then remove nvidia-driver again.

root@DS-DEV-002:/home# apt-get remove nvidia-driver-525
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
  dctrl-tools dkms libnvidia-cfg1-525 libnvidia-common-525 libnvidia-compute-525 libnvidia-decode-525 libnvidia-encode-525 libnvidia-extra-525
  libnvidia-fbc1-525 libnvidia-gl-525 libvdpau1 libxnvctrl0 mesa-vdpau-drivers nvidia-compute-utils-525 nvidia-dkms-525 nvidia-fabricmanager-525
  nvidia-kernel-common-525 nvidia-kernel-source-525 nvidia-modprobe nvidia-prime nvidia-settings nvidia-utils-525 pkg-config
  screen-resolution-extra vdpau-driver-all xserver-xorg-video-nvidia-525
Use 'sudo apt autoremove' to remove them.
The following additional packages will be installed:
  nvidia-fabricmanager-525
The following packages will be REMOVED:
  cuda-drivers-525 cuda-drivers-fabricmanager-525 nvidia-driver-525
The following packages will be upgraded:
  nvidia-fabricmanager-525
1 upgraded, 0 newly installed, 3 to remove and 116 not upgraded.
Need to get 1,511 kB of archives.
After this operation, 1,380 kB disk space will be freed.
Do you want to continue? [Y/n] Y
Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  nvidia-fabricmanager-525 525.147.05-1 [1,511 kB]
Fetched 1,511 kB in 1s (2,708 kB/s)
(Reading database ... 177210 files and directories currently installed.)
Removing libvdpau1:amd64 (1.3-1ubuntu2) ...
Removing libxnvctrl0:amd64 (535.86.10-0ubuntu1) ...
Removing nvidia-fabricmanager-525 (525.147.05-1) ...
Removing nvidia-kernel-common-525 (525.147.05-0ubuntu1) ...
update-initramfs: deferring update (trigger activated)
Removing nvidia-kernel-source-525 (525.147.05-0ubuntu1) ...
Removing nvidia-modprobe (535.86.10-0ubuntu1) ...
Removing nvidia-prime (0.8.16~0.20.04.2) ...
Removing pkg-config (0.29.1-0ubuntu4) ...
Removing screen-resolution-extra (0.18build1) ...
Removing libnvidia-compute-525:amd64 (525.147.05-0ubuntu1) ...
Processing triggers for mime-support (3.64ubuntu1) ...
Processing triggers for initramfs-tools (0.136ubuntu6.6) ...
update-initramfs: Generating /boot/initrd.img-5.4.0-166-generic
Processing triggers for gnome-menus (3.36.0-1ubuntu1) ...
Processing triggers for libc-bin (2.31-0ubuntu9.7) ...
Processing triggers for man-db (2.9.1-1) ...
Processing triggers for desktop-file-utils (0.24-1ubuntu3) ...

 

9. List the nvidia-driver versions that are available to install.

If the ubuntu-drivers command is missing, install it with "apt install ubuntu-drivers-common".

root@DS-DEV-002:/home# ubuntu-drivers devices
ERROR:root:could not open aplay -l
Traceback (most recent call last):
  File "/usr/share/ubuntu-drivers-common/detect/sl-modem.py", line 35, in detect
    aplay = subprocess.Popen(
  File "/usr/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.8/subprocess.py", line 1704, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'aplay'
== /sys/devices/pci0000:00/0000:00:0f.0 ==
modalias : pci:v000015ADd00000405sv000015ADsd00000405bc03sc00i00
vendor   : VMware
model    : SVGA II Adapter
manual_install: True
driver   : open-vm-tools-desktop - distro free

== /sys/devices/pci0000:00/0000:00:15.0/0000:03:00.0 ==
modalias : pci:v000010DEd00001DB5sv000010DEsd00001249bc03sc02i00
vendor   : NVIDIA Corporation
model    : GV100GL [Tesla V100 SXM2 32GB]
driver   : nvidia-driver-535-server - distro non-free
driver   : nvidia-driver-530 - third-party non-free
driver   : nvidia-driver-535 - third-party non-free
driver   : nvidia-driver-515 - third-party non-free
driver   : nvidia-driver-525-server - distro non-free
driver   : nvidia-driver-460 - third-party non-free
driver   : nvidia-driver-470 - third-party non-free
driver   : nvidia-driver-545 - third-party non-free recommended
driver   : nvidia-driver-418-server - distro non-free
...

 

10. If it complains that aplay is missing, as above, install it with "apt-get install alsa-utils".
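The entry flagged as recommended in the ubuntu-drivers devices output can be picked out with a one-line filter. This sketch assumes the "driver : <package> - ... recommended" line format shown above. recommended_driver is a hypothetical helper name.

```shell
# Read `ubuntu-drivers devices` output on stdin and print the package name
# of the first driver line that ends with the word "recommended".
recommended_driver() {
  awk '$1 == "driver" && $NF == "recommended" { print $3; exit }'
}

# Example (not executed here; 2>/dev/null hides the aplay traceback):
#   ubuntu-drivers devices 2>/dev/null | recommended_driver
```

The recommended entry is only a suggestion. On a GPU server one of the -server variants in the list may be the better fit, depending on which CUDA version the workloads need.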

 

11. Install NVIDIA driver 545, the recommended version.

root@DS-DEV-002:/home# apt-get install nvidia-driver-545 -y
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
  dctrl-tools dkms libnvidia-cfg1-545 libnvidia-common-545 libnvidia-compute-545 libnvidia-decode-545 libnvidia-encode-545 libnvidia-extra-545
  libnvidia-fbc1-545 libnvidia-gl-545 libvdpau1 libxnvctrl0 mesa-vdpau-drivers nvidia-compute-utils-545 nvidia-dkms-545 nvidia-kernel-common-545
  nvidia-kernel-source-545 nvidia-prime nvidia-settings nvidia-utils-545 pkg-config screen-resolution-extra vdpau-driver-all
  xserver-xorg-video-nvidia-545
Suggested packages:
  debtags menu libvdpau-va-gl1 nvidia-vdpau-driver nvidia-legacy-340xx-vdpau-driver nvidia-legacy-304xx-vdpau-driver
Recommended packages:
  libnvidia-compute-545:i386 libnvidia-decode-545:i386 libnvidia-encode-545:i386 libnvidia-fbc1-545:i386 libnvidia-gl-545:i386
The following NEW packages will be installed:
  dctrl-tools dkms libnvidia-cfg1-545 libnvidia-common-545 libnvidia-compute-545 libnvidia-decode-545 libnvidia-encode-545 libnvidia-extra-545
  libnvidia-fbc1-545 libnvidia-gl-545 libvdpau1 libxnvctrl0 mesa-vdpau-drivers nvidia-compute-utils-545 nvidia-dkms-545 nvidia-driver-545
  nvidia-kernel-common-545 nvidia-kernel-source-545 nvidia-prime nvidia-settings nvidia-utils-545 pkg-config screen-resolution-extra
  vdpau-driver-all xserver-xorg-video-nvidia-545
0 upgraded, 25 newly installed, 0 to remove and 113 not upgraded.
Need to get 291 MB of archives.
After this operation, 817 MB of additional disk space will be used.

...
...

 

12. Finally, restart Docker.

root@DS-DEV-002:/home# systemctl restart docker
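As a last check, it is worth confirming that nvidia-smi now reports a driver from the 545 branch. A minimal sketch, assuming the --query-gpu=driver_version query of nvidia-smi is available on this driver. driver_is_branch is a hypothetical helper name.

```shell
# Succeed when the version string in $1 belongs to the driver branch in $2,
# i.e. when it starts with "<branch>.".
driver_is_branch() {
  case "$1" in
    "$2".*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Example (not executed here):
#   v=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   driver_is_branch "$v" 545 && echo "driver OK: $v"
```

If nvidia-smi still cannot reach the driver at this point, the newly installed kernel module may not be loaded yet. Loading it with modprobe nvidia or rebooting the server are the usual next things to try.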