[Solved] nvidia-docker: NVIDIA-SMI couldn't find libnvidia-ml.so library in your system

1. Issue or feature description

I receive the error "NVIDIA-SMI couldn't find libnvidia-ml.so library in your system" when running nvidia-smi within a container. I'm sure the driver is installed correctly, as I get the correct output from nvidia-smi when it is run on the host. Running ldconfig within the container (shown below) corrects this temporarily, until the container is updated.
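
For reference, the temporary workaround amounts to something like this (the container name is just a placeholder):

docker exec <container-name> ldconfig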

2. Steps to reproduce the issue

docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi

NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

3. Information to attach (optional if deemed irrelevant)

  • Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
  • Kernel version from uname -a

Linux openmediavault.local 5.3.0-0.bpo.2-amd64 #1 SMP Debian 5.3.9-2~bpo10+1 (2019-11-13) x86_64 GNU/Linux

  • Any relevant kernel output lines from dmesg
  • Driver information from nvidia-smi -a

==============NVSMI LOG==============

Timestamp : Mon Dec 23 17:11:55 2019
Driver Version : 440.44
CUDA Version : 10.2

Attached GPUs : 1
GPU 00000000:83:00.0
    Product Name : Quadro P2000
    Product Brand : Quadro
    Display Mode : Disabled
    Display Active : Disabled
    Persistence Mode : Disabled
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : 1422019086300
    GPU UUID : GPU-67caad7d-2744-4ec8-7a48-e17278af1025
    Minor Number : 0
    VBIOS Version : 86.06.74.00.01
    MultiGPU Board : No
    Board ID : 0x8300
    GPU Part Number : 900-5G410-1700-000
    Inforom Version
        Image Version : G410.0502.00.02
        OEM Object : 1.1
        ECC Object : N/A
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GPU Virtualization Mode
        Virtualization Mode : None
        Host VGPU Mode : N/A
    IBMNPU
        Relaxed Ordering Mode : N/A
    PCI
        Bus : 0x83
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x1C3010DE
        Bus Id : 00000000:83:00.0
        Sub System Id : 0x11B310DE
        GPU Link Info
            PCIe Generation
                Max : 3
                Current : 3
            Link Width
                Max : 16x
                Current : 16x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays Since Reset : 0
        Replay Number Rollovers : 0
        Tx Throughput : 0 KB/s
        Rx Throughput : 0 KB/s
    Fan Speed : 64 %
    Performance State : P0
    Clocks Throttle Reasons
        Idle : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
        HW Thermal Slowdown : Not Active
        HW Power Brake Slowdown : Not Active
        Sync Boost : Not Active
        SW Thermal Slowdown : Not Active
        Display Clock Setting : Not Active
    FB Memory Usage
        Total : 5059 MiB
        Used : 0 MiB
        Free : 5059 MiB
    BAR1 Memory Usage
        Total : 256 MiB
        Used : 2 MiB
        Free : 254 MiB
    Compute Mode : Default
    Utilization
        Gpu : 2 %
        Memory : 0 %
        Encoder : 0 %
        Decoder : 0 %
    Encoder Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    FBC Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    Ecc Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
        Aggregate
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
    Retired Pages
        Single Bit ECC : N/A
        Double Bit ECC : N/A
        Pending Page Blacklist : N/A
    Temperature
        GPU Current Temp : 35 C
        GPU Shutdown Temp : 104 C
        GPU Slowdown Temp : 101 C
        GPU Max Operating Temp : N/A
        Memory Current Temp : N/A
        Memory Max Operating Temp : N/A
    Power Readings
        Power Management : Supported
        Power Draw : 17.71 W
        Power Limit : 75.00 W
        Default Power Limit : 75.00 W
        Enforced Power Limit : 75.00 W
        Min Power Limit : 75.00 W
        Max Power Limit : 75.00 W
    Clocks
        Graphics : 1075 MHz
        SM : 1075 MHz
        Memory : 3499 MHz
        Video : 999 MHz
    Applications Clocks
        Graphics : 1075 MHz
        Memory : 3504 MHz
    Default Applications Clocks
        Graphics : 1075 MHz
        Memory : 3504 MHz
    Max Clocks
        Graphics : 1721 MHz
        SM : 1721 MHz
        Memory : 3504 MHz
        Video : 1556 MHz
    Max Customer Boost Clocks
        Graphics : 1721 MHz
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
    Processes : None

  • Docker version from docker version

Client: Docker Engine - Community
 Version:           19.03.5
 API version:       1.40
 Go version:        go1.12.12
 Git commit:        633a0ea838
 Built:             Wed Nov 13 07:25:38 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.5
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.12
  Git commit:       633a0ea838
  Built:            Wed Nov 13 07:24:09 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.10
  GitCommit:        b34a5c8af56e510852c35414db4c1f4fa6172339
 runc:
  Version:          1.0.0-rc8+dev
  GitCommit:        3e425f80a8c931f88e6d94a8c831b9d5aa481657
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'

||/ Name                          Version      Architecture Description
+++-=============================-============-============-=====================================================
ii  libnvidia-container-tools     1.0.5-1      amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64    1.0.5-1      amd64        NVIDIA container runtime library
ii  nvidia-container-runtime      3.1.4-1      amd64        NVIDIA container runtime
un  nvidia-container-runtime-hook                           (no description available)
ii  nvidia-container-toolkit      1.0.5-1      amd64        NVIDIA container runtime hook

  • NVIDIA container library version from nvidia-container-cli -V

version: 1.0.5
build date: 2019-09-06T16:59+00:00
build revision: 13b836390888f7b7c7dca115d16d7e28ab15a836
build compiler: x86_64-linux-gnu-gcc-8 8.3.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

  • NVIDIA container library logs (see troubleshooting)
  • Docker command, image and tag used

docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi

29 Answers

✔️ Accepted Answer

I'm hitting this as well on a very similar setup: Debian 10 Buster with kernel 5.3.9 from backports and identical versions of the nvidia-container* packages, but a different NVIDIA driver version (430.64). This issue also appears to be a duplicate of #854, which was closed without being resolved.

The error actually seems to stem from a missing ldconfig binary, which is odd because it is definitely present in the container's /sbin directory:

root@banshee:/var/log# docker run --rm --gpus=all nvidia/cuda:9.2-base ls -la /sbin/ | grep ldconfig
-rwxr-xr-x  1 root root       387 Feb  5  2019 ldconfig
-rwxr-xr-x  1 root root   1000608 Feb  5  2019 ldconfig.real

With debugging enabled, this error does get logged to nvidia-container-toolkit.log:

I0105 19:55:43.487585 13429 nvc_ldcache.c:353] executing /sbin/ldconfig from host at /var/lib/docker/devicemapper/mnt/c73813553175c31ea9be80cb4c9ded21edf532a67639988fa8ee78c2a632c777/rootfs
E0105 19:55:43.488469 1 nvc_ldcache.c:384] could not start /sbin/ldconfig: process execution failed: no such file or directory
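
If I remember correctly, this debug log is enabled by uncommenting the debug line in the [nvidia-container-cli] section of the /etc/nvidia-container-runtime/config.toml file discussed below; the path here is the package default:

[nvidia-container-cli]
debug = "/var/log/nvidia-container-toolkit.log"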

This led me to another solution: in /etc/nvidia-container-runtime/config.toml, ldconfig is set to "@/sbin/ldconfig" by default (as far as I can tell, the "@" prefix tells libnvidia-container to run the host's ldconfig against the container's root filesystem). For some reason this default does not work and produces the error above:

root@banshee:/var/log# docker run --rm --gpus=all nvidia/cuda:9.2-base nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

Changing the ldconfig path to "/sbin/ldconfig" does indeed fix the problem:

root@banshee:/var/log# docker run --rm --gpus=all nvidia/cuda:9.2-base nvidia-smi
Sun Jan  5 20:39:45 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 970     On   | 00000000:01:00.0  On |                  N/A |
| 32%   39C    P8    16W / 170W |    422MiB /  4038MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

I am, however, fairly sure that the default worked for me before with NVIDIA driver version 418.74, but I cannot confirm that the driver version is the cause of the problem here.
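
For anyone scripting this, a minimal sketch of the workaround above (assuming the stock config location and the default value; back up the file first, and simply re-run the container afterwards, since the runtime hook re-reads the config on each container start):

sudo sed -i 's|ldconfig = "@/sbin/ldconfig"|ldconfig = "/sbin/ldconfig"|' /etc/nvidia-container-runtime/config.toml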

Other Answers:

lvh · 22

(I can confirm I'm getting the same behavior as @brycelelbach.)
