Solved: nvidia docker: Updating cpu-manager-policy=static causes NVML unknown error

What happened:

  1. After setting cpu-manager-policy=static in Kubernetes, running nvidia-smi in a pod with a GPU reports an error:

    Failed to initialize NVML: Unknown Error
  2. Setting cpu-manager-policy=none in Kubernetes does not cause this error.

  3. Sometimes nvidia-smi runs without error when the pod first starts, but running it again about 10 seconds later fails.

  4. Tracing the failure shows that opening /dev/nvidiactl fails with "Operation not permitted":

    strace -v -a 100 -s 1000 nvidia-smi
    close(3)                                                                                           = 0
    open("/dev/nvidiactl", O_RDWR)                                                                     = -1 EPERM (Operation not permitted)
    open("/dev/nvidiactl", O_RDONLY)                                                                   = -1 EPERM (Operation not permitted)
    fstat(1, {st_dev=makedev(0, 704), st_ino=4, st_mode=S_IFCHR|0620, st_nlink=1, st_uid=0, st_gid=5, st_blksize=1024, st_blocks=0, st_rdev=makedev(136, 1), st_atime=2019/04/23-17:35:28.678347231, st_mtime=2019/04/23-17:35:28.678347231, st_ctime=2019/04/23-17:33:09.682347235}) = 0
    write(1, "Failed to initialize NVML: Unknown Error\n", 41Failed to initialize NVML: Unknown Error
    )                                         = 41
    exit_group(255)                                                                                     = ?
    +++ exited with 255 +++
  5. Set the cpu-manager-policy to none and to static in turn, and create one pod under each setting, respectively: test-gpu (nvidia-smi runs) and test-gpu-err (nvidia-smi reports an error).

    1. Checking each pod's devices.list under /sys/fs/cgroup/devices shows that the two lists differ.

    2. test-gpu (nvidia-smi can be run)

      root@super8:/sys/fs/cgroup/devices/kubepods/besteffort/pod52c61ec9-65b5-11e9-8cd2-0cc47aea540c/caca989a1f8d1a8c87f67c04d2d63347a98f52d745c44e77895b3ca4dfd9b18f# cat devices.list 
      c 1:5 rwm
      c 1:3 rwm
      c 1:9 rwm
      c 1:8 rwm
      c 5:0 rwm
      c 5:1 rwm
      c *:* m
      b *:* m
      c 1:7 rwm
      c 136:* rwm
      c 5:2 rwm
      c 10:200 rwm
      c 195:255 rw
      c 195:3 rw
    3. test-gpu-err (Running nvidia-smi reports an error)

      root@super8:/sys/fs/cgroup/devices/kubepods/besteffort/podbfa294b1-65aa-11e9-8cd2-0cc47aea540c/771eb2c6d41fe48160000ad481702d09bdda5bfe49d613f96273412e177b449d# cat devices.list 
      c 1:5 rwm
      c 1:3 rwm
      c 1:9 rwm
      c 1:8 rwm
      c 5:0 rwm
      c 5:1 rwm
      c *:* m
      b *:* m
      c 1:7 rwm
      c 136:* rwm
      c 5:2 rwm
      c 10:200 rwm
  6. So, after setting cpu-manager-policy=static in Kubernetes, a pod with a GPU can run nvidia-smi for a short time. Then something that runs about once every 10 seconds modifies the pod's devices.list under /sys/fs/cgroup/devices, removing read and write access to /dev/nvidiactl (and presumably to other device files as well), which causes the nvidia-smi error.
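
The comparison in step 5 can be automated. The following is a minimal sketch, assuming the two devices.list dumps have already been collected as text. The function names are hypothetical, and the sample data is copied from the listings above. It reports the rules present in the working pod but absent from the failing one. On Linux, character device major 195 belongs to the NVIDIA driver, and 195:255 is /dev/nvidiactl.

```python
def parse_devices_list(text):
    """Parse cgroup v1 devices.list text into a set of (type, major:minor, access)."""
    rules = set()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        rules.add(tuple(parts))
    return rules


def missing_rules(working_text, failing_text):
    """Rules found in the working container but not in the failing one, sorted."""
    working = parse_devices_list(working_text)
    failing = parse_devices_list(failing_text)
    return sorted(working - failing)


# Rules shared by both pods, copied from the listings above.
COMMON = """\
c 1:5 rwm
c 1:3 rwm
c 1:9 rwm
c 1:8 rwm
c 5:0 rwm
c 5:1 rwm
c *:* m
b *:* m
c 1:7 rwm
c 136:* rwm
c 5:2 rwm
c 10:200 rwm
"""

WORKING = COMMON + "c 195:255 rw\nc 195:3 rw\n"  # test-gpu
FAILING = COMMON                                  # test-gpu-err

if __name__ == "__main__":
    for dev_type, dev_id, access in missing_rules(WORKING, FAILING):
        print(dev_type, dev_id, access)
```

For the two listings above, the only differences are the two NVIDIA entries (195:255 and 195:3), which matches the EPERM seen on /dev/nvidiactl in the strace output.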

13 Answers

✔️Accepted Answer

Unfortunately, this is a known issue. It was first reported here:

The underlying issue is that libnvidia-container injects some devices and modifies some cgroups out-of-band of the container engine it is operating on behalf of when setting a container up for use with GPUs. This causes the internal state of the container engine to be out of sync with what has actually been set up for the container.

For example, if you do a docker inspect on a functioning GPU-enabled container today, you will see that its device list is empty, even though it clearly has the nvidia devices injected into it and the cgroup access to those devices is set up properly.

This has not been an issue until now because everything works fine at initial container creation time. These settings are modified by libnvidia-container only after a container has already been set up by docker and no further updates to the cgroups are necessary.

The problem comes when some external entity hits docker's ContainerUpdate API (whether directly via the CLI or through an API call, as the CPUManager in Kubernetes does). When this API is invoked, docker resolves its empty device list to disk, essentially "undoing" what libnvidia-container had set up with regard to these devices.
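
The sequence described above can be illustrated with a toy model. This is only a sketch of the state mismatch, not docker's or libnvidia-container's actual code. The class `ToyEngine` and the functions `inject_out_of_band` and `simulate` are hypothetical names introduced for illustration.

```python
class ToyEngine:
    """Hypothetical stand-in for a container engine's device bookkeeping."""

    def __init__(self, default_rules):
        self.default_rules = list(default_rules)
        self.devices = []  # the engine's own device list (empty for GPU containers)
        self.cgroup = []   # what is actually written to the devices cgroup

    def sync_cgroup(self):
        # Both creation and a ContainerUpdate-style call rewrite the cgroup
        # from the engine's internal state only.
        self.cgroup = self.default_rules + self.devices


def inject_out_of_band(engine, rules):
    """Mimic an out-of-band edit: change the cgroup without telling the engine."""
    engine.cgroup = engine.cgroup + list(rules)


NVIDIA_RULES = ["c 195:255 rw", "c 195:3 rw"]


def simulate(tell_engine):
    """Return (GPU access before update, GPU access after update)."""
    engine = ToyEngine(["c 1:5 rwm", "c 1:3 rwm"])
    engine.sync_cgroup()                       # initial container creation
    inject_out_of_band(engine, NVIDIA_RULES)   # GPU devices added afterwards
    if tell_engine:
        engine.devices = list(NVIDIA_RULES)    # keep the engine's state in sync
    before = "c 195:255 rw" in engine.cgroup
    engine.sync_cgroup()                       # later update, e.g. from CPUManager
    after = "c 195:255 rw" in engine.cgroup
    return before, after


if __name__ == "__main__":
    print(simulate(tell_engine=False))
    print(simulate(tell_engine=True))
```

In this model the GPU rule disappears after the update only when the engine was never told about the injected devices, which is the mismatch the answer describes.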

We need to come up with a solution that allows libnvidia-container to take control of managing these devices on behalf of docker (or any container engine) while properly informing it so that it can keep its internal state in sync.

Other Answers:

Any change made to kubernetes is always going to be a workaround. The real fix needs to come in libnvidia-container or docker or some combination of both.

I've never tried LXD with Kubernetes, so I'm not in a position to say how well it would work or not. I do know that LXD still uses libnvidia-container under the hood though, so it may exhibit the same problems.

Again, the underlying problem is that docker is not told about the devices that libnvidia-container injects into it, so if you come up with a workaround that updates docker's internal state with this information, that should be sufficient.

Note, to use this workaround you will need to use the new daemonset spec nvidia-device-plugin-compat-with-cpumanager.yml instead of the default one.

This spec does two things differently from the default one:

  1. It passes a new argument, --pass-device-specs, to the plugin executable
  2. It launches the plugin as --privileged

If you don't want to use the --privileged flag, then things will still "work" in terms of allowing pods with GPUs to run, but you will see the plugin restart anytime a container with guaranteed CPUs from the CPUManager starts. If you are OK with this restart, then launching the daemonset as --privileged is not strictly necessary.
