[Solved] nvidia-docker: Updating cpu-manager-policy=static causes NVML unknown error
✔️Accepted Answer
Unfortunately, this is a known issue. It was first reported here:
#515
The underlying issue is that libnvidia-container injects some devices and modifies some cgroups out-of-band of the container engine it is operating on behalf of when setting a container up for use with GPUs. This causes the internal state of the container engine to be out of sync with what has actually been set up for the container.
For example, if you do a docker inspect on a functioning GPU-enabled container today, you will see that its device list is empty, even though it clearly has the nvidia devices injected into it and the cgroup access to those devices is set up properly.
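For instance, a quick check along these lines (the container name is hypothetical) makes that mismatch visible:

```
docker inspect --format '{{json .HostConfig.Devices}}' my-gpu-container
# Typically prints null or [], even though the devices are present inside the container:
docker exec my-gpu-container ls -l /dev/nvidiactl /dev/nvidia0
```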
This has not been an issue until now because everything works fine at initial container creation time. These settings are modified by libnvidia-container only after a container has already been set up by docker, and no further updates to the cgroups are necessary.
The problem comes when some external entity hits docker's ContainerUpdate API (whether directly via the CLI or through an API call, as the CPUManager in Kubernetes does). When this API is invoked, docker resolves its empty device list to disk, essentially "undoing" what libnvidia-container had set up with regard to these devices.
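As a rough illustration of that path, the same effect can be triggered by hand with docker update on a GPU container (the container name is hypothetical and the exact error text may vary):

```
docker exec my-gpu-container nvidia-smi            # works right after creation
docker update --cpuset-cpus 0-3 my-gpu-container   # goes through docker's ContainerUpdate path
docker exec my-gpu-container nvidia-smi            # now fails, e.g. "Failed to initialize NVML: Unknown Error"
```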
We need to come up with a solution that allows libnvidia-container to take control of managing these devices on behalf of docker (or any container engine) while properly informing it, so that it can keep its internal state in sync.
Other Answers:
Any change made to kubernetes is always going to be a workaround. The real fix needs to come in libnvidia-container or docker, or some combination of both.
I've never tried LXD with Kubernetes, so I'm not in a position to say how well it would work or not. I do know that LXD still uses libnvidia-container under the hood, though, so it may exhibit the same problems.
Again, the underlying problem is that docker is not told about the devices that libnvidia-container injects into it, so if you come up with a workaround that updates docker's internal state with this information, that should be sufficient.
Note: to use this workaround you will need to use the new daemonset spec nvidia-device-plugin-compat-with-cpumanager.yml instead of the default one.
This spec does two things differently from the default one:
- It adds a new argument to the plugin executable: --pass-device-specs
- It launches the plugin as --privileged
If you don't want to use the --privileged flag, things will still "work" in the sense that pods with GPUs can run, but you will see the plugin restart any time a container with guaranteed CPUs from the CPUManager starts. If you are OK with this restart, then launching the daemonset as --privileged is not strictly necessary.
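A minimal deployment sketch, assuming the plugin currently runs as a daemonset named nvidia-device-plugin-daemonset in the kube-system namespace (both names are assumptions) and that you have the compat spec file locally:

```
# Remove the default plugin daemonset and deploy the CPUManager-compatible spec.
kubectl -n kube-system delete daemonset nvidia-device-plugin-daemonset
kubectl -n kube-system apply -f nvidia-device-plugin-compat-with-cpumanager.yml
```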
What happened:
After setting cpu-manager-policy=static in kubernetes, running nvidia-smi in a pod with a GPU reports an error.
Setting cpu-manager-policy=none in kubernetes does not cause this error.
Sometimes nvidia-smi does not report an error when the pod first runs, and about 10 seconds later running nvidia-smi reports an error.
Checking the cause of the error shows that it occurs when reading /dev/nvidiactl: Operation not permitted.
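One way to confirm this (an assumption about the diagnosis method, and it requires strace in the image) is to trace nvidia-smi inside the failing pod:

```
kubectl exec -it <failing-gpu-pod> -- strace -e trace=openat nvidia-smi 2>&1 | grep nvidiactl
# Look for a line like:
#   openat(AT_FDCWD, "/dev/nvidiactl", O_RDWR) = -1 EPERM (Operation not permitted)
```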
Update cpu-manager-policy to none and to static, and create one pod under each setting: test-gpu (nvidia-smi can be run) and test-gpu-err (running nvidia-smi reports an error).
Checking each pod's /sys/fs/cgroup/devices/devices.list shows a difference between the two lists:
test-gpu (nvidia-smi can be run)
test-gpu-err (running nvidia-smi reports an error)
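The comparison can be reproduced with kubectl exec against the two test pods (assuming cgroup v1 and the devices controller mounted at the usual path inside the containers):

```
kubectl exec test-gpu     -- cat /sys/fs/cgroup/devices/devices.list
kubectl exec test-gpu-err -- cat /sys/fs/cgroup/devices/devices.list
# In the working pod the list contains entries for the NVIDIA character devices
# (major number 195, e.g. "c 195:255 rwm" for /dev/nvidiactl); in the failing pod
# those entries are missing.
```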
So, after setting cpu-manager-policy=static in kubernetes, a pod with a GPU can run the nvidia-smi command for a short time. But some function that runs about once every 10 seconds modifies /sys/fs/cgroup/devices/devices.list so that the pod loses read and write access to /dev/nvidiactl (and probably other device files), which then causes the nvidia-smi error.
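To watch the flip happen, a simple polling loop inside a freshly started GPU pod (a sketch; it assumes a shell and grep are available in the image) shows the NVIDIA entries disappearing roughly 10 seconds after start:

```
kubectl exec -it test-gpu-err -- sh -c \
  'while true; do date; grep " 195:" /sys/fs/cgroup/devices/devices.list; sleep 2; done'
```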