Expose NVIDIA GPU to Kubernetes#
K8s provides Pods with built-in resources such as memory and CPU, but GPUs are not among them. However, K8s allows us to develop our own Device Plugin to expose additional resources. The general process is simple: implement the interface described in the Device Plugin documentation, and the resource you expose becomes allocatable to Pods in K8s.
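In practice, a device plugin is usually shipped as a DaemonSet that runs on every node and mounts /var/lib/kubelet/device-plugins/ from the host, so the plugin can register its gRPC socket with the kubelet. The following is only a minimal sketch of that pattern (the names and image are illustrative, not the real kubevirt-gpu-device-plugin manifest):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-gpu-device-plugin          # illustrative name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: example-gpu-device-plugin
  template:
    metadata:
      labels:
        app: example-gpu-device-plugin
    spec:
      containers:
      - name: device-plugin
        image: registry.example.com/gpu-device-plugin:latest   # placeholder image
        securityContext:
          privileged: true                  # needs access to host devices
        volumeMounts:
        - name: device-plugins
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugins
        hostPath:
          path: /var/lib/kubelet/device-plugins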
The kubevirt-gpu-device-plugin is one such plugin: deployed as a DaemonSet, it exposes NVIDIA GPUs to the kubelet as allocatable resources. After installing this nvidia-kubevirt-gpu-dp in the cluster, the effect looks like this:
root@node01:~# kubectl describe node node01
Name:               node01
Roles:              worker
CreationTimestamp:  Thu, 24 Nov 2022 15:13:21 +0800
Addresses:
  InternalIP:  172.16.33.137
  Hostname:    node01
Capacity:
  cpu:                                       72
  ephemeral-storage:                         921300812Ki
  hugepages-1Gi:                             0
  hugepages-2Mi:                             0
  memory:                                    131564868Ki
  nvidia.com/TU106_GEFORCE_RTX_2060_REV__A:  4
  pods:                                      110
Allocatable:
  cpu:                                       71600m
  ephemeral-storage:                         921300812Ki
  hugepages-1Gi:                             0
  hugepages-2Mi:                             0
  memory:                                    127462015491
  nvidia.com/TU106_GEFORCE_RTX_2060_REV__A:  4
  pods:                                      110
...
In the output above, Capacity shows that node01 has 4 nvidia.com/TU106_GEFORCE_RTX_2060_REV__A GPUs, while Allocatable shows that all 4 of them are available for allocation to Pods on node01.
Once this plugin's Pod is running, a socket file named kubevirt-TU106_GEFORCE_RTX_2060_REV__A.sock appears in the /var/lib/kubelet/device-plugins/ directory on each node (the trailing = in the listing below is just the ls marker for sockets).
root@node01:~# ll /var/lib/kubelet/device-plugins/
total 44
drwxr-xr-x 2 root root 4096 Dec 8 19:54 ./
drwx------ 8 root root 4096 Nov 24 15:13 ../
-rw-r--r-- 1 root root 0 Dec 5 09:13 DEPRECATION
-rw------- 1 root root 35839 Dec 8 19:54 kubelet_internal_checkpoint
srwxr-xr-x 1 root root 0 Dec 5 09:13 kubelet.sock=
srwxr-xr-x 1 root root 0 Dec 8 19:54 kubevirt-kvm.sock=
srwxr-xr-x 1 root root 0 Dec 8 19:54 kubevirt-sev.sock=
srwxr-xr-x 1 root root 0 Dec 8 19:52 kubevirt-TU106_GEFORCE_RTX_2060_REV__A.sock=
srwxr-xr-x 1 root root 0 Dec 8 19:54 kubevirt-tun.sock=
srwxr-xr-x 1 root root 0 Dec 8 19:54 kubevirt-vhost-net.sock=
The plugin registers itself with the kubelet (via kubelet.sock), and the kubelet then talks to the plugin over this socket to list devices, allocate them, and so on. As a result, when we create a Pod we can request the nvidia.com/TU106_GEFORCE_RTX_2060_REV__A resource in its resource requests/limits.
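For illustration only, this is what requesting the extended resource from an ordinary Pod would look like (the Pod name and image are placeholders; with KubeVirt the request is actually placed by the virt-launcher Pod, as shown later):
apiVersion: v1
kind: Pod
metadata:
  name: gpu-consumer                         # placeholder name
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
    resources:
      requests:
        nvidia.com/TU106_GEFORCE_RTX_2060_REV__A: "1"
      limits:
        nvidia.com/TU106_GEFORCE_RTX_2060_REV__A: "1"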
Note 1: Extended resources can only be requested in integer amounts; nvidia-kubevirt-gpu-dp offers nothing like CPU's 1000m = 1 core or 500m = 0.5 core. Here "1" simply means one whole GPU.
Note 2: Before deploying nvidia-kubevirt-gpu-dp, the cluster nodes must have the IOMMU enabled in their GRUB kernel parameters (e.g. intel_iommu=on) and related settings in place, so the hosts are capable of passing GPUs through to guests.
Note 3: The remaining sockets such as kubevirt-kvm.sock are created by the device plugins that virt-handler runs itself.
Using the GPU in a KubeVirt VM#
Create a VM as follows, requesting the nvidia.com/TU106_GEFORCE_RTX_2060_REV__A GPU under devices.gpus:
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-with-gpu
  namespace: cxl
spec:
  runStrategy: RerunOnFailure
  template:
    spec:
      domain:
        cpu:
          cores: 1
        devices:
          disks:
          # ...
          gpus:
          - deviceName: nvidia.com/TU106_GEFORCE_RTX_2060_REV__A
            name: gpu0
        machine:
          type: q35
        resources:
          requests:
            memory: 1Gi
      networks:
      # ...
      volumes:
      # ...
The virt-controller watches for the CREATE of the VMI and builds the corresponding Pod (virt-launcher) for it; the relevant rendering logic lives in the virt-controller source. The resulting virt-launcher Pod YAML looks like this:
apiVersion: v1
kind: Pod
metadata:
  labels:
    kubevirt.io: virt-launcher
    vm.kubevirt.io/name: vm-with-gpu
  name: virt-launcher-vm-with-gpu-vvl4q
  namespace: cxl
spec:
  automountServiceAccountToken: false
  containers:
  - command:
    - ...
    image: quay.io/kubevirt/virt-launcher:v0.54.0
    imagePullPolicy: IfNotPresent
    name: compute
    resources:
      limits:
        devices.kubevirt.io/kvm: "1"
        devices.kubevirt.io/tun: "1"
        devices.kubevirt.io/vhost-net: "1"
        nvidia.com/TU106_GEFORCE_RTX_2060_REV__A: "1"
      requests:
        devices.kubevirt.io/kvm: "1"
        devices.kubevirt.io/tun: "1"
        devices.kubevirt.io/vhost-net: "1"
        cpu: 100m
        ephemeral-storage: 50M
        memory: 2262Mi
        nvidia.com/TU106_GEFORCE_RTX_2060_REV__A: "1"
  ...
Suppose the VM is scheduled to node05, which has 4 nvidia.com/TU106_GEFORCE_RTX_2060_REV__A GPUs. Which of the four will the VM actually get?
When the VM Pod's container is about to start, the kubelet calls the Allocate() interface of nvidia-kubevirt-gpu-dp. The plugin writes the PCI address of the GPU it chose into the env field of the Allocate response, and that value ends up as an environment variable inside the container:
root@node01:~# kubectl exec -it virt-launcher-vm-with-gpu-vvl4q -n cxl -- bash
bash-4.4# env
PCI_RESOURCE_NVIDIA_COM_TU106_GEFORCE_RTX_2060_REV__A=0000:86:00.0
If multiple GPUs of the same model are requested, the addresses in the environment variable will be separated by commas:
root@node01:~# kubectl exec -it virt-launcher-vm-with-gpu-vvl4q -n cxl -- bash
bash-4.4# env
PCI_RESOURCE_NVIDIA_COM_TU106_GEFORCE_RTX_2060_REV__A=0000:86:00.0,0000:af:00.0
In other words, which physical GPU a VM receives is decided by the device plugin's Allocate() logic.
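On the VM side, asking for more than one GPU of the same model just means adding more entries under devices.gpus in the VMI spec; virt-controller then raises the Pod's resource request accordingly. A sketch (gpu1 is an illustrative name):
spec:
  template:
    spec:
      domain:
        devices:
          gpus:
          - deviceName: nvidia.com/TU106_GEFORCE_RTX_2060_REV__A
            name: gpu0
          - deviceName: nvidia.com/TU106_GEFORCE_RTX_2060_REV__A
            name: gpu1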
Once the virt-launcher Pod is running, it builds the libvirt domain XML; at that point it reads the PCI address from the Pod's environment variable and turns it into the corresponding hostdev element, as follows:
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
  <alias name='ua-gpu0'/>
</hostdev>
This way, libvirt can directly use the host's PCI device (i.e., the GPU).
KubeVirt permitted host devices#
Host devices are not exposed to virtual machines unconditionally; KubeVirt requires them to be allowlisted in the permittedHostDevices section of the KubeVirt CR, which specifies exactly which host devices VMs may use, as follows:
...
configuration:
  permittedHostDevices:
    pciHostDevices:
    - pciVendorSelector: "10DE:1F08"
      resourceName: "nvidia.com/GeForce RTX 2060 Rev. A"
      externalResourceProvider: true
    - pciVendorSelector: "8086:6F54"
      resourceName: "intel.com/qat"
    mediatedDevices:
    - mdevNameSelector: "GRID T4-1Q"
      resourceName: "nvidia.com/GRID_T4-1Q"
With this configuration in place, a GPU can only be assigned to a virtual machine if it matches one of the selectors above.
A pciVendorSelector is made up of a vendorID (the manufacturer) and a productID (the product): 10DE is NVIDIA's vendor ID, and 1F08 is the product ID of the GeForce RTX 2060 Rev. A (see https://pci-ids.ucw.cz/read/PC/10de). Listing this pair means KubeVirt allows that host device to be used by virtual machines.
For this GPU model, externalResourceProvider: true means that an external device plugin (here, nvidia-kubevirt-gpu-dp) takes over the device. If you set externalResourceProvider: false (the default) for a device, the device plugin manager inside virt-handler takes over instead and starts a device plugin for it.
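To connect this back to the VM spec: a permitted non-GPU PCI device, such as the intel.com/qat resource above, would be consumed from a VMI through devices.hostDevices, with deviceName set to the configured resourceName. A minimal sketch (the qat0 alias is illustrative):
spec:
  template:
    spec:
      domain:
        devices:
          hostDevices:
          - deviceName: intel.com/qat
            name: qat0               # illustrative alias for the device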