KubeVirt Virtual Machine GPU Passthrough

Expose NVIDIA GPU to Kubernetes


K8s exposes default resources such as CPU and memory to Pods, but GPUs are not among them. However, K8s allows us to develop our own Device Plugin to expose additional resources. The general process is as follows:

[Figure: Device Plugin registration and resource allocation workflow]

By following the Device Plugin documentation, you can develop a plugin that exposes the desired resource to Pods in K8s.
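
For orientation, here is a heavily trimmed Go sketch of the two core pieces a device plugin implements against the k8s.io/kubelet deviceplugin v1beta1 API: registering with kubelet, and advertising devices via ListAndWatch. The type name, socket name, and device IDs are illustrative placeholders, not the actual kubevirt-gpu-device-plugin code.

package gpuplugin

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// gpuPlugin implements the DevicePlugin gRPC service; fields omitted here.
type gpuPlugin struct{}

// registerWithKubelet announces the plugin's socket and resource name to
// kubelet over the Registration gRPC service on kubelet.sock.
func registerWithKubelet(socketName, resourceName string) error {
	conn, err := grpc.Dial("unix://"+pluginapi.KubeletSocket,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return err
	}
	defer conn.Close()

	client := pluginapi.NewRegistrationClient(conn)
	_, err = client.Register(context.Background(), &pluginapi.RegisterRequest{
		Version:      pluginapi.Version,
		Endpoint:     socketName,   // file name under /var/lib/kubelet/device-plugins/
		ResourceName: resourceName, // e.g. "nvidia.com/TU106_GEFORCE_RTX_2060_REV__A"
	})
	return err
}

// ListAndWatch advertises the devices this plugin manages; kubelet treats the
// IDs as opaque strings (here, hypothetical host PCI addresses).
func (p *gpuPlugin) ListAndWatch(_ *pluginapi.Empty, stream pluginapi.DevicePlugin_ListAndWatchServer) error {
	devices := []*pluginapi.Device{
		{ID: "0000:86:00.0", Health: pluginapi.Healthy},
		{ID: "0000:af:00.0", Health: pluginapi.Healthy},
	}
	for {
		if err := stream.Send(&pluginapi.ListAndWatchResponse{Devices: devices}); err != nil {
			return err
		}
		time.Sleep(10 * time.Second) // a real plugin resends when device health changes
	}
}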

The kubevirt-gpu-device-plugin is such a plugin. It is deployed as a DaemonSet and exposes NVIDIA GPUs to kubelet as allocatable resources. After installing this nvidia-kubevirt-gpu-dp in the cluster, we can see the following:

root@node01:~# kubectl describe node node01
Name:               node01
Roles:              worker
CreationTimestamp:  Thu, 24 Nov 2022 15:13:21 +0800
Addresses:
  InternalIP:  172.16.33.137
  Hostname:    node01
Capacity:
  cpu:                                       72
  ephemeral-storage:                         921300812Ki
  hugepages-1Gi:                             0
  hugepages-2Mi:                             0
  memory:                                    131564868Ki
  nvidia.com/TU106_GEFORCE_RTX_2060_REV__A:  4
  pods:                                      110
Allocatable:
  cpu:                                       71600m
  ephemeral-storage:                         921300812Ki
  hugepages-1Gi:                             0
  hugepages-2Mi:                             0
  memory:                                    127462015491
  nvidia.com/TU106_GEFORCE_RTX_2060_REV__A:  4
  pods:                                      110
...

In the output above, Capacity shows that node01 has 4 nvidia.com/TU106_GEFORCE_RTX_2060_REV__A GPUs in total, while Allocatable shows that 4 of them are available for allocation to Pods on node01.

Once this plugin's Pod is running, a socket file named kubevirt-TU106_GEFORCE_RTX_2060_REV__A.sock is added to the /var/lib/kubelet/device-plugins/ directory on each node.

root@node01:~# ll /var/lib/kubelet/device-plugins/
total 44
drwxr-xr-x 2 root root  4096 Dec  8 19:54 ./
drwx------ 8 root root  4096 Nov 24 15:13 ../
-rw-r--r-- 1 root root     0 Dec  5 09:13 DEPRECATION
-rw------- 1 root root 35839 Dec  8 19:54 kubelet_internal_checkpoint
srwxr-xr-x 1 root root     0 Dec  5 09:13 kubelet.sock=
srwxr-xr-x 1 root root     0 Dec  8 19:54 kubevirt-kvm.sock=
srwxr-xr-x 1 root root     0 Dec  8 19:54 kubevirt-sev.sock=
srwxr-xr-x 1 root root     0 Dec  8 19:52 kubevirt-TU106_GEFORCE_RTX_2060_REV__A.sock=
srwxr-xr-x 1 root root     0 Dec  8 19:54 kubevirt-tun.sock=
srwxr-xr-x 1 root root     0 Dec  8 19:54 kubevirt-vhost-net.sock=

Kubelet communicates with the plugin over this socket to register it, allocate resources, and so on. This way, when we create a Pod, we can request the nvidia.com/TU106_GEFORCE_RTX_2060_REV__A resource in the Pod's resource requests/limits, as sketched below.
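
As an illustration, a Pod requesting one such GPU can be built with the Kubernetes Go client types like this (the Pod name and image are placeholders); the same request can equally be written directly as Pod YAML.

package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// gpuResource is the extended resource name exposed by the device plugin.
const gpuResource = corev1.ResourceName("nvidia.com/TU106_GEFORCE_RTX_2060_REV__A")

// gpuPod returns a Pod spec that asks the scheduler for one GPU. Extended
// resources must be whole integers, so the quantity here is simply "1".
func gpuPod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "gpu-consumer"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "app",
				Image: "ubuntu:22.04", // placeholder image
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						gpuResource: resource.MustParse("1"),
					},
				},
			}},
		},
	}
}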

Note 1: Extended resources can only be requested in whole integers; nvidia-kubevirt-gpu-dp does not support fractional units like CPU's 1000m = 1 core / 500m = 0.5 core. Here, "1" simply means one GPU.

Note 2: Before deploying nvidia-kubevirt-gpu-dp, we need to enable IOMMU via the kernel boot parameters in GRUB on the cluster nodes (along with the other related host setup) so that GPUs can be passed through to guests.

Note 3: The remaining sockets, such as kubevirt-kvm.sock, are created by the device plugins built into virt-handler.

KubeVirt VM uses GPU


We create a VM as follows, specifying the nvidia.com/TU106_GEFORCE_RTX_2060_REV__A GPU in its devices:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-with-gpu
  namespace: cxl
spec:
  runStrategy: RerunOnFailure
  template:
    spec:
      domain:
        cpu:
          cores: 1
        devices:
          disks:
          # ...
          gpus:
          - deviceName: nvidia.com/TU106_GEFORCE_RTX_2060_REV__A
            name: gpu0
        machine:
          type: q35
        resources:
          requests:
            memory: 1Gi
      networks:
      # ...
      volumes:
      # ...

The virt-controller watches for the CREATE of the VMI and constructs the corresponding virt-launcher Pod for it (see the linked source code). The final constructed virt-launcher Pod YAML looks like this:

apiVersion: v1
kind: Pod
metadata:
  labels:
    kubevirt.io: virt-launcher
    vm.kubevirt.io/name: vm-with-gpu
  name: virt-launcher-vm-with-gpu-vvl4q
  namespace: cxl
spec:
  automountServiceAccountToken: false
  containers:
  - command:
    - ...
    image: quay.io/kubevirt/virt-launcher:v0.54.0
    imagePullPolicy: IfNotPresent
    name: compute
    resources:
      limits:
        devices.kubevirt.io/kvm: "1"
        devices.kubevirt.io/tun: "1"
        devices.kubevirt.io/vhost-net: "1"
        nvidia.com/TU106_GEFORCE_RTX_2060_REV__A: "1"
      requests:
        devices.kubevirt.io/kvm: "1"
        devices.kubevirt.io/tun: "1"
        devices.kubevirt.io/vhost-net: "1"
        cpu: 100m
        ephemeral-storage: 50M
        memory: 2262Mi
        nvidia.com/TU106_GEFORCE_RTX_2060_REV__A: "1"
...
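
The GPU entries in the limits and requests above are added by virt-controller when it renders the launcher Pod from the VMI spec. A simplified sketch of that step (not the actual KubeVirt source, which also handles host devices, VFIO mounts, and more) looks roughly like this:

package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	virtv1 "kubevirt.io/api/core/v1"
)

// addGPUResources copies every GPU declared in the VMI spec into the compute
// container's requests/limits, one unit per GPU, so that kubelet asks the
// device plugin owning that resource name to allocate a device.
// (A real implementation would sum duplicates of the same deviceName.)
func addGPUResources(vmi *virtv1.VirtualMachineInstance, res *corev1.ResourceRequirements) {
	if res.Limits == nil {
		res.Limits = corev1.ResourceList{}
	}
	if res.Requests == nil {
		res.Requests = corev1.ResourceList{}
	}
	for _, gpu := range vmi.Spec.Domain.Devices.GPUs {
		name := corev1.ResourceName(gpu.DeviceName) // e.g. nvidia.com/TU106_GEFORCE_RTX_2060_REV__A
		res.Limits[name] = resource.MustParse("1")
		res.Requests[name] = resource.MustParse("1")
	}
}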

Assuming the VM is scheduled to node05, which has 4 nvidia.com/TU106_GEFORCE_RTX_2060_REV__A GPUs, which of these GPUs will KubeVirt allocate to the VM?

When the container of the VM's Pod starts, kubelet calls the Allocate() interface of nvidia-kubevirt-gpu-dp. This interface must write the allocated GPU's device address into the Envs field of its response, which ultimately shows up in the container's environment variables:

root@node01:~# kubectl exec -it virt-launcher-vm-with-gpu-vvl4q -n cxl -- bash
bash-4.4# env
PCI_RESOURCE_NVIDIA_COM_TU106_GEFORCE_RTX_2060_REV__A=0000:86:00.0

If multiple GPUs of the same model are requested, the addresses in the environment variable will be separated by commas:

root@node01:~# kubectl exec -it virt-launcher-vm-with-gpu-vvl4q -n cxl -- bash
bash-4.4# env
PCI_RESOURCE_NVIDIA_COM_TU106_GEFORCE_RTX_2060_REV__A=0000:86:00.0,0000:af:00.0

In other words, which GPU the VM gets is decided by the Allocate logic of the device plugin.
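
Continuing the earlier device-plugin sketch, a minimal Allocate implementation with this behavior could look like the following (illustrative only, not the actual nvidia-kubevirt-gpu-dp code): the device IDs picked by kubelet are host PCI addresses, and they are joined into a single environment variable. A real plugin would additionally pass the /dev/vfio device files through to the container via the Devices field of the response.

package gpuplugin

import (
	"context"
	"strings"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// The env var name is derived from the resource name; this constant mirrors
// the variable observed in the launcher Pod above.
const pciResourceEnv = "PCI_RESOURCE_NVIDIA_COM_TU106_GEFORCE_RTX_2060_REV__A"

// Allocate is called by kubelet with the device IDs it picked for each
// container. The plugin answers with env vars exposing the chosen host PCI
// addresses to the container.
func (p *gpuPlugin) Allocate(ctx context.Context, req *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	resp := &pluginapi.AllocateResponse{}
	for _, containerReq := range req.ContainerRequests {
		// DevicesIDs are the IDs advertised in ListAndWatch, i.e. PCI addresses
		// such as "0000:86:00.0"; multiple GPUs become a comma-separated list.
		resp.ContainerResponses = append(resp.ContainerResponses, &pluginapi.ContainerAllocateResponse{
			Envs: map[string]string{
				pciResourceEnv: strings.Join(containerReq.DevicesIDs, ","),
			},
		})
	}
	return resp, nil
}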

Once the virt-launcher Pod is running, it constructs the libvirt domain XML. At that point it reads the environment variable from the Pod's env and turns it into the corresponding XML element, as follows:

<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x86' slot='0x00' function='0x0'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
  <alias name='ua-gpu0'/>
</hostdev>

This way, libvirt passes the host's PCI device (i.e., the GPU) straight through to the guest.
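
Conceptually, virt-launcher only needs to split the address from that environment variable into its domain/bus/slot/function parts to fill in the <source> element above. A simplified Go sketch of that conversion (illustrative, not KubeVirt's actual converter code):

package example

import (
	"fmt"
	"os"
	"strings"
)

// HostPCIAddress holds the four parts of an address like "0000:86:00.0".
type HostPCIAddress struct {
	Domain, Bus, Slot, Function string
}

// gpuAddressesFromEnv reads the comma-separated addresses written by the
// device plugin, e.g. PCI_RESOURCE_NVIDIA_COM_TU106_GEFORCE_RTX_2060_REV__A.
func gpuAddressesFromEnv(envName string) []string {
	v := os.Getenv(envName)
	if v == "" {
		return nil
	}
	return strings.Split(v, ",")
}

// parsePCIAddress splits "dddd:bb:ss.f" into its components so they can be
// rendered into <address domain=... bus=... slot=... function=.../>.
func parsePCIAddress(addr string) (HostPCIAddress, error) {
	parts := strings.Split(addr, ":") // ["0000", "86", "00.0"]
	if len(parts) != 3 {
		return HostPCIAddress{}, fmt.Errorf("malformed PCI address %q", addr)
	}
	slotFn := strings.Split(parts[2], ".") // ["00", "0"]
	if len(slotFn) != 2 {
		return HostPCIAddress{}, fmt.Errorf("malformed PCI address %q", addr)
	}
	return HostPCIAddress{Domain: parts[0], Bus: parts[1], Slot: slotFn[0], Function: slotFn[1]}, nil
}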

KubeVirt permitted host devices


By default, all host devices can be used by virtual machines, but KubeVirt allows you to specify in the KubeVirt CR which host devices VMs are allowed to use, as follows:

...
configuration:
  permittedHostDevices:
    pciHostDevices:
    - pciVendorSelector: "10DE:1F08"
      resourceName: "nvidia.com/GeForce RTX 2060 Rev. A"
      externalResourceProvider: true
    - pciVendorSelector: "8086:6F54"
      resourceName: "intel.com/qat"
    mediatedDevices:
    - mdevNameSelector: "GRID T4-1Q"
      resourceName: "nvidia.com/GRID_T4-1Q"

With this configuration in place, the GPU you want to allocate to a virtual machine must match one of the selectors above.

The pciVendorSelector consists of a vendorID (manufacturer) and a productID (product): 10DE is NVIDIA, and 1F08 is the GeForce RTX 2060 Rev. A (see https://pci-ids.ucw.cz/read/PC/10de). Listing it here means KubeVirt permits this host device to be used by virtual machines.

For this model of GPU, externalResourceProvider: true means that an external device plugin (here, nvidia-kubevirt-gpu-dp) takes over the device. If you instead set externalResourceProvider: false, the device plugin manager inside virt-handler takes over the device and starts a device plugin for it itself.
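
To relate a pciVendorSelector to actual hardware, the vendor and product IDs can be read directly from sysfs. The Go snippet below is a simplified illustration of that matching (virt-handler's real discovery code does more, such as checking IOMMU groups and driver binding); the paths are standard Linux sysfs paths.

package example

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// devicesMatchingSelector walks /sys/bus/pci/devices and returns the PCI
// addresses whose vendor:product IDs match a selector such as "10DE:1F08".
func devicesMatchingSelector(selector string) ([]string, error) {
	want := strings.ToLower(selector) // sysfs reports lowercase hex
	entries, err := os.ReadDir("/sys/bus/pci/devices")
	if err != nil {
		return nil, err
	}
	var matches []string
	for _, e := range entries {
		base := filepath.Join("/sys/bus/pci/devices", e.Name())
		vendor, err1 := os.ReadFile(filepath.Join(base, "vendor")) // e.g. "0x10de\n"
		device, err2 := os.ReadFile(filepath.Join(base, "device")) // e.g. "0x1f08\n"
		if err1 != nil || err2 != nil {
			continue
		}
		got := fmt.Sprintf("%s:%s",
			strings.TrimPrefix(strings.TrimSpace(string(vendor)), "0x"),
			strings.TrimPrefix(strings.TrimSpace(string(device)), "0x"))
		if got == want {
			matches = append(matches, e.Name()) // e.Name() is the PCI address, e.g. 0000:86:00.0
		}
	}
	return matches, nil
}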
