From 2efdc51fc605db9c1cb34ad676bb0e1e002d9097 Mon Sep 17 00:00:00 2001
From: licheng <chengli@alauda.io>
Date: Wed, 24 Jun 2026 03:07:07 +0000
Subject: [PATCH 1/4] docs: add Physical GPU Passthrough for KubeVirt VMs
 solution (ACP-53455)

Add a KB Solution article on enabling physical GPU passthrough for
KubeVirt virtual machines on ACP via the NVIDIA GPU Operator cluster
plugin (vm-passthrough sandbox mode), covering prerequisites, plugin
installation through Marketplace > Cluster Plugins, KubeVirt
permittedHostDevices configuration, and VM creation.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
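A minimal sketch, assuming the `lspci -nn` line format shown in step 2 of "Configure KubeVirt", of deriving the uppercase `pciDeviceSelector` from that output. The function name `pci_selector_from_lspci` is hypothetical and is not part of the article or of any NVIDIA or KubeVirt tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the [vendor:device] ID from one `lspci -nn`
# line and print it in uppercase, as the article's note asks.
pci_selector_from_lspci() {
  printf '%s\n' "$1" \
    | grep -oE '\[[0-9a-fA-F]{4}:[0-9a-fA-F]{4}\]' \
    | tail -n 1 \
    | tr -d '[]' \
    | tr '[:lower:]' '[:upper:]'
}

# Example with the line quoted in the article:
pci_selector_from_lspci '04:00.0 3D controller [0302]: NVIDIA Corporation GK210GL [Tesla K80] [10de:102d] (rev a1)'
# prints 10DE:102D
```

The class code (`[0302]`) and the model name (`[Tesla K80]`) are skipped because only a bracketed pair of four hex digits separated by a colon matches.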
 ...gh_for_KubeVirt_Virtual_Machines_on_ACP.md | 216 ++++++++++++++++++
 1 file changed, 216 insertions(+)
 create mode 100644 docs/en/solutions/Physical_GPU_Passthrough_for_KubeVirt_Virtual_Machines_on_ACP.md
diff --git a/docs/en/solutions/Physical_GPU_Passthrough_for_KubeVirt_Virtual_Machines_on_ACP.md b/docs/en/solutions/Physical_GPU_Passthrough_for_KubeVirt_Virtual_Machines_on_ACP.md
new file mode 100644
index 00000000..0642c1d6
--- /dev/null
+++ b/docs/en/solutions/Physical_GPU_Passthrough_for_KubeVirt_Virtual_Machines_on_ACP.md
@@ -0,0 +1,216 @@
+---
+kind:
+   - Solution
+products:
+   - Alauda Container Platform
+ProductsVersion:
+   - 4.3.x
+---
+
+# Physical GPU Passthrough for KubeVirt Virtual Machines on ACP
+
+## Overview
+
+Physical GPU passthrough assigns a real Graphics Processing Unit (GPU) directly to a virtual machine (VM). The VM accesses the physical GPU without a virtualized graphics adapter, achieving graphics and compute performance close to bare metal.
+
+On Alauda Container Platform (ACP), passthrough is enabled by installing the **NVIDIA GPU Operator** cluster plugin in **sandbox / `vm-passthrough`** mode. In this mode the operator runs the `kubevirt-gpu-device-plugin` and `vfio-manager`, which bind eligible NVIDIA GPUs to the `vfio-pci` driver and advertise them as allocatable `nvidia.com/<device>` resources. KubeVirt then exposes those resources to VMs through `permittedHostDevices`.
+
+This document describes how to prepare the physical GPU passthrough environment on ACP.
+
+> **Version note:** Use the NVIDIA GPU Operator plugin package provided for your ACP release.
+
+## Constraints and Limitations
+
+- The host must support **IOMMU/VT-d**, and it must be enabled in both the firmware and the kernel.
+- Only **one physical GPU** can be assigned to each VM through the console (Physical GPU is an Alpha feature).
+- The GPU must be free of a host NVIDIA driver so it can be bound to `vfio-pci`.
+
+## Prerequisites
+
+- ACP installed, and the target cluster managed by ACP.
+- KubeVirt virtualization enabled on the target cluster (the `kubevirt-hyperconverged` HyperConverged resource exists in the `kubevirt` namespace).
+- At least one worker node equipped with a supported NVIDIA GPU.
+
+### Enabling IOMMU
+
+The procedure to enable IOMMU varies by operating system; refer to your OS documentation. The example below uses CentOS on an Intel host (the `intel_iommu=on` kernel parameter is Intel-specific). Run all commands in a terminal on the GPU node.
+
+1. Edit `/etc/default/grub` and add `intel_iommu=on iommu=pt` to `GRUB_CMDLINE_LINUX`:
+
+   ```
+   GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=centos/root rhgb quiet intel_iommu=on iommu=pt"
+   ```
+
+2. Regenerate `grub.cfg`:
+
+   ```
+   grub2-mkconfig -o /boot/grub2/grub.cfg
+   ```
+
+3. Restart the server.
+
+4. Confirm IOMMU is enabled. The output should contain `IOMMU enabled`:
+
+   ```
+   dmesg | grep -i iommu
+   ```
+
+### Uninstall the NVIDIA Driver
+
+Passthrough requires the GPU to use `vfio-pci`. If a host NVIDIA driver is already installed on the GPU node, uninstall it first.
+
+## Install the GPU Operator Cluster Plugin
+
+### Prerequisites
+
+1. Obtain the NVIDIA GPU Operator plugin installation package that matches the platform, and ensure its images are available in the cluster's image repository.
+2. Use the platform's application publishing capability to publish the NVIDIA GPU Operator plugin to the target cluster.
+
+### Procedure
+
+1. In the left navigation bar, go to **Marketplace** > **Cluster Plugins**.
+2. Select the target cluster.
+3. Click the action button next to the **NVIDIA GPU Operator** plugin, then select **Install**.
+
+The plugin installs in `vm-passthrough` sandbox mode by default. The platform automatically renders cluster-specific values (such as the image registry address), so no extra configuration is required.
+
+### Verify the Installation
+
+1. Confirm the operator and its sandbox operands are running. The `ClusterPolicy` should report `ready`:
+
+   ```bash
+   kubectl get clusterpolicy
+   # NAME             STATUS   AGE
+   # cluster-policy   ready    1m
+
+   kubectl get pods -A | grep -E 'gpu-operator|nvidia-(vfio-manager|sandbox)'
+   ```
+
+   On a node with a supported GPU, the `nvidia-vfio-manager`, `nvidia-sandbox-device-plugin-daemonset`, and `nvidia-sandbox-validator` pods reach the `Running` state.
+
+   > If `ClusterPolicy` reports `NoGPUNodes`, no GPU node has been detected yet. Node Feature Discovery labels GPU nodes automatically once it detects an NVIDIA PCI device (vendor `10de`); the sandbox operands are then scheduled onto those nodes.
+
+2. Identify the GPU node:
+
+   ```bash
+   kubectl get nodes -o wide
+   ```
+
+3. Verify the GPU node advertises a passthrough-capable GPU. Output similar to `nvidia.com/GK210GL_TESLA_K80` indicates passthrough-capable GPUs are available:
+
+   ```bash
+   # Replace <gpu-node-name> with the node from the previous step
+   kubectl get node <gpu-node-name> -o json | jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com/"))) | with_entries(select(.value != "0"))'
+   ```
+
+   Example output:
+
+   ```
+   {
+       "nvidia.com/GK210GL_TESLA_K80": "8"
+   }
+   ```
+
+## Configure KubeVirt
+
+1. Enable the `disableMDevConfiguration` feature gate:
+
+   ```bash
+   kubectl patch hco kubevirt-hyperconverged -n kubevirt --type='json' \
+     -p='[{"op": "add", "path": "/spec/featureGates/disableMDevConfiguration", "value": true }]'
+   ```
+
+2. On the GPU node, obtain the `pciDeviceSelector`. In the output below, `10de:102d` is the selector value:
+
+   ```bash
+   lspci -nn | grep -i nvidia
+   # 04:00.0 3D controller [0302]: NVIDIA Corporation GK210GL [Tesla K80] [10de:102d] (rev a1)
+   ```
+
+3. Obtain the `resourceName` from the GPU node's allocatable resources (for example `nvidia.com/GK210GL_TESLA_K80`):
+
+   ```bash
+   # Replace <gpu-node-name> with the GPU node name
+   kubectl get node <gpu-node-name> -o json | jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com/"))) | with_entries(select(.value != "0"))'
+   ```
+
+4. Register the passthrough GPU in `permittedHostDevices`.
+
+   > **Note:** Convert all letters in the `pciDeviceSelector` to **uppercase**. For example, `10de:102d` becomes `10DE:102D`.
+
+   - Add a single GPU model:
+
+     ```bash
+     export DEVICE=<pci-devices-id>      # e.g. 10DE:102D
+     export RESOURCE=<resource-name>     # e.g. nvidia.com/GK210GL_TESLA_K80
+
+     kubectl patch hco kubevirt-hyperconverged -n kubevirt --type='json' -p='
+     [
+       {
+         "op": "add",
+         "path": "/spec/permittedHostDevices",
+         "value": {
+           "pciHostDevices": [
+             {
+               "externalResourceProvider": true,
+               "pciDeviceSelector": "'"$DEVICE"'",
+               "resourceName": "'"$RESOURCE"'"
+             }
+           ]
+         }
+       }
+     ]'
+     ```
+
+   - Append an additional GPU model after one is already registered (`INDEX` is a zero-based array index):
+
+     ```bash
+     export DEVICE=<pci-devices-id>
+     export RESOURCE=<resource-name>
+     export INDEX=<index>               # e.g. 1 to add a second device
+
+     kubectl patch hco kubevirt-hyperconverged -n kubevirt --type='json' -p='
+     [
+       {
+         "op": "add",
+         "path": "/spec/permittedHostDevices/pciHostDevices/'"${INDEX}"'",
+         "value": {
+           "externalResourceProvider": true,
+           "pciDeviceSelector": "'"$DEVICE"'",
+           "resourceName": "'"$RESOURCE"'"
+         }
+       }
+     ]'
+     ```
+
+## Create a Virtual Machine with a Passthrough GPU
+
+After the configuration above, the physical GPU can be selected when creating a VM.
+
+1. Go to **Container Platform**.
+2. In the left navigation bar, click **Virtualization** > **Virtual Machines**.
+3. Click **Create Virtual Machine**.
+4. Configure the **Physical GPU (Alpha)** parameter:
+
+   | Parameter            | Description |
+   | -------------------- | ----------- |
+   | **Physical GPU (Alpha)** | Select the configured physical GPU model. Only one physical GPU can be assigned per VM. |
+
+If the configured GPU model can be selected during VM creation, the passthrough environment is ready.
+
+## Related Operations
+
+### Remove GPU Configuration from KubeVirt
+
+```bash
+kubectl patch hco kubevirt-hyperconverged -n kubevirt --type='json' \
+  -p='[{"op": "remove", "path": "/spec/permittedHostDevices"}]'
+```
+
+After removal, the GPU model can no longer be selected when creating a VM.
+
+### Uninstall the GPU Operator Cluster Plugin
+
+1. In the left navigation bar, go to **Marketplace** > **Cluster Plugins**.
+2. Select the target cluster.
+3. Click the action button next to the **NVIDIA GPU Operator** plugin, then select **Uninstall**.

From 0e9881aa476fe8f4329b1a2646e3d3396b0ab6f1 Mon Sep 17 00:00:00 2001
From: licheng <chengli@alauda.io>
Date: Wed, 24 Jun 2026 03:12:09 +0000
Subject: [PATCH 2/4] docs: add language specifiers to fenced code blocks

Tag the GRUB, grub2-mkconfig, and dmesg blocks as bash and the GPU
allocatable example output as json for syntax highlighting.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 ...PU_Passthrough_for_KubeVirt_Virtual_Machines_on_ACP.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/en/solutions/Physical_GPU_Passthrough_for_KubeVirt_Virtual_Machines_on_ACP.md b/docs/en/solutions/Physical_GPU_Passthrough_for_KubeVirt_Virtual_Machines_on_ACP.md
index 0642c1d6..94bcb92d 100644
--- a/docs/en/solutions/Physical_GPU_Passthrough_for_KubeVirt_Virtual_Machines_on_ACP.md
+++ b/docs/en/solutions/Physical_GPU_Passthrough_for_KubeVirt_Virtual_Machines_on_ACP.md
@@ -37,13 +37,13 @@ The procedure to enable IOMMU varies by operating system; refer to your OS docum
 
 1. Edit `/etc/default/grub` and add `intel_iommu=on iommu=pt` to `GRUB_CMDLINE_LINUX`:
 
-   ```
+   ```bash
    GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=centos/root rhgb quiet intel_iommu=on iommu=pt"
    ```
 
 2. Regenerate `grub.cfg`:
 
-   ```
+   ```bash
    grub2-mkconfig -o /boot/grub2/grub.cfg
    ```
 
@@ -51,7 +51,7 @@ The procedure to enable IOMMU varies by operating system; refer to your OS docum
 
 4. Confirm IOMMU is enabled. The output should contain `IOMMU enabled`:
 
-   ```
+   ```bash
    dmesg | grep -i iommu
    ```
 
@@ -105,7 +105,7 @@ The plugin installs in `vm-passthrough` sandbox mode by default. The platform au
 
    Example output:
 
-   ```
+   ```json
    {
        "nvidia.com/GK210GL_TESLA_K80": "8"
    }

From 13603c10f9b4577b09c07cdf6dd673875a11b30d Mon Sep 17 00:00:00 2001
From: licheng <chengli@alauda.io>
Date: Wed, 24 Jun 2026 05:26:13 +0000
Subject: [PATCH 3/4] docs: make KubeVirt host-device patches non-destructive

- permittedHostDevices: inspect first, initialize only when unset, and
  append the GPU via pciHostDevices/- instead of overwriting the whole
  object or computing an index
- removal now targets only the GPU's pciHostDevices entry by index
- disableMDevConfiguration: add a pre-check and warning since it is a
  global feature gate affecting mediated devices / vGPU

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
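A minimal sketch of the three cases this patch distinguishes (unset, `pciHostDevices` present, `pciHostDevices` missing), written as a function that only prints the JSON Patch document. The name `gpu_host_device_patch` is hypothetical. The string check for `"pciHostDevices"` assumes `kubectl ... -o jsonpath='{.spec.permittedHostDevices}'` prints the object as JSON. In the third case the sketch creates the array together with the GPU entry in one `add` operation, which is an alternative to creating an empty array and appending afterwards.

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose the JSON Patch for registering a GPU based on
# the current value of .spec.permittedHostDevices (passed in as a string).
gpu_host_device_patch() {
  local current="$1" device="$2" resource="$3" entry
  entry=$(printf '{"externalResourceProvider": true, "pciDeviceSelector": "%s", "resourceName": "%s"}' \
    "$device" "$resource")

  if [ -z "$current" ]; then
    # permittedHostDevices is unset: initialize it with the GPU entry.
    printf '[{"op": "add", "path": "/spec/permittedHostDevices", "value": {"pciHostDevices": [%s]}}]\n' "$entry"
  elif printf '%s' "$current" | grep -q '"pciHostDevices"'; then
    # The array exists: append with "-" so existing entries are untouched.
    printf '[{"op": "add", "path": "/spec/permittedHostDevices/pciHostDevices/-", "value": %s}]\n' "$entry"
  else
    # The object exists without pciHostDevices: add the array and entry together.
    printf '[{"op": "add", "path": "/spec/permittedHostDevices/pciHostDevices", "value": [%s]}]\n' "$entry"
  fi
}

# Possible usage (not executed here):
#   current=$(kubectl get hco kubevirt-hyperconverged -n kubevirt -o jsonpath='{.spec.permittedHostDevices}')
#   kubectl patch hco kubevirt-hyperconverged -n kubevirt --type='json' \
#     -p="$(gpu_host_device_patch "$current" "$DEVICE" "$RESOURCE")"
```

Keeping the decision in a function that prints the patch makes it possible to review the generated JSON before anything is applied to the cluster.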
 ...gh_for_KubeVirt_Virtual_Machines_on_ACP.md | 46 +++++++++++++++----
 1 file changed, 36 insertions(+), 10 deletions(-)

diff --git a/docs/en/solutions/Physical_GPU_Passthrough_for_KubeVirt_Virtual_Machines_on_ACP.md b/docs/en/solutions/Physical_GPU_Passthrough_for_KubeVirt_Virtual_Machines_on_ACP.md
index 94bcb92d..70a81a41 100644
--- a/docs/en/solutions/Physical_GPU_Passthrough_for_KubeVirt_Virtual_Machines_on_ACP.md
+++ b/docs/en/solutions/Physical_GPU_Passthrough_for_KubeVirt_Virtual_Machines_on_ACP.md
@@ -113,7 +113,16 @@ The plugin installs in `vm-passthrough` sandbox mode by default. The platform au
 
 ## Configure KubeVirt
 
-1. Enable the `disableMDevConfiguration` feature gate:
+1. Enable the `disableMDevConfiguration` feature gate. It disables KubeVirt's mediated-device (mdev / vGPU) management; `vfio-pci` passthrough requires that management to be disabled.
+
+   > **Warning:** `disableMDevConfiguration` is a global HCO feature gate. If the cluster already serves mediated devices or NVIDIA vGPU, enabling it disrupts that configuration. Verify first that no mediated devices are in use:
+
+   ```bash
+   kubectl get hco kubevirt-hyperconverged -n kubevirt \
+     -o jsonpath='{.spec.mediatedDevicesConfiguration}{"\n"}{.spec.permittedHostDevices.mediatedDevices}{"\n"}'
+   ```
+
+   Then enable the feature gate:
 
    ```bash
    kubectl patch hco kubevirt-hyperconverged -n kubevirt --type='json' \
@@ -134,11 +143,15 @@ The plugin installs in `vm-passthrough` sandbox mode by default. The platform au
    kubectl get node <gpu-node-name> -o json | jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com/"))) | with_entries(select(.value != "0"))'
    ```
 
-4. Register the passthrough GPU in `permittedHostDevices`.
+4. Register the passthrough GPU as a `pciHostDevices` entry. To avoid overwriting USB host devices, mediated devices, or other PCI devices that may already be configured, first inspect the current value:
+
+   ```bash
+   kubectl get hco kubevirt-hyperconverged -n kubevirt -o jsonpath='{.spec.permittedHostDevices}{"\n"}'
+   ```
 
    > **Note:** Convert all letters in the `pciDeviceSelector` to **uppercase**. For example, `10de:102d` becomes `10DE:102D`.
 
-   - Add a single GPU model:
+   - If `permittedHostDevices` is **not yet configured** (empty output above), initialize it with the GPU entry:
 
      ```bash
      export DEVICE=<pci-devices-id>      # e.g. 10DE:102D
@@ -162,18 +175,17 @@ The plugin installs in `vm-passthrough` sandbox mode by default. The platform au
      ]'
      ```
 
-   - Append an additional GPU model after one is already registered (`INDEX` is a zero-based array index):
+   - If `permittedHostDevices.pciHostDevices` **already exists**, append the GPU entry without touching the existing devices. The `-` token appends to the end of the array, so no index calculation is needed:
 
      ```bash
      export DEVICE=<pci-devices-id>
      export RESOURCE=<resource-name>
-     export INDEX=<index>               # e.g. 1 to add a second device
 
      kubectl patch hco kubevirt-hyperconverged -n kubevirt --type='json' -p='
      [
        {
          "op": "add",
-         "path": "/spec/permittedHostDevices/pciHostDevices/'"${INDEX}"'",
+         "path": "/spec/permittedHostDevices/pciHostDevices/-",
          "value": {
            "externalResourceProvider": true,
            "pciDeviceSelector": "'"$DEVICE"'",
@@ -183,6 +195,8 @@ The plugin installs in `vm-passthrough` sandbox mode by default. The platform au
      ]'
      ```
 
+     > **Note:** If `permittedHostDevices` exists but has no `pciHostDevices` array yet, first create it with `{"op": "add", "path": "/spec/permittedHostDevices/pciHostDevices", "value": []}`, then run the append above.
+
 ## Create a Virtual Machine with a Passthrough GPU
 
 After the configuration above, the physical GPU can be selected when creating a VM.
@@ -202,10 +216,22 @@ If the configured GPU model can be selected during VM creation, the passthrough
 
 ### Remove GPU Configuration from KubeVirt
 
-```bash
-kubectl patch hco kubevirt-hyperconverged -n kubevirt --type='json' \
-  -p='[{"op": "remove", "path": "/spec/permittedHostDevices"}]'
-```
+> **Warning:** Do not remove the entire `/spec/permittedHostDevices` object if the cluster also serves other host devices (USB host devices, mediated devices, or other PCI devices), because that would delete them as well. Remove only the GPU's `pciHostDevices` entry.
+
+1. List the current `pciHostDevices` to find the zero-based index of the GPU entry:
+
+   ```bash
+   kubectl get hco kubevirt-hyperconverged -n kubevirt -o jsonpath='{.spec.permittedHostDevices.pciHostDevices}{"\n"}'
+   ```
+
+2. Remove the GPU entry by its index (replace `<index>`):
+
+   ```bash
+   export INDEX=<index>
+
+   kubectl patch hco kubevirt-hyperconverged -n kubevirt --type='json' \
+     -p='[{"op": "remove", "path": "/spec/permittedHostDevices/pciHostDevices/'"${INDEX}"'"}]'
+   ```
 
 After removal, the GPU model can no longer be selected when creating a VM.
 

From fbf1275fa12b85cfd032b7fc28e1355a7b8a9b87 Mon Sep 17 00:00:00 2001
From: licheng <chengli@alauda.io>
Date: Wed, 24 Jun 2026 05:45:12 +0000
Subject: [PATCH 4/4] docs: give full command to initialize empty
 pciHostDevices array

Replace the inline JSON note with a complete kubectl patch command so
that, when permittedHostDevices exists without a pciHostDevices array,
the append step does not fail on a missing path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 ...GPU_Passthrough_for_KubeVirt_Virtual_Machines_on_ACP.md | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/docs/en/solutions/Physical_GPU_Passthrough_for_KubeVirt_Virtual_Machines_on_ACP.md b/docs/en/solutions/Physical_GPU_Passthrough_for_KubeVirt_Virtual_Machines_on_ACP.md
index 70a81a41..c814f72b 100644
--- a/docs/en/solutions/Physical_GPU_Passthrough_for_KubeVirt_Virtual_Machines_on_ACP.md
+++ b/docs/en/solutions/Physical_GPU_Passthrough_for_KubeVirt_Virtual_Machines_on_ACP.md
@@ -195,7 +195,12 @@ The plugin installs in `vm-passthrough` sandbox mode by default. The platform au
      ]'
      ```
 
-     > **Note:** If `permittedHostDevices` exists but has no `pciHostDevices` array yet, first create it with `{"op": "add", "path": "/spec/permittedHostDevices/pciHostDevices", "value": []}`, then run the append above.
+     If `permittedHostDevices` exists but has no `pciHostDevices` array yet, the append above fails because the path does not exist. Create the empty array first, then run the append:
+
+     ```bash
+     kubectl patch hco kubevirt-hyperconverged -n kubevirt --type='json' \
+       -p='[{"op": "add", "path": "/spec/permittedHostDevices/pciHostDevices", "value": []}]'
+     ```
 
 ## Create a Virtual Machine with a Passthrough GPU