From 7e62376a002d996284ffe5b2ad976b1b7a2c0738 Mon Sep 17 00:00:00 2001 From: Aleksei Sviridkin Date: Thu, 28 May 2026 21:31:33 +0300 Subject: [PATCH 1/5] docs(gpu): drop manual KubeVirt patch step now that the platform auto-wires permittedHostDevices MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Step 2 of the GPU Passthrough guide instructed operators to `kubectl edit kubevirt -n cozy-kubevirt` and hand-paste a permittedHostDevices.pciHostDevices block. cozystack/cozystack#2768 removes the need for that step: when cozystack.gpu-operator is in bundles.enabledPackages, the platform now mirrors the chosen GPU variant into the KubeVirt CR automatically — appending HostDevices to the feature-gate list and rendering a starter NVIDIA pciHostDevices table covering Hopper, Ada Lovelace, Ampere, Turing and Volta. The new step 2 documents the contract (what the platform auto-injects and why), the verification recipe, the escape hatch via .gpu.permittedHostDevices / .gpu.replaceDefaults, and the manual Package-CR override path used by operators who need overrides the bundle does not expose (driver settings, custom node selectors, validator / dcgmExporter tweaks) — in that flow they also hand-craft the matching cozystack.kubevirt Package CR. Only next/virtualization/gpu.md is updated; v1.4 and earlier describe releases that still require the manual patch and stay as-is. Assisted-By: Claude Signed-off-by: Aleksei Sviridkin --- content/en/docs/next/virtualization/gpu.md | 50 ++++++++++++++-------- 1 file changed, 32 insertions(+), 18 deletions(-) diff --git a/content/en/docs/next/virtualization/gpu.md b/content/en/docs/next/virtualization/gpu.md index bc71d894..745de72d 100644 --- a/content/en/docs/next/virtualization/gpu.md +++ b/content/en/docs/next/virtualization/gpu.md @@ -100,32 +100,46 @@ Allocatable: For example, the database entry for A10 reads `2236 GA102GL [A10]`, which results in a resource name `nvidia.com/GA102GL_A10`. {{% /alert %}} -## 2. Update the KubeVirt Custom Resource +## 2. KubeVirt is wired automatically -Next, we will update the KubeVirt Custom Resource, as documented in the -[KubeVirt user guide](https://kubevirt.io/user-guide/virtual_machines/host-devices/#listing-permitted-devices), -so that the passthrough GPUs are permitted and can be requested by a KubeVirt VM. +When `cozystack.gpu-operator` is in `bundles.enabledPackages`, Cozystack mirrors the chosen GPU variant into the `KubeVirt` Custom Resource for you. There is no `kubectl edit kubevirt` step. -Adjust the `pciVendorSelector` and `resourceName` values to match your specific GPU model. -Setting `externalResourceProvider=true` indicates that this resource is provided by an external device plugin, -in this case the `sandbox-device-plugin` which is deployed by the Operator. +Specifically, the platform injects: + +- `HostDevices` into `spec.configuration.developerConfiguration.featureGates` (current KubeVirt splits this from the `GPU` gate; the admission webhook rejects `domain.devices.hostDevices` without it). +- A starter `spec.configuration.permittedHostDevices.pciHostDevices` table covering common NVIDIA datacenter GPUs — Hopper (H100, H200), Ada Lovelace (L4, L40, L40S), Ampere (A100 PCIe/SXM, A40, A30, A10), Turing (T4), Volta (V100, V100S). PCI vendor:device pairs are stable; `resourceName` slugs follow the `__
_` convention `nvidia-sandbox-device-plugin` v25.x emits (e.g. `nvidia.com/GA102GL_A10`). `externalResourceProvider: true` is set on every entry because the resources are advertised by the sandbox plugin, not by KubeVirt's in-tree device plugin. + +Verify the resulting CR: ```bash -kubectl edit kubevirt -n cozy-kubevirt +kubectl -n cozy-kubevirt get kubevirt kubevirt -o yaml \ + | yq '.spec.configuration | {featureGates: .developerConfiguration.featureGates, permittedHostDevices: .permittedHostDevices}' ``` -example config: + +### Extending or replacing the NVIDIA defaults + +If your cluster ships a GPU not in the default table, or your `nvidia-sandbox-device-plugin` version emits a different `resourceName` (check with `kubectl describe node | grep nvidia.com/`), extend the defaults via platform values: + ```yaml - ... - spec: - configuration: - permittedHostDevices: - pciHostDevices: - - externalResourceProvider: true - pciVendorSelector: 10DE:2236 - resourceName: nvidia.com/GA102GL_A10 - ... +# Platform Package values +gpu: + # Append (default) — your entries land alongside the NVIDIA table. + # Set to true to drop the NVIDIA table entirely (useful for non-NVIDIA-only + # clusters or strict allowlists). With replaceDefaults: true and an empty + # list below, the rendered CR carries no permittedHostDevices block at all + # and the admission webhook rejects every GPU VM — supply your own list. + replaceDefaults: false + permittedHostDevices: + pciHostDevices: + - pciVendorSelector: "10DE:2236" + resourceName: nvidia.com/GA102GL_A10 + externalResourceProvider: true ``` +### Manual Package-CR override path + +If you opt out of bundle management and hand-craft a `cozystack.gpu-operator` Package CR directly (to apply overrides the bundle does not expose — driver settings, custom node selectors, validator / dcgmExporter tweaks), the platform does NOT auto-wire `HostDevices` or `permittedHostDevices` into the KubeVirt CR. In that flow, mirror the bundle behaviour by also creating a `cozystack.kubevirt` Package CR with `components.kubevirt.values.extraFeatureGates: [HostDevices]` and the appropriate `permittedHostDevices` block. The manual Package-CR override path takes precedence over the bundle render whenever both exist. + ## 3. Create a Virtual Machine We are now ready to create a VM. From e0366205c69b9e9e21047a7d6bf881780a63d57b Mon Sep 17 00:00:00 2001 From: Aleksei Sviridkin Date: Wed, 3 Jun 2026 02:20:07 +0300 Subject: [PATCH 2/5] docs(gpu): add pre-upgrade migration steps for hand-edited permittedHostDevices The bundle now owns spec.configuration.permittedHostDevices, so the first reconcile after upgrade overwrites manual kubectl-edit entries with the NVIDIA default table. Tell operators to move custom entries into .gpu.permittedHostDevices and verify each resourceName against node-advertised names before upgrading, since the default slugs (e.g. TU104GL_T4) differ from legacy names (e.g. TU104GL_TESLA_T4) and a mismatch silently rejects GPU VMs. Assisted-By: Claude Signed-off-by: Aleksei Sviridkin --- content/en/docs/next/virtualization/gpu.md | 23 ++++++++++++++++++++++ 1 file changed, 23 insertions(+) diff --git a/content/en/docs/next/virtualization/gpu.md b/content/en/docs/next/virtualization/gpu.md index 745de72d..3e25b017 100644 --- a/content/en/docs/next/virtualization/gpu.md +++ b/content/en/docs/next/virtualization/gpu.md @@ -136,6 +136,29 @@ gpu: externalResourceProvider: true ``` +### Upgrading from a hand-edited KubeVirt CR + +Earlier Cozystack releases left `spec.configuration.permittedHostDevices` for operators to hand-edit (`kubectl edit kubevirt`). The bundle now **owns** that field: the first reconcile after the upgrade replaces your manual entries with the rendered NVIDIA default table. + +Before upgrading: + +1. Dump your current entries: + + ```bash + kubectl get kubevirt -n cozy-kubevirt -o yaml \ + | yq '.items[0].spec.configuration.permittedHostDevices' + ``` + +2. Move any custom entries into the Platform Package values under `.gpu.permittedHostDevices` (set `.gpu.replaceDefaults: true` if you want only your own list instead of appending to the NVIDIA defaults). + +3. Verify every `resourceName` against what your nodes actually advertise — the default table uses `nvidia-sandbox-device-plugin` slugs (e.g. `nvidia.com/TU104GL_T4`) that differ from legacy driver names (e.g. `TU104GL_TESLA_T4`): + + ```bash + kubectl describe node | grep nvidia.com/ + ``` + +A `resourceName` mismatch is silent until a GPU VM restarts or migrates, at which point the admission webhook rejects it. + ### Manual Package-CR override path If you opt out of bundle management and hand-craft a `cozystack.gpu-operator` Package CR directly (to apply overrides the bundle does not expose — driver settings, custom node selectors, validator / dcgmExporter tweaks), the platform does NOT auto-wire `HostDevices` or `permittedHostDevices` into the KubeVirt CR. In that flow, mirror the bundle behaviour by also creating a `cozystack.kubevirt` Package CR with `components.kubevirt.values.extraFeatureGates: [HostDevices]` and the appropriate `permittedHostDevices` block. The manual Package-CR override path takes precedence over the bundle render whenever both exist. From 3e5b50484dd35c5a04cec60cbbb0d58f82d09176 Mon Sep 17 00:00:00 2001 From: Aleksei Sviridkin Date: Fri, 5 Jun 2026 00:42:25 +0300 Subject: [PATCH 3/5] docs(gpu): make the permittedHostDevices escape hatch discoverable and portable MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add a callout that redirects operators looking for the removed `kubectl edit kubevirt` step to the `.gpu.permittedHostDevices` knob, linking the extend/replace and upgrade sections so the persistent manual path stays easy to find. Use `kubectl -o json | jq` for the verify and dump commands — matches the convention used across the rest of the docs and avoids the Go-yq vs Python-yq expression-syntax drift. Correct the resourceName slug convention to `_` with optional `__` qualifiers, and note the default table is rendered in the passthrough (vfio-pci) variant. Assisted-By: Claude Signed-off-by: Aleksei Sviridkin --- content/en/docs/next/virtualization/gpu.md | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/content/en/docs/next/virtualization/gpu.md b/content/en/docs/next/virtualization/gpu.md index 3e25b017..e556c630 100644 --- a/content/en/docs/next/virtualization/gpu.md +++ b/content/en/docs/next/virtualization/gpu.md @@ -107,15 +107,21 @@ When `cozystack.gpu-operator` is in `bundles.enabledPackages`, Cozystack mirrors Specifically, the platform injects: - `HostDevices` into `spec.configuration.developerConfiguration.featureGates` (current KubeVirt splits this from the `GPU` gate; the admission webhook rejects `domain.devices.hostDevices` without it). -- A starter `spec.configuration.permittedHostDevices.pciHostDevices` table covering common NVIDIA datacenter GPUs — Hopper (H100, H200), Ada Lovelace (L4, L40, L40S), Ampere (A100 PCIe/SXM, A40, A30, A10), Turing (T4), Volta (V100, V100S). PCI vendor:device pairs are stable; `resourceName` slugs follow the `___` convention `nvidia-sandbox-device-plugin` v25.x emits (e.g. `nvidia.com/GA102GL_A10`). `externalResourceProvider: true` is set on every entry because the resources are advertised by the sandbox plugin, not by KubeVirt's in-tree device plugin. +- A starter `spec.configuration.permittedHostDevices.pciHostDevices` table (rendered in the default `gpuOperatorVariant: default` — vfio-pci passthrough) covering common NVIDIA datacenter GPUs — Hopper (H100, H200), Ada Lovelace (L4, L40, L40S), Ampere (A100 PCIe/SXM, A40, A30, A10), Turing (T4), Volta (V100, V100S). PCI vendor:device pairs are stable; `resourceName` slugs follow what `nvidia-sandbox-device-plugin` v25.x emits — `_`, with optional `__` qualifiers appended when a model ships in several memory or form-factor variants (e.g. `nvidia.com/GA102GL_A10` for the single-SKU A10, `nvidia.com/GH100_H200_SXM_141GB` for the H200). `externalResourceProvider: true` is set on every entry because the resources are advertised by the sandbox plugin, not by KubeVirt's in-tree device plugin. Verify the resulting CR: ```bash -kubectl -n cozy-kubevirt get kubevirt kubevirt -o yaml \ - | yq '.spec.configuration | {featureGates: .developerConfiguration.featureGates, permittedHostDevices: .permittedHostDevices}' +kubectl -n cozy-kubevirt get kubevirt kubevirt -o json \ + | jq '.spec.configuration | {featureGates: .developerConfiguration.featureGates, permittedHostDevices: .permittedHostDevices}' ``` +{{% alert color="info" %}} + +**My GPU isn't in the default table — where's the old `kubectl edit kubevirt` step?** It is gone on purpose. `permittedHostDevices` is now owned by the chart template and reconciled from platform values, so any hand edit to the live CR is reverted on the next Flux/Helm reconcile. Add your card through `.gpu.permittedHostDevices` instead — see [Extending or replacing the NVIDIA defaults](#extending-or-replacing-the-nvidia-defaults) below. If you are upgrading from a release where you hand-edited the CR, follow [Upgrading from a hand-edited KubeVirt CR](#upgrading-from-a-hand-edited-kubevirt-cr) first. + +{{% /alert %}} + ### Extending or replacing the NVIDIA defaults If your cluster ships a GPU not in the default table, or your `nvidia-sandbox-device-plugin` version emits a different `resourceName` (check with `kubectl describe node | grep nvidia.com/`), extend the defaults via platform values: @@ -145,8 +151,8 @@ Before upgrading: 1. Dump your current entries: ```bash - kubectl get kubevirt -n cozy-kubevirt -o yaml \ - | yq '.items[0].spec.configuration.permittedHostDevices' + kubectl -n cozy-kubevirt get kubevirt kubevirt -o json \ + | jq '.spec.configuration.permittedHostDevices' ``` 2. Move any custom entries into the Platform Package values under `.gpu.permittedHostDevices` (set `.gpu.replaceDefaults: true` if you want only your own list instead of appending to the NVIDIA defaults). From 1b118711ff1cf571260a52ebcc42710f67716fad Mon Sep 17 00:00:00 2001 From: Aleksei Sviridkin Date: Mon, 8 Jun 2026 17:28:04 +0300 Subject: [PATCH 4/5] docs(gpu): clarify the manual override Package CR path and selector overrides MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Spell out that a standalone cozystack.kubevirt Package CR nests its values under spec.components..values — not a top-level spec.values — and show the full CR shape, so the manual override path is unambiguous. Add a note that re-pointing a card already in the NVIDIA table needs replaceDefaults rather than a second entry for the same pciVendorSelector, which KubeVirt resolves non-deterministically. Assisted-By: Claude Signed-off-by: Aleksei Sviridkin --- content/en/docs/next/virtualization/gpu.md | 25 +++++++++++++++++++++- 1 file changed, 24 insertions(+), 1 deletion(-) diff --git a/content/en/docs/next/virtualization/gpu.md b/content/en/docs/next/virtualization/gpu.md index e556c630..87a2f88e 100644 --- a/content/en/docs/next/virtualization/gpu.md +++ b/content/en/docs/next/virtualization/gpu.md @@ -142,6 +142,8 @@ gpu: externalResourceProvider: true ``` +To **re-point** a card already in the NVIDIA table (for example to give `10DE:1EB8` a different `resourceName`), do not append a second entry for the same `pciVendorSelector` — both entries are rendered and KubeVirt resolves the duplicated selector non-deterministically. Set `replaceDefaults: true` and supply the full list you want instead. + ### Upgrading from a hand-edited KubeVirt CR Earlier Cozystack releases left `spec.configuration.permittedHostDevices` for operators to hand-edit (`kubectl edit kubevirt`). The bundle now **owns** that field: the first reconcile after the upgrade replaces your manual entries with the rendered NVIDIA default table. @@ -167,7 +169,28 @@ A `resourceName` mismatch is silent until a GPU VM restarts or migrates, at whic ### Manual Package-CR override path -If you opt out of bundle management and hand-craft a `cozystack.gpu-operator` Package CR directly (to apply overrides the bundle does not expose — driver settings, custom node selectors, validator / dcgmExporter tweaks), the platform does NOT auto-wire `HostDevices` or `permittedHostDevices` into the KubeVirt CR. In that flow, mirror the bundle behaviour by also creating a `cozystack.kubevirt` Package CR with `components.kubevirt.values.extraFeatureGates: [HostDevices]` and the appropriate `permittedHostDevices` block. The manual Package-CR override path takes precedence over the bundle render whenever both exist. +If you opt out of bundle management and hand-craft a `cozystack.gpu-operator` Package CR directly (to apply overrides the bundle does not expose — driver settings, custom node selectors, validator / dcgmExporter tweaks), the platform does NOT auto-wire `HostDevices` or `permittedHostDevices` into the KubeVirt CR. In that flow, mirror the bundle behaviour by also creating a `cozystack.kubevirt` Package CR that carries `extraFeatureGates` and the matching `permittedHostDevices` block under `spec.components.kubevirt.values` (a cozystack `Package` always nests component values under `spec.components..values`, never a top-level `spec.values`): + +```yaml +apiVersion: cozystack.io/v1alpha1 +kind: Package +metadata: + name: cozystack.kubevirt +spec: + variant: default + components: + kubevirt: + values: + extraFeatureGates: + - HostDevices + permittedHostDevices: + pciHostDevices: + - pciVendorSelector: "10DE:2236" + resourceName: nvidia.com/GA102GL_A10 + externalResourceProvider: true +``` + +The manual Package-CR override path takes precedence over the bundle render whenever both exist. ## 3. Create a Virtual Machine From 3f65e6777d368c8305b49fa56ce57785dd037df3 Mon Sep 17 00:00:00 2001 From: Aleksei Sviridkin Date: Mon, 8 Jun 2026 18:06:53 +0300 Subject: [PATCH 5/5] docs(gpu): fix the example resourceName in the upgrade note MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The upgrade step had the slug example backwards — it presented the tidy nvidia.com/TU104GL_T4 as what the sandbox device plugin emits and TU104GL_TESLA_T4 as a legacy name. It is the other way around: the plugin upper-cases the PCI-IDs name and keeps the Tesla brand, so a vanilla plugin advertises nvidia.com/TU104GL_TESLA_T4. Point operators at the slug the plugin actually generates and note it can vary by plugin build. Assisted-By: Claude Signed-off-by: Aleksei Sviridkin --- content/en/docs/next/virtualization/gpu.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/en/docs/next/virtualization/gpu.md b/content/en/docs/next/virtualization/gpu.md index 87a2f88e..7799d4e9 100644 --- a/content/en/docs/next/virtualization/gpu.md +++ b/content/en/docs/next/virtualization/gpu.md @@ -159,7 +159,7 @@ Before upgrading: 2. Move any custom entries into the Platform Package values under `.gpu.permittedHostDevices` (set `.gpu.replaceDefaults: true` if you want only your own list instead of appending to the NVIDIA defaults). -3. Verify every `resourceName` against what your nodes actually advertise — the default table uses `nvidia-sandbox-device-plugin` slugs (e.g. `nvidia.com/TU104GL_T4`) that differ from legacy driver names (e.g. `TU104GL_TESLA_T4`): +3. Verify every `resourceName` against what your nodes actually advertise. The default table carries the slug `nvidia-sandbox-device-plugin` generates from each card's PCI-IDs name (uppercased, e.g. `nvidia.com/TU104GL_TESLA_T4` for a Tesla T4), but a different plugin build or PCI-IDs snapshot can emit a different string: ```bash kubectl describe node | grep nvidia.com/