docs(operations): add containerized GPU workloads guide #555

Open
Aleksei Sviridkin (lexfrei) wants to merge 1 commit into main from feat/gpu-container-workloads-docs

@lexfrei Aleksei Sviridkin (lexfrei) commented May 28, 2026

What this PR does

Add a new operations guide describing the container variant of cozystack.gpu-operator — the architectural mode for containerized GPU workloads (CUDA pods, ML training, inference) on Linux GPU nodes that already ship the NVIDIA driver and nvidia-container-toolkit via the distro package manager.

The new page lands at content/en/docs/next/operations/gpu-container-workloads.md and rounds out the GPU documentation surface.

Content covers when to pick the variant (host driver + host toolkit + a containerd-registered nvidia runtime prerequisite), the host-driver reuse path (driver.enabled=false, so the operator uses the pre-installed driver at its standard location with no driverInstallDir override on a stock apt install), the Talos caveat with a pointer to the examples/values-native-talos.yaml reference, install steps with Package CR variant: container, a sample CUDA pod for verification, why stacking HAMi directly on this variant is not supported yet, and a three-row variant comparison matrix.
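For orientation, the GPU-allocatable verification mentioned above could be sketched roughly as follows. This is an illustration rather than the guide's exact text: the helper name `gpu_allocatable` and the node name are hypothetical, and only the `nvidia.com/gpu` resource name comes from this PR.

```bash
# Hypothetical helper (not taken from the guide): print how many whole GPUs
# a node advertises as allocatable under the nvidia.com/gpu resource name.
gpu_allocatable() {
  # The dot in the resource name is escaped so that JSONPath treats
  # "nvidia.com/gpu" as a single map key instead of nested fields.
  kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}

# Usage (the node name is a placeholder):
#   gpu_allocatable gpu-node-1
```

An empty result would suggest that no device plugin has registered the resource on that node yet.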

Companion to cozystack/cozystack#2766, which adds the container variant itself.

Release note

docs(operations): add guide for containerized GPU workloads via the gpu-operator `container` variant.

Summary by CodeRabbit

  • Documentation
    • New guide for running containerized GPU workloads on cluster nodes: prerequisites, installation via the Package CR, explicit warning against using bundles.enabledPackages for this variant, operator health and GPU allocatable verification, sample CUDA Pod workflow, fractional GPU sharing via HAMi, and a comparison of container, default (VM passthrough), and vGPU variants.

netlify Bot commented May 28, 2026

Deploy Preview for cozystack ready!

Name Link
🔨 Latest commit f2ae9b7
🔍 Latest deploy log https://app.netlify.com/projects/cozystack/deploys/6a26cec5f38a100008e4fbb0
😎 Deploy Preview https://deploy-preview-555--cozystack.netlify.app

coderabbitai Bot commented May 28, 2026

Walkthrough

Adds a new operations guide documenting how to run containerized GPU workloads on Cozystack management nodes using the cozystack.gpu-operator container variant, including prerequisites, Package CR installation, health checks, CUDA smoke-test, HAMi fractional-sharing notes, and a variant comparison table.

Changes

GPU Container Workloads Documentation

Layer / File(s) Summary
GPU container variant guide
content/en/docs/next/operations/gpu-container-workloads.md
New operations guide explains when to use the container variant (host has NVIDIA driver and nvidia-container-toolkit), installation prerequisites, Package CR setup with warnings against bundles.enabledPackages, operator health verification, nvidia.com/gpu allocatable checks, a CUDA smoke-test Pod example, HAMi fractional-sharing guidance, and a variant comparison table.

Possibly related issues

  • cozystack/cozystack#2764: Directly addresses the same cozystack.gpu-operator container variant documentation and configuration guidance referenced in this PR.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped the docs to share the way,
Containers meet GPUs by light of day,
Drivers checked, CUDA pods take flight,
HAMi whispers fractional delight,
A tiny guide to make workloads play.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title 'docs(operations): add containerized GPU workloads guide' directly and clearly summarizes the main change: adding a new documentation page for containerized GPU workloads, which matches the added content perfectly.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request adds a new documentation page detailing how to run containerized GPU workloads using the container variant of the cozystack.gpu-operator package. The review feedback suggests specifying the cozy-system namespace in both the kubectl patch command and the Package resource manifest to ensure they are applied to the correct namespace.

Comment on lines +36 to +37
kubectl patch packages.cozystack.io cozystack.cozystack-platform --type=json \
-p '[{"op": "add", "path": "/spec/components/platform/values/bundles/enabledPackages/-", "value": "cozystack.gpu-operator"}]'

medium

In Cozystack, the Package resources (including cozystack.cozystack-platform) are typically located in the cozy-system namespace. Running kubectl patch without specifying the namespace will fail if the user's current context is set to another namespace (like default). Adding -n cozy-system ensures the command runs successfully.

Suggested change
- kubectl patch packages.cozystack.io cozystack.cozystack-platform --type=json \
-   -p '[{"op": "add", "path": "/spec/components/platform/values/bundles/enabledPackages/-", "value": "cozystack.gpu-operator"}]'
+ kubectl patch packages.cozystack.io cozystack.cozystack-platform -n cozy-system --type=json \
+   -p '[{"op": "add", "path": "/spec/components/platform/values/bundles/enabledPackages/-", "value": "cozystack.gpu-operator"}]'

Comment on lines +43 to +48
apiVersion: cozystack.io/v1alpha1
kind: Package
metadata:
  name: cozystack.gpu-operator
spec:
  variant: container

medium

The Package resource needs to be created in the cozy-system namespace for the Cozystack operator to detect and reconcile it. Adding namespace: cozy-system to the metadata ensures it is applied to the correct namespace.

Suggested change
- apiVersion: cozystack.io/v1alpha1
- kind: Package
- metadata:
-   name: cozystack.gpu-operator
- spec:
-   variant: container
+ apiVersion: cozystack.io/v1alpha1
+ kind: Package
+ metadata:
+   name: cozystack.gpu-operator
+   namespace: cozy-system
+ spec:
+   variant: container

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

@myasnikovdaniil myasnikovdaniil left a comment

Thanks — this is a well-researched page and most of it checks out against the companion PR cozystack/cozystack#2766 and the platform chart. A few substantive items before merge.

Main blocker: the Fractional GPU sharing section directs users into a device-plugin registration conflict (see inline comment). HAMi does not reuse the operator's device plugin — it ships its own, and the auto-disable that prevents the clash only exists in the tenant kubernetes app chart, not on the management cluster. The container variant pins devicePlugin.enabled: true, so stacking cozystack.hami on top as written runs two plugins both registering nvidia.com/gpu.

Sequencing: cozystack/cozystack#2766 (which adds the container variant) is still open. This page documents a variant that doesn't exist yet — please hold merge until #2766 lands, or confirm both ship in the same release train.

Smaller accuracy/UX fixes inline. Recommendation: request changes.


## Fractional GPU sharing

The `container` variant exposes whole GPUs through the upstream NVIDIA device plugin. To slice one GPU across multiple pods (memory and compute quotas per pod), enable HAMi on top — HAMi reuses the same device plugin layer and is wired in via the `cozystack.hami` package, which already depends on `cozystack.gpu-operator`. See [GPU Sharing with HAMi](/docs/next/kubernetes/gpu-sharing/) for the tenant Kubernetes flow; for management-cluster workloads the wiring is the same package set with HAMi enabled.

⚠️ This HAMi claim is incorrect and would lead users into a resource conflict.

  • "HAMi reuses the same device plugin layer" is wrong. HAMi ships its own device plugin + scheduler extender. The page you link to states the opposite: "When HAMi is enabled, GPU Operator's built-in device plugin is automatically disabled to avoid resource registration conflicts."
  • That auto-disable only lives in the tenant kubernetes app chart (packages/apps/kubernetes/tests/gpu_operator_hami_test.yaml, test case "should disable devicePlugin when hami is enabled"). The management-cluster cozystack.hami PackageSource only declares dependsOn: cozystack.gpu-operator (install ordering); packages/system/hami/values.yaml does not touch the operator's device plugin.
  • The container variant pins devicePlugin.enabled: true (values-container.yaml in #2766). Stacking cozystack.hami on top, as written, runs two device plugins both registering nvidia.com/gpu — exactly the conflict the HAMi doc warns about.

Suggested rewrite:

The `container` variant exposes whole GPUs through the upstream NVIDIA device plugin.
For fractional sharing (per-pod memory and compute quotas), see
[GPU Sharing with HAMi](/docs/next/kubernetes/gpu-sharing/) — currently documented for
tenant Kubernetes clusters, where enabling HAMi automatically disables the GPU Operator's
built-in device plugin to avoid resource-registration conflicts. Stacking the
`cozystack.hami` package directly on top of the `container` variant on the management
cluster is not a supported combination yet: the variant pins the NVIDIA device plugin on,
and running it alongside HAMi's device plugin causes both to register `nvidia.com/gpu`.

The intro at line 10 ("you can stack HAMi on top once the container variant is up") echoes the same claim and should be softened to match.
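
One way to spot the double registration described above is to list device-plugin DaemonSets across all namespaces. The following is a minimal sketch, not something taken from either chart: the helper name `list_device_plugins` is hypothetical, and matching on the substring `device-plugin` is an assumption about how the GPU Operator and HAMi name their DaemonSets.

```bash
# Hypothetical helper: list every DaemonSet whose name contains
# "device-plugin", across all namespaces.
# Assumption: both the operator's plugin and HAMi's plugin include that
# substring in their DaemonSet names.
list_device_plugins() {
  kubectl get daemonsets --all-namespaces --no-headers | grep -i 'device-plugin'
}

# Usage:
#   list_device_plugins
```

If more than one row comes back for the same GPU node pool, both plugins are likely registering `nvidia.com/gpu`, which is the conflict this comment warns about.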

## Prerequisites

- A Cozystack management cluster with at least one GPU-enabled node.
- The GPU node runs a supported Linux distribution (Ubuntu, Debian, RHEL, Fedora, openSUSE) with the NVIDIA driver installed via the distro package manager. Verify with `nvidia-smi` over SSH or `kubectl debug node/<node-name>` — it must enumerate the physical GPUs and report a working driver version.

The companion PR's own OS-support table (docs/gpu-vgpu.md in #2766) only covers Ubuntu 20.04–26.04 and Talos. Cozystack's documented node-OS surface is Talos + Ubuntu/Debian (ansible path). Listing RHEL/Fedora/openSUSE as "supported" presents untested territory as fact.

- The GPU node runs Ubuntu or Debian with the NVIDIA driver installed via the distro
  package manager (other distros with an equivalent driver + toolkit package layout
  should work the same way but are not regularly tested). Verify with `nvidia-smi`


- A Cozystack management cluster with at least one GPU-enabled node.
- The GPU node runs a supported Linux distribution (Ubuntu, Debian, RHEL, Fedora, openSUSE) with the NVIDIA driver installed via the distro package manager. Verify with `nvidia-smi` over SSH or `kubectl debug node/<node-name>` — it must enumerate the physical GPUs and report a working driver version.
- `nvidia-container-toolkit` installed on the same node and registered with containerd (`grep nvidia /etc/containerd/config.toml` shows the runtime entry).

apt install nvidia-container-toolkit alone does not modify containerd config — registration is a separate manual step. A reader on a fresh node will fail this grep with no pointer to the fix. Suggest spelling out the registration:

- `nvidia-container-toolkit` installed on the same node and registered with containerd:

  ```bash
  sudo nvidia-ctk runtime configure --runtime=containerd
  sudo systemctl restart containerd
  grep nvidia /etc/containerd/config.toml   # must show the runtime entry
  ```


```bash
kubectl apply -f cuda-smoke.yaml
kubectl logs cuda-smoke
```

Run back-to-back, kubectl logs errors while the (large) CUDA base image is still pulling. Add a wait:

kubectl apply -f cuda-smoke.yaml
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/cuda-smoke --timeout=5m
kubectl logs cuda-smoke
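
The guide's actual `cuda-smoke.yaml` is not quoted in this thread. As a rough sketch of a manifest for which the wait above makes sense, the Pod needs `restartPolicy: Never` so that it can settle in phase `Succeeded`. The helper name `write_cuda_smoke` is hypothetical, and the image tag is an assumption chosen for illustration. Depending on how the `nvidia` runtime is registered on the node, a `runtimeClassName` may also be required.

```bash
# Hypothetical helper: write a minimal smoke-test Pod manifest to the given
# path (defaults to cuda-smoke.yaml in the current directory).
write_cuda_smoke() {
  cat > "${1:-cuda-smoke.yaml}" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke
spec:
  restartPolicy: Never        # lets the Pod finish in phase Succeeded
  containers:
    - name: cuda-smoke
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # assumed image tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # one whole GPU from the device plugin
EOF
}

# Usage:
#   write_cuda_smoke cuda-smoke.yaml
```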

- A Cozystack management cluster with at least one GPU-enabled node.
- The GPU node runs a supported Linux distribution (Ubuntu, Debian, RHEL, Fedora, openSUSE) with the NVIDIA driver installed via the distro package manager. Verify with `nvidia-smi` over SSH or `kubectl debug node/<node-name>` — it must enumerate the physical GPUs and report a working driver version.
- `nvidia-container-toolkit` installed on the same node and registered with containerd (`grep nvidia /etc/containerd/config.toml` shows the runtime entry).
- `kubectl` configured against the management cluster.

Minor gotcha worth one prerequisite line: the container variant relies on the upstream default workload container for unlabeled nodes. A node still carrying nvidia.com/gpu.workload.config=vm-passthrough from the GPU Passthrough guide overrides that per-node and the device plugin won't serve it — a likely trip-up when migrating a node off the passthrough setup.

- The GPU node must not carry a `nvidia.com/gpu.workload.config` label left over from the
  passthrough setup (`kubectl label node <node-name> nvidia.com/gpu.workload.config-` to remove).
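
To find such leftover labels before installing the variant, the label value can be shown as a column for every node. A small sketch under stated assumptions: the two helper names are hypothetical, while the label key and the removal syntax are the ones given in the comment above.

```bash
# Hypothetical helpers built around the label named in this comment.

# Show the workload-config label as an extra column for every node;
# the column stays empty for nodes that do not carry the label.
show_workload_config() {
  kubectl get nodes -L nvidia.com/gpu.workload.config
}

# Remove the label from one node; the trailing "-" after the key is
# kubectl's syntax for deleting a label.
clear_workload_config() {
  kubectl label node "$1" nvidia.com/gpu.workload.config-
}

# Usage (the node name is a placeholder):
#   show_workload_config
#   clear_workload_config gpu-node-1
```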

Document the new container variant of cozystack.gpu-operator, paired with
cozystack/cozystack#2766. Covers the apt-installed-driver-and-toolkit
Linux shape that the variant targets: when to pick it over the
passthrough and vGPU variants, prerequisites (host driver + host
nvidia-container-toolkit registered with containerd via
nvidia-ctk runtime configure, validated with nvidia-smi over
kubectl debug), the host-driver reuse path (driver.enabled=false, so the
operator uses the pre-installed driver at its standard location with no
driverInstallDir override needed on a stock apt install), the Talos
caveat with a pointer to the values-native-talos.yaml reference, install
steps, a sample CUDA pod for verification, the variant comparison
matrix, and a note on why stacking HAMi directly on the container
variant on the management cluster is not a supported combination yet
(both register nvidia.com/gpu).

Lands under operations/ — symmetric with virtualization/gpu.md (VM
passthrough on management cluster) and kubernetes/gpu-sharing.md (HAMi
in tenant Kubernetes addons).

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
@lexfrei Aleksei Sviridkin (lexfrei) force-pushed the feat/gpu-container-workloads-docs branch from b9cae43 to f2ae9b7 on June 8, 2026 14:16
@lexfrei

Thanks — addressed in the latest push.

HAMi (the blocker) — rewritten. You're right: HAMi ships its own device plugin, the operator-device-plugin auto-disable lives only in the tenant kubernetes app chart, and sources/hami.yaml only declares dependsOn for ordering. The page now says stacking cozystack.hami directly on the container variant on the management cluster is not supported yet (both register nvidia.com/gpu), and the intro line is softened to match.

OS support — narrowed to Ubuntu/Debian as tested; RHEL/Fedora/openSUSE are no longer presented as supported, just "should work but not regularly tested."

containerd registration — spelled out with the explicit nvidia-ctk runtime configure --runtime=containerd + restart + grep block.

Leftover nvidia.com/gpu.workload.config label — added as a prerequisite with the removal command.

CUDA smoke pod — added kubectl wait --for=jsonpath='{.status.phase}'=Succeeded before kubectl logs.

Validator path — same reframe as the code PR: dropped /host/usr/bin/nvidia-smi, now "host driver at its standard location, no driverInstallDir override on apt."

On the bot's namespace suggestions (-n cozy-system / namespace: cozy-system on the Package CR): left out deliberately — Cozystack's own canonical examples (packages/core/installer/example/platform.yaml, examples/values-native-talos.yaml) create Package CRs with no namespace, so adding one would diverge from the shipped convention. The current doc uses kubectl apply -f, not kubectl patch, so that suggestion doesn't apply either.

Sequencing: agreed — this should land with / after cozystack/cozystack#2766. The page is in the next/ tree so it tracks the unreleased variant.
