docs(operations): add containerized GPU workloads guide #555
Aleksei Sviridkin (lexfrei) wants to merge 1 commit into
Walkthrough: Adds a new operations guide documenting how to run containerized GPU workloads on Cozystack management nodes using the […]
Changes: GPU Container Workloads Documentation
Code Review
This pull request adds a new documentation page detailing how to run containerized GPU workloads using the container variant of the cozystack.gpu-operator package. The review feedback suggests specifying the cozy-system namespace in both the kubectl patch command and the Package resource manifest to ensure they are applied to the correct namespace.
>     kubectl patch packages.cozystack.io cozystack.cozystack-platform --type=json \
>       -p '[{"op": "add", "path": "/spec/components/platform/values/bundles/enabledPackages/-", "value": "cozystack.gpu-operator"}]'

In Cozystack, the Package resources (including cozystack.cozystack-platform) are typically located in the cozy-system namespace. Running kubectl patch without specifying the namespace will fail if the user's current context is set to another namespace (like default). Adding -n cozy-system ensures the command runs successfully.

Suggested change:

    - kubectl patch packages.cozystack.io cozystack.cozystack-platform --type=json \
    -   -p '[{"op": "add", "path": "/spec/components/platform/values/bundles/enabledPackages/-", "value": "cozystack.gpu-operator"}]'
    + kubectl patch packages.cozystack.io cozystack.cozystack-platform -n cozy-system --type=json \
    +   -p '[{"op": "add", "path": "/spec/components/platform/values/bundles/enabledPackages/-", "value": "cozystack.gpu-operator"}]'

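The patch in this thread relies on the JSON Patch `add` operation with a path ending in `/-`, which appends to the end of an existing array. A minimal sketch of building that patch for an arbitrary package name is shown below. `build_enable_patch` is a hypothetical helper introduced here for illustration; it is not part of Cozystack or kubectl.

```bash
# Sketch: build the JSON Patch that appends one package name to enabledPackages.
# build_enable_patch is a hypothetical helper, not a Cozystack or kubectl command.
build_enable_patch() {
  pkg="$1"
  # "/-" is the JSON Patch (RFC 6902) token for "append to the end of the array".
  printf '[{"op": "add", "path": "/spec/components/platform/values/bundles/enabledPackages/-", "value": "%s"}]\n' "$pkg"
}

# Print the patch so it can be reviewed before it is applied.
build_enable_patch "cozystack.gpu-operator"

# Usage (needs cluster access, so it is left commented out):
# kubectl patch packages.cozystack.io cozystack.cozystack-platform --type=json \
#   -p "$(build_enable_patch cozystack.gpu-operator)" --dry-run=server -o yaml
```

Note that an `add` to `/-` only succeeds when the `enabledPackages` array already exists; on an object without that array the patch is rejected rather than creating it.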
>     apiVersion: cozystack.io/v1alpha1
>     kind: Package
>     metadata:
>       name: cozystack.gpu-operator
>     spec:
>       variant: container

The Package resource needs to be created in the cozy-system namespace for the Cozystack operator to detect and reconcile it. Adding namespace: cozy-system to the metadata ensures it is applied to the correct namespace.

Suggested change:

      apiVersion: cozystack.io/v1alpha1
      kind: Package
      metadata:
        name: cozystack.gpu-operator
    +   namespace: cozy-system
      spec:
        variant: container

Branch updated from 3170d45 to 8b83e54.
Actionable comments posted: 0
Branch updated from 8b83e54 to b9cae43.
myasnikovdaniil left a comment:
Thanks — this is a well-researched page and most of it checks out against the companion PR cozystack/cozystack#2766 and the platform chart. A few substantive items before merge.
Main blocker: the Fractional GPU sharing section directs users into a device-plugin registration conflict (see inline comment). HAMi does not reuse the operator's device plugin — it ships its own, and the auto-disable that prevents the clash only exists in the tenant kubernetes app chart, not on the management cluster. The container variant pins devicePlugin.enabled: true, so stacking cozystack.hami on top as written runs two plugins both registering nvidia.com/gpu.
Sequencing: cozystack/cozystack#2766 (which adds the container variant) is still open. This page documents a variant that doesn't exist yet — please hold merge until #2766 lands, or confirm both ship in the same release train.
Smaller accuracy/UX fixes inline. Recommendation: request changes.

> ## Fractional GPU sharing
>
> The `container` variant exposes whole GPUs through the upstream NVIDIA device plugin. To slice one GPU across multiple pods (memory and compute quotas per pod), enable HAMi on top — HAMi reuses the same device plugin layer and is wired in via the `cozystack.hami` package, which already depends on `cozystack.gpu-operator`. See [GPU Sharing with HAMi](/docs/next/kubernetes/gpu-sharing/) for the tenant Kubernetes flow; for management-cluster workloads the wiring is the same package set with HAMi enabled.

- "HAMi reuses the same device plugin layer" is wrong. HAMi ships its own device plugin + scheduler extender. The page you link to states the opposite: "When HAMi is enabled, GPU Operator's built-in device plugin is automatically disabled to avoid resource registration conflicts."
- That auto-disable only lives in the tenant `kubernetes` app chart (`packages/apps/kubernetes/tests/gpu_operator_hami_test.yaml`: "should disable devicePlugin when hami is enabled"). The management-cluster `cozystack.hami` PackageSource only declares `dependsOn: cozystack.gpu-operator` (install ordering); `packages/system/hami/values.yaml` does not touch the operator's device plugin.
- The `container` variant pins `devicePlugin.enabled: true` (`values-container.yaml` in #2766). Stacking `cozystack.hami` on top, as written, runs two device plugins both registering `nvidia.com/gpu`, which is exactly the conflict the HAMi doc warns about.

Suggested rewrite:
The `container` variant exposes whole GPUs through the upstream NVIDIA device plugin.
For fractional sharing (per-pod memory and compute quotas), see
[GPU Sharing with HAMi](/docs/next/kubernetes/gpu-sharing/) — currently documented for
tenant Kubernetes clusters, where enabling HAMi automatically disables the GPU Operator's
built-in device plugin to avoid resource-registration conflicts. Stacking the
`cozystack.hami` package directly on top of the `container` variant on the management
cluster is not a supported combination yet: the variant pins the NVIDIA device plugin on,
and running it alongside HAMi's device plugin causes both to register `nvidia.com/gpu`.

The intro at line 10 ("you can stack HAMi on top once the container variant is up") echoes the same claim and should be softened to match.

> ## Prerequisites
>
> - A Cozystack management cluster with at least one GPU-enabled node.
> - The GPU node runs a supported Linux distribution (Ubuntu, Debian, RHEL, Fedora, openSUSE) with the NVIDIA driver installed via the distro package manager. Verify with `nvidia-smi` over SSH or `kubectl debug node/<node-name>` — it must enumerate the physical GPUs and report a working driver version.

The companion PR's own OS-support table (docs/gpu-vgpu.md in #2766) only covers Ubuntu 20.04–26.04 and Talos. Cozystack's documented node-OS surface is Talos + Ubuntu/Debian (ansible path). Listing RHEL/Fedora/openSUSE as "supported" presents untested territory as fact.

Suggested change:

    - The GPU node runs Ubuntu or Debian with the NVIDIA driver installed via the distro
      package manager (other distros with an equivalent driver + toolkit package layout
      should work the same way but are not regularly tested). Verify with `nvidia-smi` …

> - `nvidia-container-toolkit` installed on the same node and registered with containerd (`grep nvidia /etc/containerd/config.toml` shows the runtime entry).

apt install nvidia-container-toolkit alone does not modify containerd config — registration is a separate manual step. A reader on a fresh node will fail this grep with no pointer to the fix. Suggest spelling out the registration:

- `nvidia-container-toolkit` installed on the same node and registered with containerd:
  ```bash
  sudo nvidia-ctk runtime configure --runtime=containerd
  sudo systemctl restart containerd
  grep nvidia /etc/containerd/config.toml  # must show the runtime entry
  ```

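The `grep` check above can be wrapped so that a missing runtime entry produces a pointer to the fix instead of empty output. The following is a minimal sketch, assuming the runtime is declared as a TOML table whose header ends in `runtimes.nvidia]`. `has_nvidia_runtime` is a hypothetical helper, and the default path is the one quoted in this review; newer toolkit versions may write the entry to a different file, so pass that path as the first argument if needed.

```bash
# Sketch: check whether a containerd config file declares an "nvidia" runtime.
# has_nvidia_runtime is a hypothetical helper; the default path is the one used
# in the review comment and may differ on your node.
has_nvidia_runtime() {
  config="${1:-/etc/containerd/config.toml}"
  # Match the runtime table header, e.g. [plugins."...".containerd.runtimes.nvidia]
  grep -Eq 'runtimes\.nvidia\]' "$config" 2>/dev/null
}

if has_nvidia_runtime "$@"; then
  echo "nvidia runtime is registered"
else
  echo "nvidia runtime not found; run: sudo nvidia-ctk runtime configure --runtime=containerd"
fi
```

Matching on the table header rather than the bare word `nvidia` avoids false positives from comments or unrelated keys that happen to mention the vendor name.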
> ```bash
> kubectl apply -f cuda-smoke.yaml
> kubectl logs cuda-smoke

Run back-to-back, kubectl logs errors while the (large) CUDA base image is still pulling. Add a wait:

    kubectl apply -f cuda-smoke.yaml
    kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/cuda-smoke --timeout=5m
    kubectl logs cuda-smoke

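For anyone reproducing this sequence outside the guide, a minimal end-to-end sketch follows. The manifest is hypothetical and is not the `cuda-smoke.yaml` from the page under review: the image tag, the file path, and the `run_smoke` helper are all assumptions made for illustration.

```bash
# Sketch: a hypothetical cuda-smoke.yaml plus the apply/wait/logs sequence.
# The image tag, file path, and run_smoke helper are assumptions, not taken from the guide.
MANIFEST="${MANIFEST:-./cuda-smoke.yaml}"

cat > "$MANIFEST" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke
spec:
  restartPolicy: Never          # lets the pod reach phase Succeeded
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed tag; use one you have verified
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1     # request one whole GPU from the device plugin
EOF

# Apply the pod, block until it has finished successfully, then print its output.
run_smoke() {
  kubectl apply -f "$MANIFEST" &&
    kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/cuda-smoke --timeout=5m &&
    kubectl logs cuda-smoke
}

echo "wrote $MANIFEST; call run_smoke against a cluster with a GPU node"
```

`restartPolicy: Never` matters here: without it the pod never settles in the `Succeeded` phase and the wait runs into its timeout. If the container fails, the phase becomes `Failed` and the wait also times out, so `kubectl describe pod cuda-smoke` is the next place to look.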
> - `kubectl` configured against the management cluster.

Minor gotcha worth one prerequisite line: the container variant relies on the upstream default workload container for unlabeled nodes. A node still carrying nvidia.com/gpu.workload.config=vm-passthrough from the GPU Passthrough guide overrides that per-node and the device plugin won't serve it — a likely trip-up when migrating a node off the passthrough setup.

Suggested change:

    - The GPU node must not carry a `nvidia.com/gpu.workload.config` label left over from the
      passthrough setup (`kubectl label node <node-name> nvidia.com/gpu.workload.config-` to remove).

Document the new container variant of cozystack.gpu-operator, paired with cozystack/cozystack#2766.

Covers the apt-installed-driver-and-toolkit Linux shape that the variant targets: when to pick it over the passthrough and vGPU variants, prerequisites (host driver + host nvidia-container-toolkit registered with containerd via nvidia-ctk runtime configure, validated with nvidia-smi over kubectl debug), the host-driver reuse path (driver.enabled=false, so the operator uses the pre-installed driver at its standard location with no driverInstallDir override needed on a stock apt install), the Talos caveat with a pointer to the values-native-talos.yaml reference, install steps, a sample CUDA pod for verification, the variant comparison matrix, and a note on why stacking HAMi directly on the container variant on the management cluster is not a supported combination yet (both register nvidia.com/gpu).

Lands under operations/, symmetric with virtualization/gpu.md (VM passthrough on management cluster) and kubernetes/gpu-sharing.md (HAMi in tenant Kubernetes addons).

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Branch updated from b9cae43 to f2ae9b7.
Thanks, addressed in the latest push.

- HAMi (the blocker): rewritten. You're right: HAMi ships its own device plugin, the operator-device-plugin auto-disable lives only in the tenant […]
- OS support: narrowed to Ubuntu/Debian as tested; RHEL/Fedora/openSUSE are no longer presented as supported, just "should work but not regularly tested."
- containerd registration: spelled out with the explicit […]
- Leftover […]
- CUDA smoke pod: added […]
- Validator path: same reframe as the code PR: dropped […]
- On the bot's namespace suggestions ( […]
- Sequencing: agreed, this should land with / after cozystack/cozystack#2766. The page is in the […]

What this PR does

Add a new operations guide describing the `container` variant of `cozystack.gpu-operator`: the architectural mode for containerized GPU workloads (CUDA pods, ML training, inference) on Linux GPU nodes that already ship the NVIDIA driver and `nvidia-container-toolkit` via the distro package manager.

The new page lands at `content/en/docs/next/operations/gpu-container-workloads.md` and rounds out the GPU documentation surface:

- […] (`default` variant).
- […] (`container` variant).

Content covers when to pick the variant (host driver + host toolkit + a containerd-registered `nvidia` runtime prerequisite), the host-driver reuse path (`driver.enabled=false`, so the operator uses the pre-installed driver at its standard location with no `driverInstallDir` override on a stock apt install), the Talos caveat with a pointer to the `examples/values-native-talos.yaml` reference, install steps with `Package` CR `variant: container`, a sample CUDA pod for verification, why stacking HAMi directly on this variant is not supported yet, and a three-row variant comparison matrix.

Companion to cozystack/cozystack#2766, which adds the `container` variant itself.

Release note