Draft: docs(training-hub): QLoRA + CPT guides + e2e (8.2.3) #270

Draft
typhoonzero wants to merge 24 commits into master from loop/2026-06-22-traininghub-qlora-cpt

@typhoonzero

Contributor

Daily-loop dev (2026-06-22). Adds runnable QLoRA + CPT tutorials, training-hub-fine-tuning.mdx sections, e2e cases c13/c14. GPU smoke SKIPped (A30 saturated). See .docs/loop/worklog-2026-06-22.md. Draft.

typhoonzero and others added 24 commits June 9, 2026 10:23
- Narrow scope to Claude Code only; remove opencode and Codex CLI sections
- Add how to configure reasoning effort when starting the InferenceService
  (server-side --reasoning-effort flag and request-time override)
- Update Claude Code section with corrected proxy setup for LiteLLM and
  claude-code-router (config-driven, ccr code startup command)
- Qwen3.6 and Gemma 4 recommendations and Unsloth quantized model list
  already present; no change needed

The flag does not exist in vLLM. Replaced with accurate guidance about
server-wide control via --chat-template and request-level parameters.

- Remove list preceding code block to avoid remark-lint-code-block-split-list
- Replace Python dict literals with dict() constructor to avoid JSX parsing

The pipelines-mlflow-integration example did not run as written. Fixes
verified against MLflow + KFP on g1-c1-x86:

- Import mlflow inside each @dsl.component (KFP v2 packages components from
  their own source; a module-level import raises NameError at runtime).
- Replace dsl.RUN_ID_PLACEHOLDER (removed in KFP v2) with
  dsl.PIPELINE_JOB_ID_PLACEHOLDER, passed in as a component argument.
- Document the secured-install access path: the mlflow-tracking-server
  Service fronts oauth2-proxy (302s headless clients), so components need a
  direct in-cluster Service, a ServiceAccount bearer token
  (MLFLOW_TRACKING_TOKEN), workspace RBAC, and a warm-up retry.
- Fix the Trainer v2 example (trainer.kubeflow.org/v1alpha1 TrainJob with
  runtimeRef/trainer, not TrainingJob/v1 with a raw pod template).
- Fix client.get_run_id -> run.run_id and the Tools menu path.

Also:
- Drop files unrelated to this PR's scope (agentic_mlops index + nav row,
  qwen3 finetune notebook) carried in from the coding-agents base branch.
- Remove dead _retry_kubectl_stdin_novalidate() from e2e/lib.sh.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ethod

Cross-checked against mlflow-plugin/mlflow-kubernetes-plugins:

- Name the canonical mechanism: the server's `kubernetes-auth` plugin
  authorizes via Kubernetes RBAC and accepts a ServiceAccount bearer token
  (Authorization / X-Forwarded-Access-Token) + X-MLFLOW-WORKSPACE.
- Fix caller RBAC resources to the plugin's API group set
  (experiments / datasets / registeredmodels); `runs` is not a resource
  (run writes authorize against `experiments`).
- Add the canonical out-of-cluster token path
  (`kubectl create token`) alongside the in-pod projected token.
- Document workspace selection via set_workspace() / MLFLOW_WORKSPACE.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Per mlflow-plugin/mlflow-kubernetes-plugins/docs/authorization-plugin.md:

- Lead with the identity-token method: the server's `kubernetes-auth`
  plugin (user_identity_token mode) authenticates the caller from the bearer
  token's identity claims, authorizes that identity, and records it as the
  MLflow run owner. The client authenticates with the token before any API
  call.
- Note the credential is a Kubernetes ServiceAccount token (the
  platform-wide `kubectl create token` pattern; sub claim is the identity).
- Add a security warning: because user_identity_token reads claims
  unverified (the oauth2-proxy is the verifier), a direct endpoint must be
  network-restricted / not exposed via ingress, or run the server in
  self_subject_access_review mode.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e test

Reworks the KFP + MLflow guide to authenticate with a platform user identity
token only — no ServiceAccount, no per-workspace RBAC, no extra in-cluster
Service:

- The MLflow kubernetes-auth plugin (user_identity_token mode) takes the caller
  identity from the bearer token's claims and records it as the run owner.
- Components reach MLflow through the platform Kubernetes API
  (…/kubernetes/<cluster>/…/pods/<pod>:5000/proxy/…) and forward identity via
  X-Forwarded-Access-Token; the shipped Service only exposes the browser OAuth
  proxy, so this avoids it without creating anything.
- Removed the direct-Service, ServiceAccount-token, and RBAC sections.
- KFP example now uses a stdlib REST helper (no mlflow SDK install needed) and
  passes the token as a parameter (source from a Secret).

Adds e2e/mlflow-user-identity-smoke.sh: logs a run with a user token and asserts
the run owner equals the token identity. Verified on g1-c1-x86 (run owner
admin@cpaas.io); the pipeline example compiles with kfp 2.11.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

New how_to/mlflow-python-sdk.mdx: how to drive the stock mlflow>=3.10 SDK
against the auth + multi-tenant Alauda AI MLflow server with a platform user
identity token — no ServiceAccount, no per-workspace RBAC, no extra Service.
Covers MLFLOW_TRACKING_TOKEN auth, mlflow.set_workspace, the port-forward
connection to the app port (raw tunnel preserves Authorization), model
registry, the smoke test, and troubleshooting (302 / token-newline / 401 /
403). Verified on g1-c1-x86: runs are owned by the token identity.
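
The token-newline troubleshooting item can be illustrated with a minimal sketch. It assumes the token is kept in a file (the path and both helper names are hypothetical, not taken from the guide) and uses only `MLFLOW_TRACKING_TOKEN`, the variable named in the commit message. A trailing newline from `echo` or a mounted Secret is stripped before the value reaches the environment.

```python
import os
from pathlib import Path


def load_tracking_token(path: str) -> str:
    """Read a bearer token from a file and strip surrounding whitespace."""
    token = Path(path).read_text(encoding="utf-8").strip()
    if not token:
        raise ValueError(f"token file {path!r} is empty")
    # Whitespace inside the token would corrupt the Authorization header.
    if any(ch.isspace() for ch in token):
        raise ValueError("token contains embedded whitespace")
    return token


def export_tracking_token(path: str) -> None:
    """Publish the cleaned token where the MLflow client looks for it."""
    os.environ["MLFLOW_TRACKING_TOKEN"] = load_tracking_token(path)
```

Calling `export_tracking_token(...)` before the first MLflow call keeps the header well-formed. It does not validate the token itself.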

Cross-linked from mlflow.mdx Client Configuration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cess)

Rework mlflow-python-sdk.mdx so the MLflow Python client always goes through
the oauth2-proxy (the platform MLflow route) instead of port-forwarding to the
container port:

- Interactive: present the browser SSO session — copy the _oauth2_proxy cookie
  and attach it via a runtime-registered RequestHeaderProvider (verified: the
  provider injects the header and the run is owned by the caller identity).
- Headless/automation: admin enables oauth2-proxy --skip-jwt-bearer-tokens, then
  the client uses MLFLOW_TRACKING_TOKEN with a platform OIDC token.

Removes the kubectl port-forward / app-port connection entirely.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- SDK guide "Headless / automation": mint a short-lived Dex id token from a
  long-lived refresh token (refresh-token grant at /dex/token), then use it as
  MLFLOW_TRACKING_TOKEN through the OAuth proxy. Refresh before the 24h id-token
  expiry instead of carrying a static token.
- Rework the smoke test to the same method: refresh token -> id token -> log to
  MLflow via the platform route (through oauth2-proxy, no container-port access),
  asserting the run owner equals the token identity. Requires the proxy's
  --skip-jwt-bearer-tokens.
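
The refresh-token step above can be sketched as follows. This is an illustration under assumptions, not the guide's helper: the function names, the `scope` default, and the five-minute margin are hypothetical, and the client is treated as public (a confidential client would also need to send its secret). The request is only built, not sent.

```python
import time
import urllib.parse
import urllib.request


def build_refresh_request(token_url, client_id, refresh_token, scope="openid"):
    """Build (but do not send) an OAuth2 refresh-token grant request."""
    form = {
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
        "client_id": client_id,  # assumption: public client, no secret
        "scope": scope,
    }
    body = urllib.parse.urlencode(form).encode("ascii")
    return urllib.request.Request(
        token_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def needs_refresh(exp, now=None, margin_seconds=300):
    """True once the id token is within margin_seconds of its exp claim."""
    now = time.time() if now is None else now
    return now >= exp - margin_seconds
```

Passing the request to `urllib.request.urlopen` would return a JSON body; the `id_token` field of that response is what the commit uses as `MLFLOW_TRACKING_TOKEN`.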

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- SDK guide "Headless / automation": mint a Dex id token with the OAuth2
  password grant (grant_type=password at /dex/token) — one call, no browser/
  cookie — then use it as MLFLOW_TRACKING_TOKEN through the OAuth proxy.
  Requires a Dex client whose grantTypes include "password" + the proxy's
  --skip-jwt-bearer-tokens. Warns to use a dedicated service account (ROPC
  sends the password) and store creds in a Secret.
- Rework the smoke test to ROPC: username/password -> Dex id token -> log to
  MLflow via the platform route (through oauth2-proxy), asserting run owner ==
  token identity.

Verified ROPC mints a valid Dex id token (iss=dex, aud=alauda-auth, key in
Dex JWKS) on g1-c1-x86.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mlflow-python-sdk.mdx now leads with the OAuth2 password grant: mint a Dex id
token from a username/password at /dex/token, then use it as
MLFLOW_TRACKING_TOKEN through the OAuth proxy. Adds an admin "Platform setup"
section (--skip-jwt-bearer-tokens + a password-grant Dex client). The browser
session-cookie flow is kept as a secondary "interactive alternative".

Verified end-to-end on g1-c1-x86 (run owner = the token's user identity).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- SDK guide: set_tracking_uri now uses the in-cluster Service
  http://mlflow-tracking-server.kubeflow:5000 (still via the OAuth proxy) for
  in-cluster clients; note the platform route for outside-the-cluster use.
- Pipelines guide: rewritten to use the MLflow Python client against the
  in-cluster Service with MLFLOW_TRACKING_TOKEN injected from a Secret
  (kfp-kubernetes use_secret_as_env), and reference the SDK guide for auth/RBAC
  and minting the token (password grant). Drops the raw-REST/container-port
  helper. Trainer v2 example points MLFLOW_TRACKING_URI at the in-cluster
  Service. Example compiles with kfp 2.11 + kfp-kubernetes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The MLflow usage docs under training_guides now point to
how_to/mlflow-python-sdk.mdx for authentication (MLFLOW_TRACKING_TOKEN) and
workspace/RBAC on secured installs, where the bare MLFLOW_TRACKING_URI /
report_to: mlflow setup is not sufficient:

- fine-tuning-using-notebooks.mdx (Experiment tracking sections)
- fine-tune-with-trainer-v2.ipynb (Step 5: View Training Metrics in MLflow)

Also corrects the menu path to Alauda AI -> Tools -> MLFlow.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…roxy's own Dex client

A dedicated Dex client cannot be used for the password grant on this
platform: the OAuth proxy validates that the token audience equals its
own client_id, so a separate client's token is rejected at the proxy.
Document enabling `password` in the grantTypes of the proxy's own
OAuth2Client (verified against the live cluster), with the kubectl
patch, the aud constraint, and a security caveat. Update the matching
troubleshooting row.
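
When debugging the aud constraint described above, it can help to look at a token's claims locally. The sketch below is a debugging aid only: it decodes the JWT payload without verifying the signature, and the helper names are hypothetical. The expected client id is whatever the proxy is configured with (`alauda-auth` on the cluster mentioned in these commits).

```python
import base64
import json


def decode_claims_unverified(jwt: str) -> dict:
    """Decode the payload segment of a JWT. No signature check is done."""
    parts = jwt.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def audience_matches(claims: dict, client_id: str) -> bool:
    """aud may be a single string or a list of strings."""
    aud = claims.get("aud")
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    return client_id in audiences
```

A token minted by a separate Dex client would make `audience_matches(claims, "alauda-auth")` return `False`, which is the rejection case the commit describes.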

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ookie)

ROPC needs the password grant on the shared alauda-auth client, i.e. a
change to the global auth server — which is off-limits. The platform
already allows the authorization_code grant, and its login API is
scriptable (PKCE; captcha is retry-gated, so a clean first login needs
none). Rewrite the SDK guide around two browser-free methods, both
verified end-to-end on g1-c1-x86:

- Bearer token (primary): scripted authorization_code+PKCE -> id_token
  as MLFLOW_TRACKING_TOKEN, renewed via the refresh_token grant. Needs
  --skip-jwt-bearer-tokens on the MLflow proxy (workload cluster, not
  global auth). Python helper + curl; both verified.
- Session cookie (fallback): same scripted login fed to the proxy
  callback -> _oauth2_proxy cookie. Zero platform changes.
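
The PKCE part of the scripted authorization_code flow can be sketched with the standard library. This follows the S256 method from RFC 7636; the function names are hypothetical and the guide's own helper may differ.

```python
import base64
import hashlib
import secrets


def challenge_for(verifier: str) -> str:
    """S256 code_challenge: base64url(sha256(verifier)) without padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def make_pkce_pair(num_bytes: int = 32):
    """Return (code_verifier, code_challenge) for one login attempt."""
    raw = secrets.token_bytes(num_bytes)
    verifier = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
    return verifier, challenge_for(verifier)
```

The challenge goes on the authorization request with `code_challenge_method=S256`; the verifier is sent later, when the code is exchanged at the token endpoint.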

Point pipelines-mlflow-integration at the SDK guide's token flow instead
of the password grant (and fix the renamed platform-setup anchor).
Rewrite the e2e smoke test to exercise both legs (token leg SKIPs
cleanly when skip-jwt is off) and fix a cleanup bug where the
_oauth2_proxy cookie value contains '|', which collided with the
delimiter and leaked experiments.
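
The delimiter collision can be reproduced in a few lines. The record layout here (`experiment_id|credential`) is a hypothetical stand-in for whatever the smoke test tracks, and limiting the split is one possible fix, not necessarily the one applied in this commit.

```python
def parse_naive(record: str):
    """Buggy: assumes the record contains exactly one '|'."""
    experiment_id, credential = record.split("|")
    return experiment_id, credential


def parse_safe(record: str):
    """Split on the first '|' only, so a '|' inside the value survives."""
    experiment_id, credential = record.split("|", 1)
    return experiment_id, credential


record = "42|abc|def|ghi=="  # credential part itself contains '|'
print(parse_safe(record))
```

`parse_naive(record)` raises `ValueError` on the same input, so a cleanup loop built on it would skip the record and leave the experiment behind.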

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…l -> 302 -> follows redirect to platform HTTPS)

The MLflow SDK reports SSLCertVerificationError when the proxy rejects the
credential: it 302s to the login page and the client follows it to the
platform's self-signed HTTPS endpoint. Document the real cause (fix the
credential, not the TLS) and note the in-cluster http:// Service URL plus
MLFLOW_TRACKING_INSECURE_TLS for the external route in the cookie section.
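
One way to confirm that the TLS error is really a rejected credential is to probe the endpoint without following redirects. The sketch below uses only `urllib`; `probe` and `NoRedirect` are hypothetical names, and the URL and token are supplied by the caller.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Refuse to follow redirects so the 302 itself stays visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError for the 3xx response


def probe(url: str, token: str, timeout: float = 10.0):
    """Return (status, location) for one request made with a bearer token."""
    opener = urllib.request.build_opener(NoRedirect)
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    try:
        with opener.open(req, timeout=timeout) as resp:
            return resp.status, None
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get("Location")
```

A result such as `(302, <login URL>)` points at the credential, while a 2xx status means the proxy accepted it and any remaining TLS error has a different cause.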

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Close the L0 8.2.3 gaps (QLoRA, continued pre-training) in the Training Hub
corpus, matching the existing SFT/OSFT tutorial style. APIs verified against the
real traininghub0.1-cu126-amd64 v0.1.0 runtime image on g1-c1-x86:
  - QLoRA  -> training_hub.lora_sft(load_in_4bit=True, bnb_4bit_quant_type="nf4")
  - CPT    -> training_hub.sft(is_pretraining=True, block_size, document_column_name)
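
The two mappings above can be written out as keyword sets. Only `load_in_4bit`, `bnb_4bit_quant_type`, `is_pretraining`, `block_size`, and `document_column_name` come from the commit message. The path arguments (`model_path`, `data_path`, `ckpt_output_dir`), every value, and the wrapper functions are assumptions for illustration. `training_hub` is imported lazily so the sketch loads without the runtime image.

```python
QLORA_KWARGS = dict(
    model_path="/models/base",        # hypothetical path
    data_path="/data/train.jsonl",    # hypothetical path
    ckpt_output_dir="/ckpt/qlora",    # hypothetical path
    load_in_4bit=True,                # from the commit message
    bnb_4bit_quant_type="nf4",        # from the commit message
)

CPT_KWARGS = dict(
    model_path="/models/base",        # hypothetical path
    data_path="/data/corpus.jsonl",   # hypothetical path
    ckpt_output_dir="/ckpt/cpt",      # hypothetical path
    is_pretraining=True,              # from the commit message
    block_size=2048,                  # hypothetical value
    document_column_name="document",  # hypothetical value
)


def run_qlora(**overrides):
    """Call training_hub.lora_sft with the QLoRA settings (needs a GPU)."""
    from training_hub import lora_sft
    return lora_sft(**{**QLORA_KWARGS, **overrides})


def run_cpt(**overrides):
    """Call training_hub.sft in continued pre-training mode."""
    from training_hub import sft
    return sft(**{**CPT_KWARGS, **overrides})
```

The tutorials added in this PR remain the reference for the full argument list.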

- training-hub-fine-tuning.mdx: algorithm table now covers SFT/OSFT/QLoRA/CPT;
  new "QLoRA (4-bit LoRA)" and "Continued pre-training (CPT)" sections
  (CUDA-first, with Ascend NPU notes); notebooks added to the examples table.
- qlora-comprehensive-tutorial.ipynb / cpt-comprehensive-tutorial.ipynb:
  runnable comprehensive notebooks mirroring the SFT/OSFT tutorials.
- e2e cases c13 (QLoRA) and c14 (CPT): self-contained (synthetic tiny Qwen2 +
  synthetic data, no model/corpus download). c13 drives lora_sft 4-bit QLoRA
  with a trl+peft+bitsandbytes fallback and an sm_75 arch guard; c14 drives
  sft(is_pretraining=True). Both SKIP (rc=77) with the captured scheduler event
  when no GPU slice is schedulable.
- run_all.sh: wire C13 (active); C14 left commented (like C4) until GPU frees up.

E2E smoke (g1-c1-x86, ns mlops-demo-e2e): c13 attempted for real -> SKIP(77):
the only Ampere+ GPU (A30, sm_80) is 100% reserved by a persistent 27B inference
pod (gpumem=24k/gpucores=100), and the only free GPU is a P100 (sm_60, no
bitsandbytes 4-bit). Scheduler: FailedScheduling / CardInsufficientMemory.
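
The arch guard and SKIP convention from c13 can be sketched as below. It assumes the guard is inclusive at sm_75 and that the capability arrives as a `(major, minor)` pair; the function names are hypothetical and the real e2e script may be structured differently.

```python
import sys

SKIP_RC = 77               # exit code the e2e cases use for SKIP
MIN_CAPABILITY = (7, 5)    # assumption: guard is "sm_75 or newer"


def supports_4bit(capability, minimum=MIN_CAPABILITY) -> bool:
    """Compare a (major, minor) compute capability against the minimum."""
    return tuple(capability) >= tuple(minimum)


def guard_or_skip(capability) -> int:
    """Return 0 to proceed, or SKIP_RC when the GPU is too old."""
    if not supports_4bit(capability):
        major, minor = capability
        print(f"SKIP: sm_{major}{minor} is below sm_75", file=sys.stderr)
        return SKIP_RC
    return 0
```

With the GPUs named above, the A30 (sm_80) passes and the P100 (sm_60) yields 77. On a live node the pair can be read with `torch.cuda.get_device_capability()`.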

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 24, 2026


Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ee0f99a1-215b-47af-b493-745a3b9dc473

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.



2 participants