docs: add inference_guide with validated 7B+ models (Ascend NPU) by EdisonSu768 · Pull Request #268 · alauda/aml-docs

EdisonSu768 · 2026-06-18T06:52:08Z

What

New docs/en/inference_guide/ section documenting already-validated open-weight LLM inference, mirroring the structure of training_guides/training-runtimes. All content is derived from verified deployments + benchmarks (no fabricated numbers).

Contents

Models (2, ≥7B/8B) ✅
Runtime images (GPU/NPU) — NPU vLLM-Ascend + MindIE catalog, with an NVIDIA GPU note ✅
Runtime YAML examples — namespace-scoped ServingRuntime + InferenceService (not ClusterServingRuntime), one per model/engine/TP combination ✅
Model + runtime + image benchmark results — measured guidellm open-loop per-replica numbers (4 workloads × rate 1–9), with the dense→vLLM / MoE→MindIE engine-selection finding ✅

Files

docs/en/inference_guide/
├── index.mdx                         # overview, runtime-image catalog, engine selection, methodology, deploy steps
├── qwen3-14b.mdx                     # dense model card + benchmark tables
├── qwen3-30b-a3b.mdx                 # MoE model card + benchmark tables
└── assets/
    ├── qwen3-14b/qwen3-14b-vllm-ascend-tp1.yaml
    ├── qwen3-14b/qwen3-14b-vllm-ascend-tp2.yaml
    ├── qwen3-30b-a3b/qwen3-30b-a3b-vllm-ascend-tp4.yaml
    └── qwen3-30b-a3b/qwen3-30b-a3b-mindie-tp4.yaml

Scope note

Every validated 7B+ model currently runs on Ascend 910B NPU; there is no verified GPU benchmark at this size yet (only Qwen3.5-0.8B on A30). The doc therefore leads with NPU and lists the NVIDIA GPU runtime as the platform default with that gap flagged, rather than inventing GPU numbers.

Verification

yarn lint → 0 errors / 0 warnings
yarn build → all 3 pages render
All 4 asset YAMLs parse as valid multi-doc YAML
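
The namespace-scoped claim in the description can also be spot-checked per asset. The sketch below is a plain text check, not a YAML parser, and the helper names `list_kinds` and `is_namespace_scoped_pair` are hypothetical. It assumes each asset declares `kind:` at column 0 of every document.

```shell
# Print the top-level `kind:` of every document in a manifest read from stdin.
list_kinds() {
  awk '/^kind:/ { print $2 }'
}

# Succeed only if the manifest holds a ServingRuntime and an InferenceService
# and does not contain a cluster-scoped ClusterServingRuntime.
is_namespace_scoped_pair() {
  local kinds
  kinds=$(list_kinds)
  printf '%s\n' "$kinds" | grep -qx 'ServingRuntime' || return 1
  printf '%s\n' "$kinds" | grep -qx 'InferenceService' || return 1
  if printf '%s\n' "$kinds" | grep -qx 'ClusterServingRuntime'; then
    return 1
  fi
  return 0
}

# Usage (not executed here):
#   is_namespace_scoped_pair < qwen3-14b-vllm-ascend-tp1.yaml && echo ok
```

For a stricter check, `kubectl apply --dry-run=client -f <file>` validates the manifest without creating anything.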

🤖 Generated with Claude Code

Summary by CodeRabbit

Documentation
- Updated inference guide with validated Qwen3.6-27B (W8A8) deployment configuration on Huawei Ascend 910B4.
- Added new deployment guide with benchmark methodology and performance results.
- Updated runtime image information and deployment examples.
- Enhanced guidance for multi-card tensor parallelism setup.

New `inference_guide/` section documenting already-validated open-weight LLM inference, mirroring the structure of `training_guides/training-runtimes`:

- Two validated models above the 7B/8B class:
  - Qwen3-14B (dense, BF16), recommended engine vLLM-Ascend
  - Qwen3-30B-A3B (MoE, BF16), recommended engine MindIE
- Runtime images catalog (NPU vLLM-Ascend + MindIE; NVIDIA GPU note)
- Per-model namespace-scoped `ServingRuntime` + `InferenceService` assets (not ClusterServingRuntime), one per engine/TP combination
- Measured open-loop per-replica benchmark tables (guidellm, 4 workloads, rate 1-9) with the dense→vLLM / MoE→MindIE engine-selection finding

Lint and build pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-18T06:52:25Z

Walkthrough

The inference guide is updated from Qwen3-30B-A3B to Qwen3.6-27B (W8A8) on Huawei Ascend 910B4. A new KServe YAML asset defines the ServingRuntime and InferenceService with TP=2 × 4 replicas. A new model-specific MDX page and updates to the guide index cover validated hardware, runtime image tags, benchmark workloads, deployment instructions, and caveats.

Changes

Inference Guide: Qwen3.6-27B (W8A8) on Ascend 910B4

Layer / File(s)	Summary
ServingRuntime + InferenceService YAML `docs/en/inference_guide/assets/qwen3-6-27b-w8a8/qwen3-6-27b-w8a8-vllm-ascend-tp2x4.yaml`	Adds the full KServe deployment manifest: a ServingRuntime wrapping `vllm serve` with a bash entrypoint that sources Ascend env scripts and derives `MODEL_PATH` from annotations, plus an InferenceService with 4 replicas (TP=2 each), PVC-backed weights, W8A8 quantization, no-prefix-caching, speculative decoding, CUDAGraph `FULL_DECODE_ONLY`, and Ascend910 resource requests/limits.
Inference guide index page `docs/en/inference_guide/index.mdx`	Rewrites the page intro, replaces the validated models table with Qwen3.6-27B (W8A8) on Ascend 910B4 ×8, updates runtime image tags with CANN-match and MindIE `qwen3_5` support notes, redefines benchmark workloads to three token-size pairs, updates the deploy walkthrough and curl example to the new model/asset path, and revises caveats to Ascend 910B4 resource keys.
Qwen3.6-27B (W8A8) model page `docs/en/inference_guide/qwen3-6-27b-w8a8.mdx`	New model documentation page with identity metadata (architecture, parameter counts, BF16 vs W8A8, HF/ModelScope links), hardware × stack compatibility table (vLLM-Ascend nightly supported, MindIE unsupported), deploy section with TP=2 × 4 topology and `max-num-seqs 32` guardrail, and benchmark results tables for Chat/Code/RAG workloads comparing Direct vs Gateway ingress.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

alauda/aml-docs#234: Shares documentation of Modelcar permission modes and multi-card TP>1 HCCL initialization requirements for vLLM-Ascend on Huawei Ascend 910B4, both of which are referenced in the updated Caveats section.

Poem

🐇 A new model hops into the guide,
W8A8 weights, eight Ascend cards wide,
TP=2 times four — replicas in a row,
vllm serve starts with an Ascend env flow,
Benchmarks logged, the gateway path too,
Qwen3.6-27B, validated and true! 🌸

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: adding inference guide documentation with validated models (Qwen3.6-27B W8A8) for Ascend NPU deployment.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/en/inference_guide/assets/qwen3-30b-a3b/qwen3-30b-a3b-mindie-tp4.yaml`:
- Around line 60-107: The bash script in the diff lacks strict error handling
mode, which means failed commands like `source` will be silently ignored and the
script will continue executing with a potentially incomplete configuration. Add
strict mode directives (set -e and optionally set -u and set -o pipefail)
immediately after the shebang and before the help() function definition to
ensure the script fails immediately if any command fails, preventing
mindieservice_daemon from starting with partial or broken configuration.

In `@docs/en/inference_guide/index.mdx`:
- Around line 78-85: The documentation instructs users to edit the manifest file
but then applies the remote URL directly using kubectl apply, which bypasses any
local edits. This means the user's changes to metadata.namespace, image tags,
and storageUri are ignored, leaving the deployment with unintended defaults. To
fix this, modify the instructions to first download the remote YAML file to a
local location using curl or wget (storing it in a variable or file), then edit
that local file, and finally apply the local file path instead of the remote URL
in the kubectl apply command.

In `@docs/en/inference_guide/qwen3-14b.mdx`:
- Around line 43-47: The bash code snippet includes a comment stating to "edit
namespace / image tag / storageUri first" but then immediately applies the
remote file directly without demonstrating any editing step, creating a mismatch
between the instructions and the actual command. Either modify the bash commands
to show how to download the file first (using curl or wget), edit it locally,
and then apply the local copy, or update the introductory comment to accurately
reflect that the remote file is being applied directly without local
modifications.

In `@docs/en/inference_guide/qwen3-30b-a3b.mdx`:
- Around line 49-53: The bash snippet instructs users to edit namespace, image
tag, and storageUri values before applying, but then immediately applies from a
remote URL without incorporating those edits. Restructure the snippet to
download the manifest file first using curl or wget into a local variable, then
apply the local file after editing. Alternatively, show how to apply the remote
URL with kubectl set or sed to inject the edited values, ensuring the documented
edit steps actually take effect when kubectl apply is executed.
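
The `sed` alternative mentioned in the last comment could look roughly like the sketch below. `set_namespace` is a hypothetical helper, and it assumes every `namespace:` field in the manifest should receive the same value. Image tags and `storageUri` would still need their own edits.

```shell
# Rewrite the value of every `namespace:` line in a manifest read from stdin.
# Kubernetes namespace names are DNS labels, so they cannot contain the `|`
# used here as the sed delimiter.
set_namespace() {
  local ns=$1
  sed -E "s|^([[:space:]]*namespace:[[:space:]]*).*|\1${ns}|"
}

# Usage (not executed here; $url is the raw asset URL from the guide):
#   curl -fsSL "$url" | set_namespace my-team | kubectl apply -f -
```

Piping through a filter keeps the one-command flow, while the download-then-edit approach leaves a local file that can be reviewed before it is applied.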


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ded490a7-b19e-4d10-bb06-8121804fb4c9

📥 Commits

Reviewing files that changed from the base of the PR and between 5cf3cff and 5aab0b6.

📒 Files selected for processing (7)

docs/en/inference_guide/assets/qwen3-14b/qwen3-14b-vllm-ascend-tp1.yaml
docs/en/inference_guide/assets/qwen3-14b/qwen3-14b-vllm-ascend-tp2.yaml
docs/en/inference_guide/assets/qwen3-30b-a3b/qwen3-30b-a3b-mindie-tp4.yaml
docs/en/inference_guide/assets/qwen3-30b-a3b/qwen3-30b-a3b-vllm-ascend-tp4.yaml
docs/en/inference_guide/index.mdx
docs/en/inference_guide/qwen3-14b.mdx
docs/en/inference_guide/qwen3-30b-a3b.mdx

coderabbitai · 2026-06-18T06:59:02Z

+            #!/bin/bash
+            # run_mindie.sh — start MindIE Service for a given model.
+            # Required: --model-name, --model-path. Optional: --ip, --max-seq-len,
+            # --max-iter-times, --world-size, ... (run with --help for the full list).
+            help() { awk -F'### ' '/^###/ { print $2 }' "$0"; }
+            if [[ $# == 0 ]] || [[ "$1" == "--help" ]]; then help; exit 1; fi
+
+            total_count=$(npu-smi info -l | grep "Total Count" | awk -F ':' '{print $2}' | xargs)
+            if [[ -z "$total_count" ]]; then
+                echo "Error: unable to read device info (npu-smi). Check permissions/devices."
+                exit 1
+            fi
+            echo "$total_count device(s) detected!"
+
+            echo "Setting toolkit envs..."
+            source /usr/local/Ascend/ascend-toolkit/set_env.sh
+            echo "Setting MindIE envs..."
+            source /usr/local/Ascend/mindie/set_env.sh
+
+            MF_SCRIPTS_ROOT=$(realpath "$(dirname "$0")")
+            export PYTHONPATH=$MF_SCRIPTS_ROOT/../:$PYTHONPATH
+
+            export MIES_INSTALL_PATH=/usr/local/Ascend/mindie/latest/mindie-service
+            CONFIG_FILE=${MIES_INSTALL_PATH}/conf/config.json
+
+            # defaults
+            BACKEND_TYPE="atb"; MAX_SEQ_LEN=16384; MAX_PREFILL_TOKENS=16384
+            MAX_ITER_TIMES=1536; MAX_INPUT_TOKEN_LEN=12288; TRUNCATION=false
+            HTTPS_ENABLED=false; MULTI_NODES_INFER_ENABLED=false; NPU_MEM_SIZE=-1
+            MAX_PREFILL_BATCH_SIZE=50; TEMPLATE_TYPE="Standard"; MAX_PREEMPT_COUNT=0
+            SUPPORT_SELECT_BATCH=false; IP_ADDRESS="0.0.0.0"; PORT=8080
+            MANAGEMENT_IP_ADDRESS="127.0.0.2"; MANAGEMENT_PORT=1026; METRICS_PORT=1027
+
+            while [[ "$#" -gt 0 ]]; do
+                case $1 in
+                    --model-path) MODEL_WEIGHT_PATH="$2"; shift ;;
+                    --model-name) MODEL_NAME="$2"; shift ;;
+                    --max-seq-len) MAX_SEQ_LEN="$2"; shift ;;
+                    --max-iter-times) MAX_ITER_TIMES="$2"; shift ;;
+                    --max-input-token-len) MAX_INPUT_TOKEN_LEN="$2"; shift ;;
+                    --max-prefill-tokens) MAX_PREFILL_TOKENS="$2"; shift ;;
+                    --world-size) WORLD_SIZE="$2"; shift ;;
+                    --ip) IP_ADDRESS="$2"; shift ;;
+                    --port) PORT="$2"; shift ;;
+                    *) echo "Unknown parameter: $1"; exit 1 ;;
+                esac
+                shift
+            done


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Startup script should fail fast on command errors.

Without strict mode, failed source/chmod/sed steps can be ignored and mindieservice_daemon may start with partial config.

🔧 Suggested fix

 #!/bin/bash
+set -euo pipefail
 # run_mindie.sh — start MindIE Service for a given model.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

#!/bin/bash
set -euo pipefail
# run_mindie.sh — start MindIE Service for a given model.
# Required: --model-name, --model-path. Optional: --ip, --max-seq-len,
# --max-iter-times, --world-size, ... (run with --help for the full list).
help() { awk -F'### ' '/^###/ { print $2 }' "$0"; }
if [[ $# == 0 ]] || [[ "$1" == "--help" ]]; then help; exit 1; fi

total_count=$(npu-smi info -l | grep "Total Count" | awk -F ':' '{print $2}' | xargs)
if [[ -z "$total_count" ]]; then
    echo "Error: unable to read device info (npu-smi). Check permissions/devices."
    exit 1
fi
echo "$total_count device(s) detected!"

echo "Setting toolkit envs..."
source /usr/local/Ascend/ascend-toolkit/set_env.sh
echo "Setting MindIE envs..."
source /usr/local/Ascend/mindie/set_env.sh

MF_SCRIPTS_ROOT=$(realpath "$(dirname "$0")")
export PYTHONPATH=$MF_SCRIPTS_ROOT/../:$PYTHONPATH

export MIES_INSTALL_PATH=/usr/local/Ascend/mindie/latest/mindie-service
CONFIG_FILE=${MIES_INSTALL_PATH}/conf/config.json

# defaults
BACKEND_TYPE="atb"; MAX_SEQ_LEN=16384; MAX_PREFILL_TOKENS=16384
MAX_ITER_TIMES=1536; MAX_INPUT_TOKEN_LEN=12288; TRUNCATION=false
HTTPS_ENABLED=false; MULTI_NODES_INFER_ENABLED=false; NPU_MEM_SIZE=-1
MAX_PREFILL_BATCH_SIZE=50; TEMPLATE_TYPE="Standard"; MAX_PREEMPT_COUNT=0
SUPPORT_SELECT_BATCH=false; IP_ADDRESS="0.0.0.0"; PORT=8080
MANAGEMENT_IP_ADDRESS="127.0.0.2"; MANAGEMENT_PORT=1026; METRICS_PORT=1027

while [[ "$#" -gt 0 ]]; do
    case $1 in
        --model-path) MODEL_WEIGHT_PATH="$2"; shift ;;
        --model-name) MODEL_NAME="$2"; shift ;;
        --max-seq-len) MAX_SEQ_LEN="$2"; shift ;;
        --max-iter-times) MAX_ITER_TIMES="$2"; shift ;;
        --max-input-token-len) MAX_INPUT_TOKEN_LEN="$2"; shift ;;
        --max-prefill-tokens) MAX_PREFILL_TOKENS="$2"; shift ;;
        --world-size) WORLD_SIZE="$2"; shift ;;
        --ip) IP_ADDRESS="$2"; shift ;;
        --port) PORT="$2"; shift ;;
        *) echo "Unknown parameter: $1"; exit 1 ;;
    esac
    shift
done
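
Turning on strict mode can change how two lines of this script behave, so the suggestion is worth testing rather than committing blindly. The sketch below isolates both cases in hypothetical helpers (`prepend_pythonpath`, `read_total_count`) that are not part of the reviewed script. Whether the sourced Ascend `set_env.sh` files tolerate `set -u` is not established here and would need to be checked on a real image.

```shell
# Pitfall 1: under `set -u`, expanding an unset PYTHONPATH aborts the script.
# ${PYTHONPATH:-} expands to an empty string instead of failing.
prepend_pythonpath() (
  set -euo pipefail
  printf '%s\n' "$1:${PYTHONPATH:-}"
)

# Pitfall 2: with `set -e` and `pipefail`, a grep that matches nothing fails
# the whole pipeline, so the script would exit before its friendly `-z` check.
# `|| true` keeps the pipeline status at zero and leaves the result empty.
# Reads the text that `npu-smi info -l` would print from stdin.
read_total_count() (
  set -euo pipefail
  count=$({ grep "Total Count" || true; } | awk -F ':' '{print $2}' | xargs)
  printf '%s\n' "$count"
)

# Usage inside the script (not executed here):
#   total_count=$(npu-smi info -l | read_total_count)
#   export PYTHONPATH=$(prepend_pythonpath "$MF_SCRIPTS_ROOT/../")
```

Each helper runs in a subshell, so the strict-mode options do not leak into the calling shell.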

cloudflare-workers-and-pages · 2026-06-18T07:01:54Z

Deploying alauda-ai with Cloudflare Pages

Latest commit:	`46b1d6d`
Status:	✅ Deploy successful!
Preview URL:	https://a16cec19.alauda-ai.pages.dev
Branch Preview URL:	https://docs-inference-guide-validat.alauda-ai.pages.dev

View logs

… +3 models

- Host all YAML assets + HTML reports under docs/public/ so customers download from the docs site (site-absolute /inference_guide/... links), not GitHub.
- Show the complete benchmark data: full 22-column open-loop sweeps (rate 1-9 x 4 workloads x both engines x TP, TTFT/E2E/ITL/TPS at p90/p95/p99/mean) in collapsible <details>, plus the rendered HTML reports as downloadable artifacts. Tables generated faithfully from the source reports (no hand-transcription).
- Add three more validated models (5 total):
  - DeepSeek-R1-Distill-Llama-8B (dense, mature Llama path anchor)
  - DeepSeek-R1-Distill-Llama-70B (dense, TP=8; accuracy openllm 6-task mean 0.722)
  - GLM-5.1-W4A8 (MoE, W4A8 quantized, TP=8; Partner-Guide chatbot sweep)

  Each with a namespace-scoped ServingRuntime + InferenceService asset.
- Add domain terms to the cspell dictionary.

Lint and build pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Move the YAML assets and HTML reports from docs/public/ back under docs/en/inference_guide/{assets,reports}/ and link them via GitHub (tree/raw URLs for YAML, blob URL for reports) — matching the existing training_guides/training-runtimes convention. Reverts the docs-site public-hosting approach. Lint and build pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- Remove the copied model-auto HTML benchmark reports (and their links); do not ship them in our docs.
- Keep all benchmark *results* (saturation-capacity tables, rate-1 snapshots, the full 22-column open-loop sweeps inline, accuracy table, GLM chatbot table) but remove the *analysis*: Tuning notes / Insights sections, the "Picking an engine" recommendations, and interpretive prose / "recommended" labels. Pages now present verified facts, configs, and data only.

Lint and build pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…Qwen3-30B-A3B

Apply the rate=1 chatbot ITL P90 ≈ 30ms SLO. Only Qwen3-30B-A3B (MindIE TP=2, ITL P90 30.8ms / mean 29.0) meets it; remove the models that do not:

- Qwen3-14B (44.6ms), DeepSeek-R1-Distill-Llama-8B (~38ms), DeepSeek-R1-Distill-Llama-70B (56ms), GLM-5.1-W4A8 (218ms): pages + assets.

Add the SLO-compliant MindIE TP=2 asset (the TP=4 asset is 39.8ms, over SLO) and lead the deploy section with it. Trim the index runtime catalog and analysis text left over from the removed models.

Lint and build pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
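
The SLO filter in this commit message can be reproduced as a small sketch. The ITL P90 values are the ones quoted in the message (the 8B figure is only given as approximately 38 ms, and the TP=4 row is assumed to be the MindIE TP=4 asset). The message says "≈ 30ms" and accepts 30.8 ms, so the 31 ms cutoff in the usage line is an assumption chosen to let that value pass, not a documented limit. The helper names are hypothetical.

```shell
# Tab-separated rows: configuration, ITL P90 in ms at rate=1 (chatbot workload).
itl_p90_table() {
  printf '%s\t%s\n' \
    'Qwen3-30B-A3B (MindIE TP=2)' 30.8 \
    'Qwen3-30B-A3B (MindIE TP=4)' 39.8 \
    'Qwen3-14B' 44.6 \
    'DeepSeek-R1-Distill-Llama-8B' 38 \
    'DeepSeek-R1-Distill-Llama-70B' 56 \
    'GLM-5.1-W4A8' 218
}

# Print the configurations whose ITL P90 is at or below the cutoff (ms).
meets_slo() {
  awk -F '\t' -v max="$1" '$2 + 0 <= max + 0 { print $1 }'
}

# Usage: itl_p90_table | meets_slo 31
```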

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

docs/en/inference_guide/qwen3-30b-a3b.mdx (1)
29-31: ⚡ Quick win

Clarify TP=2 availability for vLLM deployment assets.

The validation matrix states vLLM TP=2/TP=4, but the deploy table links only vLLM TP=4. Add a one-line note clarifying whether TP=2 is benchmark-only or provide the TP=2 asset link to avoid reader confusion.

Also applies to: 44-47
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/en/inference_guide/qwen3-30b-a3b.mdx` around lines 29 - 31, The
validation matrix for vLLM-Ascend indicates support for both TP=2 and TP=4
configurations, but the corresponding deployment table link only references
TP=4, creating ambiguity about TP=2 availability. Add a one-line clarifying note
in or near the vLLM-Ascend row entries that explicitly states whether TP=2 is
benchmark-only or provide the actual deployment asset link for TP=2 to resolve
the discrepancy. Apply the same clarification to the other affected rows
mentioned in the "Also applies to" section (lines 44-47).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/en/inference_guide/assets/qwen3-30b-a3b/qwen3-30b-a3b-mindie-tp2.yaml`:
- Around line 65-69: The validation for the total_count variable only checks if
it is empty using the -z test, but does not verify that it is a positive
integer. If total_count is zero or contains non-numeric characters, the device
ID generation logic downstream will produce invalid topology configurations.
Enhance the validation condition to check not only that total_count is non-empty
but also that it contains only digits and is greater than zero, rejecting any
non-numeric or zero values with an appropriate error message before the value is
used in device ID generation.
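
A minimal sketch of the stricter check the comment asks for is shown below. `is_positive_int` is a hypothetical helper name. It rejects empty strings, non-digits, zero, and values with a leading zero, which is slightly stricter than the comment requires.

```shell
# Succeed only for a non-empty string of digits that does not start with 0.
is_positive_int() {
  case "${1:-}" in
    ''|*[!0-9]*|0*) return 1 ;;
    *) return 0 ;;
  esac
}

# Usage inside the startup script (not executed here):
#   if ! is_positive_int "$total_count"; then
#     echo "Error: invalid device count '$total_count' from npu-smi."
#     exit 1
#   fi
```

A `case` pattern keeps the check portable across POSIX shells; in bash, `[[ "$total_count" =~ ^[1-9][0-9]*$ ]]` is an equivalent one-liner.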

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ad4c58bd-11d7-4bac-804b-ff593ac0fe27

📥 Commits

Reviewing files that changed from the base of the PR and between 5aab0b6 and 46b1d6d.

📒 Files selected for processing (6)

.cspell/terms.txt
docs/en/inference_guide/assets/qwen3-30b-a3b/qwen3-30b-a3b-mindie-tp2.yaml
docs/en/inference_guide/assets/qwen3-30b-a3b/qwen3-30b-a3b-mindie-tp4.yaml
docs/en/inference_guide/assets/qwen3-30b-a3b/qwen3-30b-a3b-vllm-ascend-tp4.yaml
docs/en/inference_guide/index.mdx
docs/en/inference_guide/qwen3-30b-a3b.mdx

✅ Files skipped from review due to trivial changes (1)

.cspell/terms.txt

🚧 Files skipped from review as they are similar to previous changes (2)

docs/en/inference_guide/assets/qwen3-30b-a3b/qwen3-30b-a3b-mindie-tp4.yaml
docs/en/inference_guide/assets/qwen3-30b-a3b/qwen3-30b-a3b-vllm-ascend-tp4.yaml

Replace the validated model with Qwen3.6-27B (qwen3_5 GDN hybrid, W8A8) on Ascend 910B4. Document only the validated TP=2 x 4-replica (8-card) topology, with a self-contained vLLM-Ascend nightly ServingRuntime + InferenceService. Benchmarks show rate=1 for all three workloads (chat/code/RAG), comparing the direct predictor Service vs the Envoy AI Gateway ingress. The RPS column is hidden and a concurrency column (Little's law: achieved RPS x mean E2E) is added. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
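
The concurrency column described here follows Little's law: mean requests in flight equal the achieved request rate multiplied by the mean end-to-end latency. A minimal sketch, assuming the rate is in requests per second and the latency in seconds; `concurrency` is a hypothetical helper and the numbers in the usage line are illustrative, not benchmark results.

```shell
# Little's law: mean in-flight requests = achieved RPS x mean E2E latency (s).
concurrency() {
  awk -v rps="$1" -v e2e="$2" 'BEGIN { printf "%.2f\n", rps * e2e }'
}

# Usage with made-up inputs: 0.5 requests/s at a mean E2E latency of 8 s
#   concurrency 0.5 8
```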

coderabbitai

♻️ Duplicate comments (1)

docs/en/inference_guide/index.mdx (1)

74-74: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Apply the edited local manifest, not the remote URL.

Line 74 applies the remote file directly, which can bypass local edits (namespace/image/storageUri) and deploy unintended defaults.

Suggested fix

 base=https://raw.githubusercontent.com/alauda/aml-docs/master/docs/en/inference_guide/assets
+cfg=./qwen3-6-27b-w8a8-vllm-ascend-tp2x4.yaml
+
+# 1. Download and edit locally.
+curl -fsSL "$base/qwen3-6-27b-w8a8/qwen3-6-27b-w8a8-vllm-ascend-tp2x4.yaml" -o "$cfg"
+#    Edit namespace/image/storageUri in "$cfg"
 
 # 2. Apply the ServingRuntime + InferenceService.
-kubectl apply -f $base/qwen3-6-27b-w8a8/qwen3-6-27b-w8a8-vllm-ascend-tp2x4.yaml
+kubectl apply -f "$cfg"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/en/inference_guide/index.mdx` at line 74, The kubectl apply command on
line 74 is applying the manifest directly from a remote URL, which bypasses any
local edits made to the file (such as namespace, image, or storageUri changes).
Instead of using the remote file path in the kubectl apply command, modify it to
reference a locally edited copy of the qwen3-6-27b-w8a8-vllm-ascend-tp2x4.yaml
manifest file. This ensures that your local customizations are applied when
deploying the resource.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 229049e5-1449-465d-ae4e-9a5e054bb62e

📥 Commits

Reviewing files that changed from the base of the PR and between 46b1d6d and b4d1e16.

📒 Files selected for processing (3)

docs/en/inference_guide/assets/qwen3-6-27b-w8a8/qwen3-6-27b-w8a8-vllm-ascend-tp2x4.yaml
docs/en/inference_guide/index.mdx
docs/en/inference_guide/qwen3-6-27b-w8a8.mdx

✅ Files skipped from review due to trivial changes (1)

docs/en/inference_guide/qwen3-6-27b-w8a8.mdx

coderabbitai Bot reviewed Jun 18, 2026

zgsu and others added 4 commits June 18, 2026 07:30

Comment thread docs/en/inference_guide/assets/qwen3-30b-a3b/qwen3-30b-a3b-mindie-tp2.yaml Outdated

coderabbitai Bot reviewed Jun 22, 2026

EdisonSu768 wants to merge 6 commits into master from docs/inference-guide-validated-models
