
docs: add inference_guide with validated 7B+ models (Ascend NPU) #268

Open
EdisonSu768 wants to merge 6 commits into master from docs/inference-guide-validated-models

Conversation

EdisonSu768 commented Jun 18, 2026

Member

What

New docs/en/inference_guide/ section documenting already-validated open-weight LLM inference, mirroring the structure of training_guides/training-runtimes. All content is derived from verified deployments + benchmarks (no fabricated numbers).

Contents

Two validated models above the 7B/8B class:

  • Qwen3-14B (dense, BF16) — recommended engine vLLM-Ascend
  • Qwen3-30B-A3B (MoE, BF16) — recommended engine MindIE

Per the four asks:

  • Models (2, ≥7B/8B) ✅
  • Runtime images (GPU/NPU) — NPU vLLM-Ascend + MindIE catalog, with an NVIDIA GPU note ✅
  • Runtime YAML examples — namespace-scoped ServingRuntime + InferenceService (not ClusterServingRuntime), one per model/engine/TP combination ✅
  • Model + runtime + image benchmark results — measured guidellm open-loop per-replica numbers (4 workloads × rate 1–9), with the dense→vLLM / MoE→MindIE engine-selection finding ✅

Files

docs/en/inference_guide/
├── index.mdx                         # overview, runtime-image catalog, engine selection, methodology, deploy steps
├── qwen3-14b.mdx                     # dense model card + benchmark tables
├── qwen3-30b-a3b.mdx                 # MoE model card + benchmark tables
└── assets/
    ├── qwen3-14b/qwen3-14b-vllm-ascend-tp1.yaml
    ├── qwen3-14b/qwen3-14b-vllm-ascend-tp2.yaml
    ├── qwen3-30b-a3b/qwen3-30b-a3b-vllm-ascend-tp4.yaml
    └── qwen3-30b-a3b/qwen3-30b-a3b-mindie-tp4.yaml

Scope note

Every validated 7B+ model currently runs on Ascend 910B NPU; there is no verified GPU benchmark at this size yet (only Qwen3.5-0.8B on A30). The doc therefore leads with NPU and lists the NVIDIA GPU runtime as the platform default with that gap flagged, rather than inventing GPU numbers.

Verification

  • yarn lint → 0 errors / 0 warnings
  • yarn build → all 3 pages render
  • All 4 asset YAMLs parse as valid multi-doc YAML

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Documentation
    • Updated inference guide with validated Qwen3.6-27B (W8A8) deployment configuration on Huawei Ascend 910B4.
    • Added new deployment guide with benchmark methodology and performance results.
    • Updated runtime image information and deployment examples.
    • Enhanced guidance for multi-card tensor parallelism setup.

New `inference_guide/` section documenting already-validated open-weight
LLM inference, mirroring the structure of `training_guides/training-runtimes`:

- Two validated models above the 7B/8B class:
  - Qwen3-14B (dense, BF16) — recommended engine vLLM-Ascend
  - Qwen3-30B-A3B (MoE, BF16) — recommended engine MindIE
- Runtime images catalog (NPU vLLM-Ascend + MindIE; NVIDIA GPU note)
- Per-model namespace-scoped `ServingRuntime` + `InferenceService` assets
  (not ClusterServingRuntime), one per engine/TP combination
- Measured open-loop per-replica benchmark tables (guidellm, 4 workloads,
  rate 1-9) with the dense→vLLM / MoE→MindIE engine-selection finding

Lint and build pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 18, 2026


Review Change Stack

Walkthrough

The inference guide is updated from Qwen3-30B-A3B to Qwen3.6-27B (W8A8) on Huawei Ascend 910B4. A new KServe YAML asset defines the ServingRuntime and InferenceService with TP=2 × 4 replicas. A new model-specific MDX page and updates to the guide index cover validated hardware, runtime image tags, benchmark workloads, deployment instructions, and caveats.

Changes

Inference Guide: Qwen3.6-27B (W8A8) on Ascend 910B4

| Layer / File(s) | Summary |
| --- | --- |
| ServingRuntime + InferenceService YAML: `docs/en/inference_guide/assets/qwen3-6-27b-w8a8/qwen3-6-27b-w8a8-vllm-ascend-tp2x4.yaml` | Adds the full KServe deployment manifest: a ServingRuntime wrapping vllm serve with a bash entrypoint that sources Ascend env scripts and derives MODEL_PATH from annotations, plus an InferenceService with 4 replicas (TP=2 each), PVC-backed weights, W8A8 quantization, no-prefix-caching, speculative decoding, CUDAGraph FULL_DECODE_ONLY, and Ascend910 resource requests/limits. |
| Inference guide index page: `docs/en/inference_guide/index.mdx` | Rewrites the page intro, replaces the validated models table with Qwen3.6-27B (W8A8) on Ascend 910B4 ×8, updates runtime image tags with CANN-match and MindIE qwen3_5 support notes, redefines benchmark workloads to three token-size pairs, updates the deploy walkthrough and curl example to the new model/asset path, and revises caveats to Ascend 910B4 resource keys. |
| Qwen3.6-27B (W8A8) model page: `docs/en/inference_guide/qwen3-6-27b-w8a8.mdx` | New model documentation page with identity metadata (architecture, parameter counts, BF16 vs W8A8, HF/ModelScope links), hardware × stack compatibility table (vLLM-Ascend nightly supported, MindIE unsupported), deploy section with TP=2 × 4 topology and max-num-seqs 32 guardrail, and benchmark results tables for Chat/Code/RAG workloads comparing Direct vs Gateway ingress. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • alauda/aml-docs#234: Shares documentation of Modelcar permission modes and multi-card TP>1 HCCL initialization requirements for vLLM-Ascend on Huawei Ascend 910B4, both of which are referenced in the updated Caveats section.

Poem

🐇 A new model hops into the guide,
W8A8 weights, eight Ascend cards wide,
TP=2 times four — replicas in a row,
vllm serve starts with an Ascend env flow,
Benchmarks logged, the gateway path too,
Qwen3.6-27B, validated and true! 🌸

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding inference guide documentation with validated models (Qwen3.6-27B W8A8) for Ascend NPU deployment. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/en/inference_guide/assets/qwen3-30b-a3b/qwen3-30b-a3b-mindie-tp4.yaml`:
- Around line 60-107: The bash script in the diff lacks strict error handling
mode, which means failed commands like `source` will be silently ignored and the
script will continue executing with a potentially incomplete configuration. Add
strict mode directives (set -e and optionally set -u and set -o pipefail)
immediately after the shebang and before the help() function definition to
ensure the script fails immediately if any command fails, preventing
mindieservice_daemon from starting with partial or broken configuration.

In `@docs/en/inference_guide/index.mdx`:
- Around line 78-85: The documentation instructs users to edit the manifest file
but then applies the remote URL directly using kubectl apply, which bypasses any
local edits. This means the user's changes to metadata.namespace, image tags,
and storageUri are ignored, leaving the deployment with unintended defaults. To
fix this, modify the instructions to first download the remote YAML file to a
local location using curl or wget (storing it in a variable or file), then edit
that local file, and finally apply the local file path instead of the remote URL
in the kubectl apply command.

In `@docs/en/inference_guide/qwen3-14b.mdx`:
- Around line 43-47: The bash code snippet includes a comment stating to "edit
namespace / image tag / storageUri first" but then immediately applies the
remote file directly without demonstrating any editing step, creating a mismatch
between the instructions and the actual command. Either modify the bash commands
to show how to download the file first (using curl or wget), edit it locally,
and then apply the local copy, or update the introductory comment to accurately
reflect that the remote file is being applied directly without local
modifications.

In `@docs/en/inference_guide/qwen3-30b-a3b.mdx`:
- Around line 49-53: The bash snippet instructs users to edit namespace, image
tag, and storageUri values before applying, but then immediately applies from a
remote URL without incorporating those edits. Restructure the snippet to
download the manifest file first using curl or wget into a local variable, then
apply the local file after editing. Alternatively, show how to apply the remote
URL with kubectl set or sed to inject the edited values, ensuring the documented
edit steps actually take effect when kubectl apply is executed.
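One way to make the documented edits take effect is to patch a local copy of the manifest before applying it. The sketch below is illustrative only: the helper name `patch_manifest`, the target namespace `my-namespace`, and the assumption that the manifest uses plain `namespace: <value>` lines are hypothetical and have not been checked against the PR's assets.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite every `namespace:` value in a local manifest copy.
# Assumes simple `namespace: <value>` lines (not verified against the PR assets).
patch_manifest() {
  local file="$1" namespace="$2"
  local tmp
  # Write to a temp file first so a failed sed never truncates the manifest.
  tmp="$(mktemp)" || return 1
  if sed -E "s|^([[:space:]]*namespace:[[:space:]]*).*|\1${namespace}|" "$file" > "$tmp"; then
    cat "$tmp" > "$file"   # keeps the original file's permissions
    rm -f "$tmp"
  else
    rm -f "$tmp"
    return 1
  fi
}

# Intended flow (commented out because it needs network and cluster access):
#   curl -fsSL "$base/qwen3-14b/qwen3-14b-vllm-ascend-tp1.yaml" -o manifest.yaml
#   patch_manifest manifest.yaml my-namespace
#   kubectl apply -f manifest.yaml
```

Image tags and `storageUri` would need their own substitutions or a manual edit; the point is only that `kubectl apply` receives the edited local file rather than the remote URL.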

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ded490a7-b19e-4d10-bb06-8121804fb4c9

📥 Commits

Reviewing files that changed from the base of the PR and between 5cf3cff and 5aab0b6.

📒 Files selected for processing (7)
  • docs/en/inference_guide/assets/qwen3-14b/qwen3-14b-vllm-ascend-tp1.yaml
  • docs/en/inference_guide/assets/qwen3-14b/qwen3-14b-vllm-ascend-tp2.yaml
  • docs/en/inference_guide/assets/qwen3-30b-a3b/qwen3-30b-a3b-mindie-tp4.yaml
  • docs/en/inference_guide/assets/qwen3-30b-a3b/qwen3-30b-a3b-vllm-ascend-tp4.yaml
  • docs/en/inference_guide/index.mdx
  • docs/en/inference_guide/qwen3-14b.mdx
  • docs/en/inference_guide/qwen3-30b-a3b.mdx

Comment on lines +60 to +107
#!/bin/bash
# run_mindie.sh — start MindIE Service for a given model.
# Required: --model-name, --model-path. Optional: --ip, --max-seq-len,
# --max-iter-times, --world-size, ... (run with --help for the full list).
help() { awk -F'### ' '/^###/ { print $2 }' "$0"; }
if [[ $# == 0 ]] || [[ "$1" == "--help" ]]; then help; exit 1; fi

total_count=$(npu-smi info -l | grep "Total Count" | awk -F ':' '{print $2}' | xargs)
if [[ -z "$total_count" ]]; then
  echo "Error: unable to read device info (npu-smi). Check permissions/devices."
  exit 1
fi
echo "$total_count device(s) detected!"

echo "Setting toolkit envs..."
source /usr/local/Ascend/ascend-toolkit/set_env.sh
echo "Setting MindIE envs..."
source /usr/local/Ascend/mindie/set_env.sh

MF_SCRIPTS_ROOT=$(realpath "$(dirname "$0")")
export PYTHONPATH=$MF_SCRIPTS_ROOT/../:$PYTHONPATH

export MIES_INSTALL_PATH=/usr/local/Ascend/mindie/latest/mindie-service
CONFIG_FILE=${MIES_INSTALL_PATH}/conf/config.json

# defaults
BACKEND_TYPE="atb"; MAX_SEQ_LEN=16384; MAX_PREFILL_TOKENS=16384
MAX_ITER_TIMES=1536; MAX_INPUT_TOKEN_LEN=12288; TRUNCATION=false
HTTPS_ENABLED=false; MULTI_NODES_INFER_ENABLED=false; NPU_MEM_SIZE=-1
MAX_PREFILL_BATCH_SIZE=50; TEMPLATE_TYPE="Standard"; MAX_PREEMPT_COUNT=0
SUPPORT_SELECT_BATCH=false; IP_ADDRESS="0.0.0.0"; PORT=8080
MANAGEMENT_IP_ADDRESS="127.0.0.2"; MANAGEMENT_PORT=1026; METRICS_PORT=1027

while [[ "$#" -gt 0 ]]; do
  case $1 in
    --model-path) MODEL_WEIGHT_PATH="$2"; shift ;;
    --model-name) MODEL_NAME="$2"; shift ;;
    --max-seq-len) MAX_SEQ_LEN="$2"; shift ;;
    --max-iter-times) MAX_ITER_TIMES="$2"; shift ;;
    --max-input-token-len) MAX_INPUT_TOKEN_LEN="$2"; shift ;;
    --max-prefill-tokens) MAX_PREFILL_TOKENS="$2"; shift ;;
    --world-size) WORLD_SIZE="$2"; shift ;;
    --ip) IP_ADDRESS="$2"; shift ;;
    --port) PORT="$2"; shift ;;
    *) echo "Unknown parameter: $1"; exit 1 ;;
  esac
  shift
done


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Startup script should fail fast on command errors.

Without strict mode, failed source/chmod/sed steps can be ignored and mindieservice_daemon may start with partial config.

🔧 Suggested fix
             #!/bin/bash
+            set -euo pipefail
             # run_mindie.sh — start MindIE Service for a given model.

Suggested change
#!/bin/bash
set -euo pipefail
# run_mindie.sh — start MindIE Service for a given model.
# Required: --model-name, --model-path. Optional: --ip, --max-seq-len,
# --max-iter-times, --world-size, ... (run with --help for the full list).
help() { awk -F'### ' '/^###/ { print $2 }' "$0"; }
if [[ $# == 0 ]] || [[ "$1" == "--help" ]]; then help; exit 1; fi
total_count=$(npu-smi info -l | grep "Total Count" | awk -F ':' '{print $2}' | xargs)
if [[ -z "$total_count" ]]; then
  echo "Error: unable to read device info (npu-smi). Check permissions/devices."
  exit 1
fi
echo "$total_count device(s) detected!"
echo "Setting toolkit envs..."
source /usr/local/Ascend/ascend-toolkit/set_env.sh
echo "Setting MindIE envs..."
source /usr/local/Ascend/mindie/set_env.sh
MF_SCRIPTS_ROOT=$(realpath "$(dirname "$0")")
export PYTHONPATH=$MF_SCRIPTS_ROOT/../:$PYTHONPATH
export MIES_INSTALL_PATH=/usr/local/Ascend/mindie/latest/mindie-service
CONFIG_FILE=${MIES_INSTALL_PATH}/conf/config.json
# defaults
BACKEND_TYPE="atb"; MAX_SEQ_LEN=16384; MAX_PREFILL_TOKENS=16384
MAX_ITER_TIMES=1536; MAX_INPUT_TOKEN_LEN=12288; TRUNCATION=false
HTTPS_ENABLED=false; MULTI_NODES_INFER_ENABLED=false; NPU_MEM_SIZE=-1
MAX_PREFILL_BATCH_SIZE=50; TEMPLATE_TYPE="Standard"; MAX_PREEMPT_COUNT=0
SUPPORT_SELECT_BATCH=false; IP_ADDRESS="0.0.0.0"; PORT=8080
MANAGEMENT_IP_ADDRESS="127.0.0.2"; MANAGEMENT_PORT=1026; METRICS_PORT=1027
while [[ "$#" -gt 0 ]]; do
  case $1 in
    --model-path) MODEL_WEIGHT_PATH="$2"; shift ;;
    --model-name) MODEL_NAME="$2"; shift ;;
    --max-seq-len) MAX_SEQ_LEN="$2"; shift ;;
    --max-iter-times) MAX_ITER_TIMES="$2"; shift ;;
    --max-input-token-len) MAX_INPUT_TOKEN_LEN="$2"; shift ;;
    --max-prefill-tokens) MAX_PREFILL_TOKENS="$2"; shift ;;
    --world-size) WORLD_SIZE="$2"; shift ;;
    --ip) IP_ADDRESS="$2"; shift ;;
    --port) PORT="$2"; shift ;;
    *) echo "Unknown parameter: $1"; exit 1 ;;
  esac
  shift
done
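One caveat when adopting `set -euo pipefail` in a script like this: with `-u`, expanding an unset variable is an error, so a line such as `export PYTHONPATH=$MF_SCRIPTS_ROOT/../:$PYTHONPATH` aborts if `PYTHONPATH` is not already set. Whether that is the case in the MindIE image is not established in this thread, and sourced env scripts can be affected in the same way if they expand unset variables. The sketch below only demonstrates the general shell behaviour and a common guard; the helper name `prepend_pythonpath` and the paths are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: prepend a directory to PYTHONPATH without tripping `set -u`.
# `${PYTHONPATH:-}` expands to an empty string when the variable is unset.
prepend_pythonpath() {
  local dir="$1"
  local current="${PYTHONPATH:-}"
  if [[ -n "$current" ]]; then
    export PYTHONPATH="${dir}:${current}"
  else
    export PYTHONPATH="${dir}"
  fi
}

unset PYTHONPATH
prepend_pythonpath /opt/scripts   # placeholder path, not from the PR
echo "$PYTHONPATH"
prepend_pythonpath /opt/extra     # placeholder path, not from the PR
echo "$PYTHONPATH"
```

The guard also avoids leaving a trailing `:` in `PYTHONPATH`, which Python would otherwise treat as an extra (current-directory) search path entry.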

Comment thread docs/en/inference_guide/index.mdx
Comment thread docs/en/inference_guide/qwen3-14b.mdx Outdated
Comment thread docs/en/inference_guide/qwen3-30b-a3b.mdx Outdated
@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jun 18, 2026


Deploying alauda-ai with Cloudflare Pages

Latest commit: 46b1d6d
Status: ✅  Deploy successful!
Preview URL: https://a16cec19.alauda-ai.pages.dev
Branch Preview URL: https://docs-inference-guide-validat.alauda-ai.pages.dev


zgsu and others added 4 commits June 18, 2026 07:30
… +3 models

- Host all YAML assets + HTML reports under docs/public/ so customers download
  from the docs site (site-absolute /inference_guide/... links), not GitHub.
- Show the complete benchmark data: full 22-column open-loop sweeps (rate 1-9 x
  4 workloads x both engines x TP, TTFT/E2E/ITL/TPS at p90/p95/p99/mean) in
  collapsible <details>, plus the rendered HTML reports as downloadable artifacts.
  Tables generated faithfully from the source reports (no hand-transcription).
- Add three more validated models (5 total):
  - DeepSeek-R1-Distill-Llama-8B (dense, mature Llama path anchor)
  - DeepSeek-R1-Distill-Llama-70B (dense, TP=8; accuracy openllm 6-task mean 0.722)
  - GLM-5.1-W4A8 (MoE, W4A8 quantized, TP=8; Partner-Guide chatbot sweep)
  Each with a namespace-scoped ServingRuntime + InferenceService asset.
- Add domain terms to the cspell dictionary.

Lint and build pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Move the YAML assets and HTML reports from docs/public/ back under
docs/en/inference_guide/{assets,reports}/ and link them via GitHub
(tree/raw URLs for YAML, blob URL for reports) — matching the existing
training_guides/training-runtimes convention. Reverts the docs-site
public-hosting approach.

Lint and build pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- Remove the copied model-auto HTML benchmark reports (and their links) — do
  not ship them in our docs.
- Keep all benchmark *results* (saturation-capacity tables, rate-1 snapshots,
  the full 22-column open-loop sweeps inline, accuracy table, GLM chatbot
  table) but remove the *analysis*: Tuning notes / Insights sections, the
  "Picking an engine" recommendations, and interpretive prose / "recommended"
  labels. Pages now present verified facts, configs, and data only.

Lint and build pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…Qwen3-30B-A3B

Apply the rate=1 chatbot ITL P90 ≈ 30ms SLO. Only Qwen3-30B-A3B (MindIE TP=2,
ITL P90 30.8ms / mean 29.0) meets it; remove the models that do not:
- Qwen3-14B (44.6ms), DeepSeek-R1-Distill-Llama-8B (~38ms),
  DeepSeek-R1-Distill-Llama-70B (56ms), GLM-5.1-W4A8 (218ms) — pages + assets.

Add the SLO-compliant MindIE TP=2 asset (the TP=4 asset is 39.8ms, over SLO) and
lead the deploy section with it. Trim the index runtime catalog and analysis text
left over from the removed models.

Lint and build pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
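The SLO filter described in the last commit message can be sketched as a small shell function. The ITL P90 values are copied from that message (the 8B figure is only given as "~38ms"); the 31 ms cutoff is an assumption, since the message states "≈ 30ms" without a tolerance, and the function name `slo_candidates` is hypothetical.

```shell
#!/usr/bin/env bash
# Print the configurations whose rate=1 chatbot ITL P90 (ms) is at or below a cutoff.
# Values come from the commit message above; the cutoff passed in is an assumption.
slo_candidates() {
  local cutoff_ms="$1"
  awk -F',' -v cutoff="$cutoff_ms" '$2 + 0 <= cutoff + 0 { print $1 }' <<'EOF'
Qwen3-30B-A3B (MindIE TP=2),30.8
Qwen3-30B-A3B (MindIE TP=4),39.8
Qwen3-14B,44.6
DeepSeek-R1-Distill-Llama-8B,38
DeepSeek-R1-Distill-Llama-70B,56
GLM-5.1-W4A8,218
EOF
}

slo_candidates 31
```

With a strict 30 ms cutoff the function prints nothing, because 30.8 ms exceeds it; the commit's conclusion therefore depends on how much slack "≈ 30ms" allows.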

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
docs/en/inference_guide/qwen3-30b-a3b.mdx (1)

29-31: ⚡ Quick win

Clarify TP=2 availability for vLLM deployment assets.

The validation matrix states vLLM TP=2/TP=4, but the deploy table links only vLLM TP=4. Add a one-line note clarifying whether TP=2 is benchmark-only or provide the TP=2 asset link to avoid reader confusion.

Also applies to: 44-47

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/en/inference_guide/qwen3-30b-a3b.mdx` around lines 29 - 31, The
validation matrix for vLLM-Ascend indicates support for both TP=2 and TP=4
configurations, but the corresponding deployment table link only references
TP=4, creating ambiguity about TP=2 availability. Add a one-line clarifying note
in or near the vLLM-Ascend row entries that explicitly states whether TP=2 is
benchmark-only or provide the actual deployment asset link for TP=2 to resolve
the discrepancy. Apply the same clarification to the other affected rows
mentioned in the "Also applies to" section (lines 44-47).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/en/inference_guide/assets/qwen3-30b-a3b/qwen3-30b-a3b-mindie-tp2.yaml`:
- Around line 65-69: The validation for the total_count variable only checks if
it is empty using the -z test, but does not verify that it is a positive
integer. If total_count is zero or contains non-numeric characters, the device
ID generation logic downstream will produce invalid topology configurations.
Enhance the validation condition to check not only that total_count is non-empty
but also that it contains only digits and is greater than zero, rejecting any
non-numeric or zero values with an appropriate error message before the value is
used in device ID generation.
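A minimal sketch of the stricter check requested above, assuming the value arrives as a plain string. The helper name `is_positive_int` is hypothetical; the real script would call it on `total_count` before generating device IDs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the argument is a positive decimal integer.
is_positive_int() {
  local value="${1:-}"
  # Digits only with a non-zero leading digit, so "", "0", "08", "4x" and "-1" are rejected.
  [[ "$value" =~ ^[1-9][0-9]*$ ]]
}

for candidate in 8 0 "" abc; do
  if is_positive_int "$candidate"; then
    echo "'$candidate' accepted"
  else
    echo "'$candidate' rejected"
  fi
done
```

Using a regex match instead of `-gt 0` avoids an arithmetic error when the value is non-numeric.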


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ad4c58bd-11d7-4bac-804b-ff593ac0fe27

📥 Commits

Reviewing files that changed from the base of the PR and between 5aab0b6 and 46b1d6d.

📒 Files selected for processing (6)
  • .cspell/terms.txt
  • docs/en/inference_guide/assets/qwen3-30b-a3b/qwen3-30b-a3b-mindie-tp2.yaml
  • docs/en/inference_guide/assets/qwen3-30b-a3b/qwen3-30b-a3b-mindie-tp4.yaml
  • docs/en/inference_guide/assets/qwen3-30b-a3b/qwen3-30b-a3b-vllm-ascend-tp4.yaml
  • docs/en/inference_guide/index.mdx
  • docs/en/inference_guide/qwen3-30b-a3b.mdx
✅ Files skipped from review due to trivial changes (1)
  • .cspell/terms.txt
🚧 Files skipped from review as they are similar to previous changes (2)
  • docs/en/inference_guide/assets/qwen3-30b-a3b/qwen3-30b-a3b-mindie-tp4.yaml
  • docs/en/inference_guide/assets/qwen3-30b-a3b/qwen3-30b-a3b-vllm-ascend-tp4.yaml

Comment thread docs/en/inference_guide/assets/qwen3-30b-a3b/qwen3-30b-a3b-mindie-tp2.yaml Outdated

Replace the validated model with Qwen3.6-27B (qwen3_5 GDN hybrid, W8A8) on
Ascend 910B4. Document only the validated TP=2 x 4-replica (8-card) topology,
with a self-contained vLLM-Ascend nightly ServingRuntime + InferenceService.

Benchmarks show rate=1 for all three workloads (chat/code/RAG), comparing the
direct predictor Service vs the Envoy AI Gateway ingress. The RPS column is
hidden and a concurrency column (Little's law: achieved RPS x mean E2E) is
added.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
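The concurrency column mentioned in the commit message above follows Little's law: in-flight requests equal achieved RPS multiplied by mean end-to-end latency in seconds. A small sketch of that arithmetic follows; the function name `concurrency` and the input values are made up for illustration and are not measurements from this PR.

```shell
#!/usr/bin/env bash
# Little's law: concurrency = achieved RPS x mean E2E latency (seconds).
# The inputs used below are illustrative values, not benchmark results.
concurrency() {
  local achieved_rps="$1" mean_e2e_seconds="$2"
  awk -v rps="$achieved_rps" -v e2e="$mean_e2e_seconds" \
    'BEGIN { printf "%.2f\n", rps * e2e }'
}

concurrency 1.0 12.5   # prints 12.50
```

If the benchmark report gives mean E2E in milliseconds, it has to be converted to seconds first, otherwise the result is inflated by a factor of 1000.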

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
docs/en/inference_guide/index.mdx (1)

74-74: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Apply the edited local manifest, not the remote URL.

Line 74 applies the remote file directly, which can bypass local edits (namespace/image/storageUri) and deploy unintended defaults.

Suggested fix
 base=https://raw.githubusercontent.com/alauda/aml-docs/master/docs/en/inference_guide/assets
+cfg=./qwen3-6-27b-w8a8-vllm-ascend-tp2x4.yaml
+
+# 1. Download and edit locally.
+curl -fsSL "$base/qwen3-6-27b-w8a8/qwen3-6-27b-w8a8-vllm-ascend-tp2x4.yaml" -o "$cfg"
+#    Edit namespace/image/storageUri in "$cfg"
 
 # 2. Apply the ServingRuntime + InferenceService.
-kubectl apply -f $base/qwen3-6-27b-w8a8/qwen3-6-27b-w8a8-vllm-ascend-tp2x4.yaml
+kubectl apply -f "$cfg"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/en/inference_guide/index.mdx` at line 74, The kubectl apply command on
line 74 is applying the manifest directly from a remote URL, which bypasses any
local edits made to the file (such as namespace, image, or storageUri changes).
Instead of using the remote file path in the kubectl apply command, modify it to
reference a locally edited copy of the qwen3-6-27b-w8a8-vllm-ascend-tp2x4.yaml
manifest file. This ensures that your local customizations are applied when
deploying the resource.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 229049e5-1449-465d-ae4e-9a5e054bb62e

📥 Commits

Reviewing files that changed from the base of the PR and between 46b1d6d and b4d1e16.

📒 Files selected for processing (3)
  • docs/en/inference_guide/assets/qwen3-6-27b-w8a8/qwen3-6-27b-w8a8-vllm-ascend-tp2x4.yaml
  • docs/en/inference_guide/index.mdx
  • docs/en/inference_guide/qwen3-6-27b-w8a8.mdx
✅ Files skipped from review due to trivial changes (1)
  • docs/en/inference_guide/qwen3-6-27b-w8a8.mdx
