diff --git a/mkdocs/docs/concepts/services.md b/mkdocs/docs/concepts/services.md
index 757546483..000ad7de8 100644
--- a/mkdocs/docs/concepts/services.md
+++ b/mkdocs/docs/concepts/services.md
@@ -420,7 +420,10 @@ Below is an example for running `zai-org/GLM-4.5-Air-FP8` on `H200`:
- > With the `sglang` router, you can use SGLang prefill and decode workers. Support for vLLM and TensorRT-LLM workers is coming soon.
+ > SMG workers connect to the router over HTTP or gRPC. The example above uses HTTP. SGLang workers support both modes; vLLM workers support gRPC only.
+
+ ??? info "gRPC mode"
+ Over gRPC, workers run from SMG images that bundle a specific backend version (SGLang or vLLM), and `smg launch` needs `--enable-igw` and `--model-path` so the router can register the workers. See the full configurations in [SGLang PD disaggregation](../examples/inference/sglang.md#pd-disaggregation) and [vLLM PD disaggregation](../examples/inference/vllm.md#pd-disaggregation).
=== "Dynamo"
diff --git a/mkdocs/docs/examples/inference/sglang.md b/mkdocs/docs/examples/inference/sglang.md
index 1ea9e6e06..5bf25ed5d 100644
--- a/mkdocs/docs/examples/inference/sglang.md
+++ b/mkdocs/docs/examples/inference/sglang.md
@@ -211,7 +211,80 @@ To run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/
- > With the `sglang` router, you can use SGLang prefill and decode workers. Support for vLLM and TensorRT-LLM workers is coming soon.
+ ??? info "gRPC mode"
+
+ SGLang workers can also connect to the SMG router over gRPC. Run the workers from an SMG image that bundles a specific SGLang version, pass `--grpc-mode` to `sglang.launch_server`, and add `--enable-igw` and `--model-path` to `smg launch` so the router can register the workers.
+
+
+
+ ```yaml
+ type: service
+ name: prefill-decode
+
+ env:
+ - HF_TOKEN
+ - MODEL_ID=zai-org/GLM-4.5-Air-FP8
+
+ replicas:
+ - count: 1
+ # For now, the replica group with the router must have count: 1
+ python: "3.12"
+ commands:
+ - pip install smg
+ - |
+ smg launch \
+ --enable-igw \
+ --pd-disaggregation \
+ --model-path $MODEL_ID \
+ --host 0.0.0.0 \
+ --port 8000 \
+ --prefill-policy cache_aware
+ router:
+ type: sglang
+ resources:
+ cpu: 4
+
+ - count: 1..4
+ scaling:
+ metric: rps
+ target: 3
+ image: ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10
+ commands:
+ - |
+ python3 -m sglang.launch_server \
+ --model-path $MODEL_ID \
+ --host 0.0.0.0 \
+ --port 8000 \
+ --grpc-mode \
+ --disaggregation-mode prefill \
+ --disaggregation-transfer-backend nixl \
+ --disaggregation-bootstrap-port 8998
+ resources:
+ gpu: H200
+
+ - count: 1..8
+ scaling:
+ metric: rps
+ target: 2
+ image: ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10
+ commands:
+ - |
+ python3 -m sglang.launch_server \
+ --model-path $MODEL_ID \
+ --host 0.0.0.0 \
+ --port 8000 \
+ --grpc-mode \
+ --disaggregation-mode decode \
+ --disaggregation-transfer-backend nixl
+ resources:
+ gpu: H200
+
+ port: 8000
+ ```
+
+
+
+ To use the [Mooncake](https://github.com/kvcache-ai/Mooncake) transfer backend, set `--disaggregation-transfer-backend mooncake`.
=== "AMD"
diff --git a/mkdocs/docs/examples/inference/vllm.md b/mkdocs/docs/examples/inference/vllm.md
index dd6909ba6..fe1575ed8 100644
--- a/mkdocs/docs/examples/inference/vllm.md
+++ b/mkdocs/docs/examples/inference/vllm.md
@@ -124,6 +124,84 @@ curl http://127.0.0.1:3000/proxy/services/main/qwen36/v1/chat/completions \
> If a [gateway](../../concepts/gateways.md) is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at `https://qwen36./`.
+## Configuration options
+
+### PD disaggregation
+
+To run vLLM with [PD disaggregation](https://docs.vllm.ai/en/latest/serving/disagg_prefill.html), use replica groups: one for [Shepherd Model Gateway (SMG)](https://docs.sglang.io/advanced_features/sgl_model_gateway.html), one for prefill workers, and one for decode workers.
+
+
+
+```yaml
+type: service
+name: prefill-decode
+
+env:
+  - HF_TOKEN
+  - MODEL_ID=zai-org/GLM-4.5-Air-FP8
+
+replicas:
+  - count: 1
+    python: "3.12"
+    commands:
+      - pip install smg
+      - |
+        smg launch \
+          --pd-disaggregation \
+          --model-path $MODEL_ID \
+          --enable-igw \
+          --host 0.0.0.0 \
+          --port 8000 \
+          --prefill-policy cache_aware
+    router:
+      type: sglang
+    resources:
+      cpu: 4
+
+  - count: 1..4
+    scaling:
+      metric: rps
+      target: 3
+    image: ghcr.io/lightseekorg/smg:1.4.1-vllm-v0.18.0
+    commands:
+      - |
+        python3 -m vllm.entrypoints.grpc_server \
+          --model "$MODEL_ID" \
+          --host 0.0.0.0 \
+          --port 8000 \
+          --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'
+    resources:
+      gpu: H200
+
+  - count: 1..8
+    scaling:
+      metric: rps
+      target: 2
+    image: ghcr.io/lightseekorg/smg:1.4.1-vllm-v0.18.0
+    commands:
+      - |
+        python3 -m vllm.entrypoints.grpc_server \
+          --model "$MODEL_ID" \
+          --host 0.0.0.0 \
+          --port 8000 \
+          --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
+    resources:
+      gpu: H200
+
+port: 8000
+```
+
+
+
+> To use the [Mooncake](https://github.com/kvcache-ai/Mooncake) transfer backend, set `"kv_connector": "MooncakeConnector"` in `--kv-transfer-config`.
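+
+As an illustration, the prefill worker's `commands` from the configuration above would then look roughly like the sketch below. Only `kv_connector` changes; the decode worker keeps `"kv_role":"kv_consumer"`. Mooncake may need additional settings that are not shown here.
+
+```yaml
+commands:
+  - |
+    # Same prefill command as above; only kv_connector differs
+    python3 -m vllm.entrypoints.grpc_server \
+      --model "$MODEL_ID" \
+      --host 0.0.0.0 \
+      --port 8000 \
+      --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'
+```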
+
+Currently, auto-scaling supports only `rps` (requests per second) as the metric. Support for time-to-first-token (TTFT) and inter-token latency (ITL) metrics is coming soon.
+
+!!! info "Cluster"
+    PD disaggregation requires the service to run in a fleet with `placement` set to `cluster`, because the replicas require an interconnect between instances.
+
+    While the prefill and decode replicas run on GPUs, the router replica requires a CPU instance in the same cluster.
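+
+A minimal sketch of a fleet configuration with cluster placement is shown below. The fleet name `pd-cluster` and the node count are placeholders chosen for illustration, and resource requirements are omitted; how CPU and GPU instances are combined in one cluster depends on the backend.
+
+```yaml
+type: fleet
+# Hypothetical fleet name
+name: pd-cluster
+
+# Illustrative node count; size it to fit the router, prefill, and decode replicas
+nodes: 3
+# Provision the instances as an interconnected cluster
+placement: cluster
+```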
+
## What's next?
1. Read about [services](../../concepts/services.md) and [gateways](../../concepts/gateways.md)