5 changes: 4 additions & 1 deletion mkdocs/docs/concepts/services.md
@@ -420,7 +420,10 @@ Below is an example for running `zai-org/GLM-4.5-Air-FP8` on `H200`:

</div>

> With the `sglang` router, you can use SGLang prefill and decode workers. Support for vLLM and TensorRT-LLM workers is coming soon.
> SMG workers connect to the router over HTTP or gRPC. The example above uses HTTP. SGLang workers support both modes; vLLM workers support gRPC only.

??? info "gRPC mode"
Over gRPC, workers run from SMG images that bundle a specific backend version (SGLang or vLLM), and `smg launch` needs `--enable-igw` and `--model-path` so the router can register the workers. See the full configurations in [SGLang PD disaggregation](../examples/inference/sglang.md#pd-disaggregation) and [vLLM PD disaggregation](../examples/inference/vllm.md#pd-disaggregation).

=== "Dynamo"

75 changes: 74 additions & 1 deletion mkdocs/docs/examples/inference/sglang.md
@@ -211,7 +211,80 @@ To run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/

</div>

> With the `sglang` router, you can use SGLang prefill and decode workers. Support for vLLM and TensorRT-LLM workers is coming soon.
??? info "gRPC mode"

SGLang workers can also connect to the SMG router over gRPC. Run the workers from an SMG image that bundles a specific SGLang version, pass `--grpc-mode` to `sglang.launch_server`, and add `--enable-igw` and `--model-path` to `smg launch` so the router can register them.

<div editor-title="pd-grpc.dstack.yml">

```yaml
type: service
name: prefill-decode

env:
  - HF_TOKEN
  - MODEL_ID=zai-org/GLM-4.5-Air-FP8

replicas:
  - count: 1
    # For now, the replica group with the router must have count: 1
    python: "3.12"
    commands:
      - pip install smg
      - |
        smg launch \
          --enable-igw \
          --pd-disaggregation \
          --model-path $MODEL_ID \
          --host 0.0.0.0 \
          --port 8000 \
          --prefill-policy cache_aware
    router:
      type: sglang
    resources:
      cpu: 4

  - count: 1..4
    scaling:
      metric: rps
      target: 3
    image: ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10
    commands:
      - |
        python3 -m sglang.launch_server \
          --model-path $MODEL_ID \
          --host 0.0.0.0 \
          --port 8000 \
          --grpc-mode \
          --disaggregation-mode prefill \
          --disaggregation-transfer-backend nixl \
          --disaggregation-bootstrap-port 8998
    resources:
      gpu: H200

  - count: 1..8
    scaling:
      metric: rps
      target: 2
    image: ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10
    commands:
      - |
        python3 -m sglang.launch_server \
          --model-path $MODEL_ID \
          --host 0.0.0.0 \
          --port 8000 \
          --grpc-mode \
          --disaggregation-mode decode \
          --disaggregation-transfer-backend nixl
    resources:
      gpu: H200

port: 8000
```

</div>

To use the [Mooncake](https://github.com/kvcache-ai/Mooncake) transfer backend, set `--disaggregation-transfer-backend mooncake`.
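
As a rough sketch, the prefill worker's `commands` entry with Mooncake might look like the fragment below. It assumes every other flag stays as in the NIXL configuration and that the chosen image ships Mooncake support; neither assumption is confirmed here. The decode worker would presumably need the same backend value.

```yaml
# Fragment of the prefill replica group only (assumption: all other flags unchanged)
commands:
  - |
    python3 -m sglang.launch_server \
      --model-path $MODEL_ID \
      --host 0.0.0.0 \
      --port 8000 \
      --grpc-mode \
      --disaggregation-mode prefill \
      --disaggregation-transfer-backend mooncake \
      --disaggregation-bootstrap-port 8998
```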

=== "AMD"

78 changes: 78 additions & 0 deletions mkdocs/docs/examples/inference/vllm.md
@@ -124,6 +124,84 @@ curl http://127.0.0.1:3000/proxy/services/main/qwen36/v1/chat/completions \

> If a [gateway](../../concepts/gateways.md) is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at `https://qwen36.<gateway domain>/`.

## Configuration options

### PD disaggregation

To run vLLM with [PD disaggregation](https://docs.vllm.ai/en/latest/serving/disagg_prefill.html), use replica groups: one for [Shepherd Model Gateway (SMG)](https://docs.sglang.io/advanced_features/sgl_model_gateway.html), one for prefill workers, and one for decode workers.

<div editor-title="pd.dstack.yml">

```yaml
type: service
name: prefill-decode

env:
  - HF_TOKEN
  - MODEL_ID=zai-org/GLM-4.5-Air-FP8

replicas:
  - count: 1
    python: "3.12"
    commands:
      - pip install smg
      - |
        smg launch \
          --pd-disaggregation \
          --model-path $MODEL_ID \
          --enable-igw \
          --host 0.0.0.0 \
          --port 8000 \
          --prefill-policy cache_aware
    router:
      type: sglang
    resources:
      cpu: 4

  - count: 1..4
    scaling:
      metric: rps
      target: 3
    image: ghcr.io/lightseekorg/smg:1.4.1-vllm-v0.18.0
    commands:
      - |
        python3 -m vllm.entrypoints.grpc_server \
          --model "$MODEL_ID" \
          --host 0.0.0.0 \
          --port 8000 \
          --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'
    resources:
      gpu: H200

  - count: 1..8
    scaling:
      metric: rps
      target: 2
    image: ghcr.io/lightseekorg/smg:1.4.1-vllm-v0.18.0
    commands:
      - |
        python3 -m vllm.entrypoints.grpc_server \
          --model "$MODEL_ID" \
          --host 0.0.0.0 \
          --port 8000 \
          --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
    resources:
      gpu: H200

port: 8000
```

</div>

> To use the [Mooncake](https://github.com/kvcache-ai/Mooncake) transfer backend, set `"kv_connector": "MooncakeConnector"` in `--kv-transfer-config`.
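
For illustration, a prefill worker's `commands` entry with the Mooncake connector could look roughly like this. The sketch assumes that `kv_role` and the remaining flags stay as in the NIXL configuration and that the image includes Mooncake; treat both as assumptions. The decode worker would change `kv_connector` the same way while keeping `kv_role` set to `kv_consumer`.

```yaml
# Fragment of the prefill replica group only (assumption: only kv_connector changes)
commands:
  - |
    python3 -m vllm.entrypoints.grpc_server \
      --model "$MODEL_ID" \
      --host 0.0.0.0 \
      --port 8000 \
      --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'
```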

Currently, auto-scaling supports only `rps` (requests per second) as the metric. Support for TTFT (time to first token) and ITL (inter-token latency) metrics is coming soon.

!!! info "Cluster"
PD disaggregation requires the service to run in a fleet with `placement` set to `cluster`, because the replicas require an interconnect between instances.

While the prefill and decode replicas run on GPUs, the router replica requires a CPU instance in the same cluster.
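
A minimal fleet sketch with cluster placement is shown below. The fleet name and node count are hypothetical, and how GPU and CPU instances are provisioned within one cluster depends on the backend, so treat this as a starting point rather than a complete configuration.

```yaml
type: fleet
# Hypothetical fleet name
name: pd-fleet

# Illustrative node count; size it for the router, prefill, and decode replicas
nodes: 3
# Provision the instances so that they are interconnected
placement: cluster
```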

## What's next?

1. Read about [services](../../concepts/services.md) and [gateways](../../concepts/gateways.md)