diff --git a/mkdocs/docs/concepts/services.md b/mkdocs/docs/concepts/services.md
index 757546483..000ad7de8 100644
--- a/mkdocs/docs/concepts/services.md
+++ b/mkdocs/docs/concepts/services.md
@@ -420,7 +420,10 @@ Below is an example for running `zai-org/GLM-4.5-Air-FP8` on `H200`:
-    > With the `sglang` router, you can use SGLang prefill and decode workers. Support for vLLM and TensorRT-LLM workers is coming soon.
+    > SMG workers connect to the router over HTTP or gRPC. The example above uses HTTP. SGLang workers support both modes; vLLM workers support gRPC only.
+
+    ??? info "gRPC mode"
+        Over gRPC, workers run from SMG images that bundle a specific backend version (SGLang or vLLM), and `smg launch` needs `--enable-igw` and `--model-path` so the router can register the workers. See the full configurations in [SGLang PD disaggregation](../examples/inference/sglang.md#pd-disaggregation) and [vLLM PD disaggregation](../examples/inference/vllm.md#pd-disaggregation).
 === "Dynamo"
diff --git a/mkdocs/docs/examples/inference/sglang.md b/mkdocs/docs/examples/inference/sglang.md
index 1ea9e6e06..5bf25ed5d 100644
--- a/mkdocs/docs/examples/inference/sglang.md
+++ b/mkdocs/docs/examples/inference/sglang.md
@@ -211,7 +211,80 @@ To run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/
-    > With the `sglang` router, you can use SGLang prefill and decode workers. Support for vLLM and TensorRT-LLM workers is coming soon.
+    ??? info "gRPC mode"
+
+        SGLang workers can also connect to the SMG router over gRPC. Run the workers from an SMG image that bundles the SGLang version, pass `--grpc-mode`, and add `--enable-igw` and `--model-path` to `smg launch` so the router can register them.
+
+
+
+        ```yaml
+        type: service
+        name: prefill-decode
+
+        env:
+          - HF_TOKEN
+          - MODEL_ID=zai-org/GLM-4.5-Air-FP8
+
+        replicas:
+          - count: 1
+            # For now replica group with router must have count: 1
+            python: "3.12"
+            commands:
+              - pip install smg
+              - |
+                smg launch \
+                  --enable-igw \
+                  --pd-disaggregation \
+                  --model-path $MODEL_ID \
+                  --host 0.0.0.0 \
+                  --port 8000 \
+                  --prefill-policy cache_aware
+            router:
+              type: sglang
+            resources:
+              cpu: 4
+
+          - count: 1..4
+            scaling:
+              metric: rps
+              target: 3
+            image: ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10
+            commands:
+              - |
+                python3 -m sglang.launch_server \
+                  --model-path $MODEL_ID \
+                  --host 0.0.0.0 \
+                  --port 8000 \
+                  --grpc-mode \
+                  --disaggregation-mode prefill \
+                  --disaggregation-transfer-backend nixl \
+                  --disaggregation-bootstrap-port 8998
+            resources:
+              gpu: H200
+
+          - count: 1..8
+            scaling:
+              metric: rps
+              target: 2
+            image: ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10
+            commands:
+              - |
+                python3 -m sglang.launch_server \
+                  --model-path $MODEL_ID \
+                  --host 0.0.0.0 \
+                  --port 8000 \
+                  --grpc-mode \
+                  --disaggregation-mode decode \
+                  --disaggregation-transfer-backend nixl
+            resources:
+              gpu: H200
+
+        port: 8000
+        ```
+
+
+
+        To use the [Mooncake](https://github.com/kvcache-ai/Mooncake) transfer backend, set `--disaggregation-transfer-backend mooncake`.
 === "AMD"
diff --git a/mkdocs/docs/examples/inference/vllm.md b/mkdocs/docs/examples/inference/vllm.md
index dd6909ba6..fe1575ed8 100644
--- a/mkdocs/docs/examples/inference/vllm.md
+++ b/mkdocs/docs/examples/inference/vllm.md
@@ -124,6 +124,84 @@ curl http://127.0.0.1:3000/proxy/services/main/qwen36/v1/chat/completions \
 > If a [gateway](../../concepts/gateways.md) is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at `https://qwen36./`.
+## Configuration options
+
+### PD disaggregation
+
+To run vLLM with [PD disaggregation](https://docs.vllm.ai/en/latest/serving/disagg_prefill.html), use replica groups: one for [Shepherd Model Gateway (SMG)](https://docs.sglang.io/advanced_features/sgl_model_gateway.html), one for prefill workers, and one for decode workers.
+
+
+
+```yaml
+type: service
+name: prefill-decode
+
+env:
+  - HF_TOKEN
+  - MODEL_ID=zai-org/GLM-4.5-Air-FP8
+
+replicas:
+  - count: 1
+    python: "3.12"
+    commands:
+      - pip install smg
+      - |
+        smg launch \
+          --pd-disaggregation \
+          --model-path $MODEL_ID \
+          --enable-igw \
+          --host 0.0.0.0 \
+          --port 8000 \
+          --prefill-policy cache_aware
+    router:
+      type: sglang
+    resources:
+      cpu: 4
+
+  - count: 1..4
+    scaling:
+      metric: rps
+      target: 3
+    image: ghcr.io/lightseekorg/smg:1.4.1-vllm-v0.18.0
+    commands:
+      - |
+        python3 -m vllm.entrypoints.grpc_server \
+          --model "$MODEL_ID" \
+          --host 0.0.0.0 \
+          --port 8000 \
+          --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'
+    resources:
+      gpu: H200
+
+  - count: 1..8
+    scaling:
+      metric: rps
+      target: 2
+    image: ghcr.io/lightseekorg/smg:1.4.1-vllm-v0.18.0
+    commands:
+      - |
+        python3 -m vllm.entrypoints.grpc_server \
+          --model "$MODEL_ID" \
+          --host 0.0.0.0 \
+          --port 8000 \
+          --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
+    resources:
+      gpu: H200
+
+port: 8000
+```
+
+
+
+> To use the [Mooncake Transfer](https://github.com/kvcache-ai/Mooncake) backend, set `"kv_connector": "MooncakeConnector"` in `--kv-transfer-config`.
+
+Currently, auto-scaling only supports `rps` as the metric. TTFT and ITL metrics are coming soon.
+
+!!! info "Cluster"
+    PD disaggregation requires the service to run in a fleet with `placement` set to `cluster`, because the replicas require an interconnect between instances.
+
+    While the prefill and decode replicas run on GPUs, the router replica requires a CPU instance in the same cluster.
+
 ## What's next?
 1. Read about [services](../../concepts/services.md) and [gateways](../../concepts/gateways.md)