llmleaf is a llm proxy. It proxies different llm providers and their slighty different apis and converts it a a single api surface.
- fast
- efficient
- light-weight
- extensible
- One stable endpoint in front of every provider — consumers speak OpenAI, OpenRouter, or Anthropic dialects; llmleaf maps them to one internal model and back.
- Streaming-first (SSE); a non-streaming response is just a collected stream.
- Modalities: chat, embeddings, text-to-speech, speech-to-text, realtime (WebSocket), batch jobs.
- Per-model fallback chains with node-local, health-aware switchover — no consensus or shared state, so N nodes run behind a plain load balancer.
- Opt-in per request/provider: Anthropic prompt caching, a unified thinking/reasoning-effort ladder.
- Auth via HTTP-Basic key tokens (optional OAuth2/JWT); identity, limits, and usage ride an outbound control plane (pull verdicts, push usage). Fully operable from the config file alone.
- Native dialects: Anthropic, Google Gemini, Vertex AI, Cohere, Ollama, LM Studio.
- OpenAI-wire family: OpenAI, OpenRouter, Requesty, Groq, DeepSeek, xAI (Grok), Mistral, Together, Fireworks, Perplexity, Cerebras, Z.AI (GLM), Moonshot (Kimi), Azure OpenAI.
echofor local testing.
# Run with the embedded dev config (echo provider, key `local-dev:s3cret`)
cargo run -p llmleaf
# …or point at your own config
cargo run -p llmleaf -- llmleaf.tomlCopy llmleaf.example.toml, fill in provider credentials (use env:VAR indirection — secrets
never live in the file), and pass it as the argument. Container image: docker buildx bake image
(listens on :8080). Send a request:
curl localhost:8080/v1/chat/completions \
-H "Authorization: Bearer $(printf 'local-dev:s3cret' | base64)" \
-d '{"model":"demo","messages":[{"role":"user","content":"hi"}]}'See llmleaf.example.toml for the full configuration surface (providers, routes, keys, control plane).
Consumer endpoints (OpenAI-compatible unless noted):
| Endpoint | Purpose |
|---|---|
POST /v1/chat/completions |
Chat (SSE streaming) |
POST /v1/messages |
Anthropic Messages dialect |
POST /v1/embeddings |
Embeddings |
POST /v1/audio/speech, GET /v1/audio/voices |
Text-to-speech |
POST /v1/audio/transcriptions |
Speech-to-text |
GET /v1/realtime |
OpenAI Realtime (WebSocket) |
POST /v1/batches, GET /v1/batches/{id}[/results] |
Batch jobs |
GET /v1/models, GET /v1/openapi.json, GET /healthz |
Discovery & health |
Read-only admin (optional token): GET /admin/routes, /admin/health, /admin/keys.
Official client SDKs for 6 languages live in clients/.
Two strictly separated planes. The core (data plane) is the proxy; the control plane is reached only outbound — the core pulls identity/verdicts and pushes usage, never the reverse. See SOUL.md for the full design constitution.
flowchart LR
Cons["Consumers<br/>OpenAI · OpenRouter · Anthropic"] --> Surf["Compat surfaces"]
subgraph Core["llmleaf core — data plane"]
direction LR
Surf --> Auth["authenticate"] --> In["map in"] --> Route["route + fallback"] --> Stream["stream"] --> Out["map out"] --> Ev["emit events"]
end
Route --> Prov["Providers<br/>compiled-in traits · WASM plugins"]
Prov --> Up["LLM providers"]
Ctrl[["Control plane (outbound)"]]
Auth -. "pull identity / verdicts" .-> Ctrl
Ev -. "push usage" .-> Ctrl
Copyright (C) 2026 Fionn Langhans fionnlanghans@codefionn.eu.
llmleaf is free software licensed under the GNU Lesser General Public License,
version 3 or later (LGPL-3.0-or-later). The full text is in COPYING.LESSER
(the LGPLv3 terms) together with COPYING (the GPLv3 it builds on).
Clients are licensed under MIT and APACHE-2.0 license.
