feat: resilient OpenAI client - timeout, retries, circuit breaker, bulkhead, metrics #261
…ker, bulkhead, and metrics

- New src/config/openai.ts: reads all resilience settings from env vars
- New src/lib/resilientOpenAIClient.ts: per-request AbortController timeout, exponential backoff + jitter retries honoring Retry-After on 429, circuit breaker (closed→open→half-open), concurrency bulkhead, prom-client metrics, structured Winston logging, FEATURE_FLAG_RESILIENT_OPENAI bypass
- Update ai.service.ts: openAi property now uses ResilientOpenAIClient wrapper
- New src/lib/resilientOpenAIClient.spec.ts: 22 tests covering all features
- New docs/resilient-openai.md: env vars, behavior, rollout plan
- Add prom-client@15.1.3 dependency

- Clarify ResilientOpenAIError re-throw comment in retry loop
- Fix spelling: Honour -> Honor
- Add comment explaining Math.max(0, ms) defensive guard in sleep()
- Remove unused noop helper; inline it at the call site
- Fix doc table spelling: honoured -> honored
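
The retry timing these commits describe (exponential backoff with jitter, `Retry-After` honored on 429, and the `Math.max(0, ms)` guard in `sleep()`) might look roughly like the sketch below. The function names, the 30 s cap, and the choice of "full jitter" are assumptions made for illustration; the PR's actual code is not reproduced here.

```typescript
// Hypothetical helpers; names and the 30 s cap are illustrative assumptions.

/** Parse a Retry-After value (delta-seconds or HTTP-date) into milliseconds. */
export function parseRetryAfterMs(
  value: string | null | undefined,
  nowMs: number = Date.now(),
): number | undefined {
  if (value == null || value.trim() === "") return undefined;
  const seconds = Number(value);
  if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000);
  const dateMs = Date.parse(value);
  return Number.isNaN(dateMs) ? undefined : Math.max(0, dateMs - nowMs);
}

/** Exponential backoff with full jitter: a random delay below min(cap, base * 2^attempt). */
export function backoffDelayMs(
  attempt: number,
  baseMs: number,
  capMs: number = 30000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

/** Choose the wait before the next attempt; a server-provided Retry-After wins. */
export function nextDelayMs(
  attempt: number,
  baseMs: number,
  retryAfter?: string | null,
): number {
  const serverHint = parseRetryAfterMs(retryAfter);
  return serverHint !== undefined ? serverHint : backoffDelayMs(attempt, baseMs);
}

/** Sleep helper; Math.max(0, ms) guards against a negative delay reaching setTimeout. */
export function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, Math.max(0, ms)));
}
```

Injecting `random` and `nowMs` keeps the delay computation deterministic in tests, which is one way to cover retry behavior without real waiting.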
Pull request overview
Introduces a ResilientOpenAIClient wrapper in the backend to harden OpenAI calls against transient/sustained failures (timeouts, retries, circuit breaker, concurrency bulkhead) and adds Prometheus instrumentation plus unit tests. The wrapper is integrated into AIService via a narrow OpenAIClientLike interface to keep call sites unchanged.
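
As an illustration of the per-request timeout mentioned in the overview, here is a minimal sketch built on `AbortController`. `withTimeout` and `TimeoutError` are hypothetical names, and the sketch assumes the wrapped call accepts an `AbortSignal` and rejects when it is aborted; a call that ignores the signal would not be interrupted by this approach.

```typescript
// Hypothetical names; the PR's own error type is ResilientOpenAIError.
export class TimeoutError extends Error {
  readonly code = "TIMEOUT";
  constructor(timeoutMs: number) {
    super(`request exceeded ${timeoutMs} ms`);
    this.name = "TimeoutError";
  }
}

/**
 * Run `call` with a fresh AbortController and abort it after `timeoutMs`.
 * Assumes `call` rejects once its signal is aborted.
 */
export async function withTimeout<T>(
  call: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await call(controller.signal);
  } catch (err) {
    // If our timer fired, report a typed timeout instead of the raw abort error.
    if (controller.signal.aborted) throw new TimeoutError(timeoutMs);
    throw err;
  } finally {
    clearTimeout(timer); // never leave the timer pending after the call settles
  }
}
```

A caller would pass the signal through to the underlying request, for example `withTimeout((signal) => doRequest(signal), 10000)`, where `doRequest` stands in for whatever SDK call is being wrapped.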
Changes:
- Added `ResilientOpenAIClient` with timeout, retry/backoff, circuit breaker, concurrency limiting, logging, and prom-client metrics.
- Added env-driven configuration (`packages/backend/src/config/openai.ts`) and unit tests for the wrapper.
- Updated `AIService` to instantiate the resilient wrapper around the OpenAI SDK client; added `prom-client` dependency.
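
To make the closed→open→half-open cycle concrete, the sketch below shows one simplified circuit breaker. It is an assumption-laden illustration: it opens after a number of consecutive failures and allows a single probe once a fixed cool-down has elapsed, and it does not model the separate window and probe-interval settings the PR describes. The class and method names are hypothetical.

```typescript
export type BreakerState = "closed" | "open" | "half-open";

// Simplified, hypothetical breaker; not the PR's implementation.
export class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly openMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, openMs: number, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.openMs = openMs;
    this.now = now;
  }

  currentState(): BreakerState {
    return this.state;
  }

  /** True if a call may proceed. Moves open -> half-open once the cool-down has elapsed. */
  allowRequest(): boolean {
    if (this.state === "closed") return true;
    if (this.state === "open" && this.now() - this.openedAt >= this.openMs) {
      this.state = "half-open"; // let exactly one probe through
      return true;
    }
    return false; // still open, or a probe is already in flight
  }

  /** Any success closes the circuit and resets the failure count. */
  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  /** A failed probe, or reaching the threshold, (re)opens the circuit. */
  recordFailure(): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}
```

A wrapper would call `allowRequest()` before each request, fail fast when it returns `false`, and report the outcome with `recordSuccess()` or `recordFailure()`. Passing the clock in as `now` lets tests advance time without real delays.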
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| packages/backend/src/lib/resilientOpenAIClient.ts | New resilient wrapper implementation (timeouts, retries, circuit breaker, bulkhead, metrics/logging). |
| packages/backend/src/lib/resilientOpenAIClient.spec.ts | New Vitest unit tests covering retry/timeout/circuit/bulkhead/metrics behaviors. |
| packages/backend/src/config/openai.ts | New env-var configuration loader with defaults for resilience knobs and feature flag. |
| packages/backend/src/ai/ai.service.ts | Switches openAi to the resilient wrapper while keeping .responses.create usage unchanged. |
| packages/backend/package.json | Adds prom-client dependency for metrics. |
| package-lock.json | Locks new dependency graph for prom-client (and transitive deps). |
| docs/resilient-openai.md | Documents env vars, behavior, and rollout/rollback plan for the resilient client. |
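
For readers who want a picture of what an env-var configuration loader like the one listed for `packages/backend/src/config/openai.ts` could look like, here is a hedged sketch. The env var names come from the PR description; the function and interface names are hypothetical, and every default except the 10 s timeout, the concurrency of 10, and the flag defaulting to `true` is a placeholder rather than the PR's value.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical shape; field names are illustrative.
export interface OpenAIResilienceConfig {
  enabled: boolean;
  timeoutMs: number;
  retries: number;
  backoffBaseMs: number;
  concurrency: number;
  breakerFailures: number;
  breakerWindowMs: number;
  breakerProbeMs: number;
}

/** Read a non-negative integer, falling back when the value is unset or invalid. */
function intFromEnv(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number.parseInt(raw, 10);
  return Number.isFinite(parsed) && parsed >= 0 ? parsed : fallback;
}

/** Treat "false" and "0" (any case) as false; anything else set is true. */
function boolFromEnv(env: Env, key: string, fallback: boolean): boolean {
  const raw = env[key]?.trim().toLowerCase();
  if (raw === undefined || raw === "") return fallback;
  return raw !== "false" && raw !== "0";
}

export function loadOpenAIConfig(env: Env): OpenAIResilienceConfig {
  return {
    enabled: boolFromEnv(env, "FEATURE_FLAG_RESILIENT_OPENAI", true),
    timeoutMs: intFromEnv(env, "OPENAI_TIMEOUT_MS", 10000),
    concurrency: intFromEnv(env, "OPENAI_CONCURRENCY", 10),
    // The defaults below are placeholders chosen for this sketch only.
    retries: intFromEnv(env, "OPENAI_RETRIES", 3),
    backoffBaseMs: intFromEnv(env, "OPENAI_BACKOFF_BASE_MS", 200),
    breakerFailures: intFromEnv(env, "CIRCUIT_BREAKER_FAILURES", 5),
    breakerWindowMs: intFromEnv(env, "CIRCUIT_BREAKER_WINDOW_MS", 60000),
    breakerProbeMs: intFromEnv(env, "CIRCUIT_BREAKER_PROBE_MS", 30000),
  };
}
```

In a Node process this would typically be called as `loadOpenAIConfig(process.env)`; taking the environment as a parameter keeps the loader easy to unit test.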
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@copilot fix failing checks
Fixed in the latest commit. Two CI jobs were failing:
Transient or sustained OpenAI failures can cascade and take down Moonbeam entirely. This adds a `ResilientOpenAIClient` wrapper that absorbs those failures and provides graceful degradation.

Resilience features

- Timeout: `AbortController` per request; default 10 s (`OPENAI_TIMEOUT_MS`)
- Retries: exponential backoff with jitter (`OPENAI_RETRIES`, `OPENAI_BACKOFF_BASE_MS`); `Retry-After` header honored on 429
- Circuit breaker: opens at a configurable failure threshold (`CIRCUIT_BREAKER_FAILURES`), half-open probe after configurable window + probe interval (`CIRCUIT_BREAKER_WINDOW_MS`, `CIRCUIT_BREAKER_PROBE_MS`)
- Bulkhead: rejects calls beyond `OPENAI_CONCURRENCY` (default 10) instead of queueing unbounded work
- Graceful degradation: `ResilientOpenAIError` with typed codes (`CIRCUIT_OPEN`, `TIMEOUT`, `CONCURRENCY_REJECTED`) so callers can surface degraded UX without crashing the process
- Metrics: `prom-client` counters + histogram: `openai_requests_total`, `openai_retries_total`, `openai_failures_total`, `openai_circuit_open_total`, `openai_latency_seconds`

Integration

`AIService.openAi` is now typed as `OpenAIClientLike` (exposes `responses.create`) and instantiated as a `ResilientOpenAIClient` wrapping the underlying SDK client. All existing call sites are unchanged.

Feature flag / rollback

Set `FEATURE_FLAG_RESILIENT_OPENAI=false` to bypass the wrapper entirely and delegate directly to the SDK, which gives instant rollback without a deploy.

New files

- `src/config/openai.ts`
- `src/lib/resilientOpenAIClient.ts`
- `src/lib/resilientOpenAIClient.spec.ts`
- `docs/resilient-openai.md`

Original prompt
Implement a resilient OpenAI client wrapper for the TypeScript/Node codebase in repository dev-chat/mocker. The goal is to prevent transient or repeated OpenAI failures from taking down Moonbeam/the whole service by introducing timeouts, retries, a circuit breaker, concurrency (bulkhead) limits, graceful degradation, and observability. The agent should search the repository for existing OpenAI usage (packages named `openai`, `openai-client`, or direct HTTP calls to api.openai.com) and either wrap those call-sites or replace imports to use the new resilient client. Do NOT hard-assign owners.

Requirements (high-level):

Resilient wrapper implementation (TypeScript):

- … `src/lib/resilientOpenAIClient.ts` (or similar canonical location). Export a drop-in compatible client API (async `request`/`createChatCompletion`/`responses` etc. or a wrapper around the existing OpenAI client class) so existing call sites can be switched with minimal changes.
- … `Retry-After` header for 429 responses.
- … `prom-client` (or integrate with the repo's existing telemetry): counters for openai_requests_total, openai_retries_total, openai_failures_total, openai_circuit_open_total, openai_latency_seconds histogram. Make metric names configurable or follow the default names above.

Integration plan:

- … `new OpenAI(...)` or direct HTTP calls with the resilient wrapper import, or
- … `src/lib/openaiAdapter.ts`) that imports the resilient client and re-exports the same surface API to minimize changes.
- … `FEATURE_FLAG_RESILIENT_OPENAI` (default `true`). When false, the wrapper should delegate directly to the original client for easier rollback.

Tests:

Documentation & config:

- … `docs/resilient-openai.md` describing env vars and behavior, and how to enable/disable via the feature flag.
- … `src/config/openai.ts` that reads env vars with sensible defaults.

Rollout / migration plan (include in PR description and a short section in docs):

- … `FEATURE_FLAG_RESILIENT_OPENAI` feature flag.

Implementation notes / libraries suggestions (agent may choose alternatives if they fit better):

- … `opossum` or implement a small custom circuit-breaker if preferred.
- … `p-limit` or `bottleneck`.
- … `Retry-After` semantics on 429.
- … `prom-client` for Prometheus; if the repo already uses Datadog/other S...

This pull request was created from Copilot chat.
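
The prompt suggests `p-limit` or `bottleneck` for the bulkhead, while the PR description says the wrapper rejects excess calls instead of queueing them. A dependency-free sketch of that reject-on-overflow behavior, paired with a typed error code callers can branch on, is given below. `Bulkhead` and `SketchResilientError` are hypothetical stand-ins; the real `ResilientOpenAIError` may have a different shape.

```typescript
export type ResilientErrorCode = "CIRCUIT_OPEN" | "TIMEOUT" | "CONCURRENCY_REJECTED";

// Stand-in for the PR's ResilientOpenAIError; only the code names come from the PR.
export class SketchResilientError extends Error {
  readonly code: ResilientErrorCode;
  constructor(code: ResilientErrorCode, message: string) {
    super(message);
    this.name = "SketchResilientError";
    this.code = code;
  }
}

/** Caps in-flight calls; rejects immediately when the cap is reached (no queue). */
export class Bulkhead {
  private inFlight = 0;
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.inFlight >= this.limit) {
      throw new SketchResilientError(
        "CONCURRENCY_REJECTED",
        `limit of ${this.limit} in-flight calls reached`,
      );
    }
    this.inFlight += 1;
    try {
      return await task();
    } finally {
      this.inFlight -= 1; // free the slot whether the task succeeded or failed
    }
  }
}

/** Example of caller-side degradation: map typed codes to a fallback message. */
export function fallbackMessage(err: unknown): string | undefined {
  const code = (err as { code?: string } | null)?.code;
  if (code === "CONCURRENCY_REJECTED" || code === "CIRCUIT_OPEN" || code === "TIMEOUT") {
    return "The assistant is temporarily unavailable. Please try again shortly.";
  }
  return undefined; // unknown errors should be handled or re-thrown by the caller
}
```

Rejecting instead of queueing keeps memory and latency bounded during an outage, at the cost of surfacing more errors to callers, which is why a typed code that maps to a degraded response is useful here.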