feat: resilient OpenAI client — timeout, retries, circuit breaker, bulkhead, metrics #261

Merged
sfreeman422 merged 7 commits into master from copilot/implement-resilient-openai-client on Jun 23, 2026

Copilot AI commented Jun 23, 2026

Transient or sustained OpenAI failures can cascade and take down Moonbeam entirely. This adds a ResilientOpenAIClient wrapper that absorbs those failures and provides graceful degradation.

Resilience features

  • Timeout: AbortController per request; default 10 s (OPENAI_TIMEOUT_MS)
  • Retries: exponential backoff + full jitter, default 3 attempts (OPENAI_RETRIES, OPENAI_BACKOFF_BASE_MS); Retry-After header honored on 429
  • Circuit breaker: opens after 5 consecutive failures (CIRCUIT_BREAKER_FAILURES), half-open probe after configurable window + probe interval (CIRCUIT_BREAKER_WINDOW_MS, CIRCUIT_BREAKER_PROBE_MS)
  • Concurrency bulkhead: rejects excess calls beyond OPENAI_CONCURRENCY (default 10) instead of queueing unbounded work
  • Graceful degradation: throws ResilientOpenAIError with typed codes (CIRCUIT_OPEN, TIMEOUT, CONCURRENCY_REJECTED) so callers can surface degraded UX without crashing the process
  • Metrics: prom-client counters + histogram: openai_requests_total, openai_retries_total, openai_failures_total, openai_circuit_open_total, openai_latency_seconds
  • Structured logging: Winston child logger on all state transitions and failures
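
As an illustration of the circuit-breaker item above, here is a minimal sketch of the closed → open → half-open transitions. It is not the code from this PR: the class name `SimpleCircuitBreaker`, the injectable clock, and the use of a single `openWindowMs` (rather than separate window and probe settings) are assumptions made to keep the example small.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

// Hypothetical sketch: counts consecutive failures and admits one probe after the open window.
class SimpleCircuitBreaker {
  private state: BreakerState = 'closed';
  private consecutiveFailures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly openWindowMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, openWindowMs: number, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.openWindowMs = openWindowMs;
    this.now = now;
  }

  // Returns true if a call may proceed right now.
  allowRequest(): boolean {
    if (this.state === 'closed') return true;
    if (this.state === 'open' && this.now() - this.openedAt >= this.openWindowMs) {
      this.state = 'half-open'; // admit exactly one probe call
      return true;
    }
    return false; // still open, or a probe is already in flight
  }

  // A success (including a successful probe) closes the circuit and resets the counter.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.state = 'closed';
  }

  // A failed probe, or reaching the threshold, (re)opens the circuit.
  recordFailure(): void {
    this.consecutiveFailures += 1;
    if (this.state === 'half-open' || this.consecutiveFailures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = this.now();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}
```

A caller would check `allowRequest()` before each outbound call, fail fast with a typed error when it returns false, and report the outcome through `recordSuccess()` or `recordFailure()`.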

Integration

AIService.openAi is now typed as OpenAIClientLike (exposes responses.create) and instantiated as a ResilientOpenAIClient wrapping the underlying SDK client. All existing call sites are unchanged.

// Before
openAi = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// After — same API surface, full resilience applied
openAi: OpenAIClientLike = new ResilientOpenAIClient(
  new OpenAI({ apiKey: process.env.OPENAI_API_KEY }),
);
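
To show how a wrapper can keep the same `responses.create` surface while adding behavior, the sketch below wraps a client-like object and applies a per-request timeout with AbortController. The interface shape, the `RequestOptions` type, and the `TimeoutWrapper` class are assumptions for illustration; the real `OpenAIClientLike` and `ResilientOpenAIClient` in this PR may differ.

```typescript
// Hypothetical narrow interface: only the one method the call sites use.
interface RequestOptions {
  signal?: AbortSignal;
}

interface OpenAIClientLike {
  responses: {
    create(params: Record<string, unknown>, options?: RequestOptions): Promise<unknown>;
  };
}

// Wraps any client with the same shape and aborts calls that exceed timeoutMs.
class TimeoutWrapper implements OpenAIClientLike {
  readonly responses: OpenAIClientLike['responses'];

  constructor(inner: OpenAIClientLike, timeoutMs: number) {
    this.responses = {
      create: async (params, options) => {
        const controller = new AbortController();
        const timer = setTimeout(() => controller.abort(), timeoutMs);
        try {
          // Note: this sketch replaces any caller-supplied signal with its own.
          return await inner.responses.create(params, { ...options, signal: controller.signal });
        } finally {
          clearTimeout(timer); // always release the timer, on success or failure
        }
      },
    };
  }
}
```

Because the wrapper satisfies the same interface as the object it wraps, a property typed as `OpenAIClientLike` can hold either one, which is what lets existing call sites stay unchanged.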

Feature flag / rollback

Set FEATURE_FLAG_RESILIENT_OPENAI=false to bypass the wrapper entirely and delegate directly to the SDK — instant rollback without a deploy.
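
A rough sketch of the bypass decision follows. It assumes the flag defaults to enabled and that only the literal string `false` (case-insensitive) disables it; the helper names `isResilienceEnabled` and `selectClient` are hypothetical and not taken from the PR.

```typescript
type Env = Record<string, string | undefined>;

// Flag is on unless explicitly set to "false" (assumed parsing rule).
function isResilienceEnabled(env: Env): boolean {
  const raw = env['FEATURE_FLAG_RESILIENT_OPENAI'];
  if (raw === undefined) return true;
  return raw.trim().toLowerCase() !== 'false';
}

// Picks the resilient path when enabled, otherwise the direct SDK path.
function selectClient<T>(env: Env, direct: T, resilient: T): T {
  return isResilienceEnabled(env) ? resilient : direct;
}
```

In a Node service this would typically be called with `process.env`. Whether a flag change takes effect without a restart depends on when the value is read, which this sketch does not address.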

New files

| Path | Purpose |
| --- | --- |
| src/config/openai.ts | Env-var config with defaults |
| src/lib/resilientOpenAIClient.ts | Wrapper implementation |
| src/lib/resilientOpenAIClient.spec.ts | 22 unit tests (retry, timeout, circuit breaker, concurrency, metrics) |
| docs/resilient-openai.md | Env-var reference + rollout plan |

Original prompt

Implement a resilient OpenAI client wrapper for the TypeScript/Node codebase in repository dev-chat/mocker. The goal is to prevent transient or repeated OpenAI failures from taking down Moonbeam/the whole service by introducing timeouts, retries, a circuit breaker, concurrency (bulkhead) limits, graceful degradation, and observability. The agent should search the repository for existing OpenAI usage (packages named openai, openai-client, or direct HTTP calls to api.openai.com) and either wrap those call-sites or replace imports to use the new resilient client. Do NOT hard-assign owners.

Requirements (high-level):

  1. Resilient wrapper implementation (TypeScript):

    • New module: src/lib/resilientOpenAIClient.ts (or similar canonical location). Export a drop-in compatible client API (async request/createChatCompletion/responses etc. or a wrapper around the existing OpenAI client class) so existing call sites can be switched with minimal changes.
    • Features:
      • Per-request timeout (default 10s, configurable via env var OPENAI_TIMEOUT_MS).
      • Retries with exponential backoff + full jitter. Default: 3 retries, base backoff 500ms. Honor Retry-After header for 429 responses.
      • Circuit breaker: open after N consecutive failures (default 5). While open, short-circuit calls and return a graceful error. Automatically probe after a configured interval (default probe every 30s, open window 60s). Make thresholds configurable via env vars (CIRCUIT_BREAKER_FAILURES, CIRCUIT_BREAKER_WINDOW_MS, CIRCUIT_BREAKER_PROBE_MS).
      • Concurrency/bulkhead limiter: limit concurrent outbound OpenAI calls per instance (default 10), configurable via OPENAI_CONCURRENCY.
      • Graceful fallback: when failures exceed thresholds or calls short-circuited, return a standard transient-error object or throw a specific ResilientOpenAIError that upstream code can detect to provide degraded UX. Do not crash the process on downstream errors.
      • Logging: structured logs on failures, retries, circuit-breaker state transitions, and timeouts. Use existing logging utilities in the repo if present; otherwise add a minimal logger that uses the repo's logger conventions.
      • Metrics: instrument with Prometheus-compatible metrics using prom-client (or integrate with the repo's existing telemetry): counters for openai_requests_total, openai_retries_total, openai_failures_total, openai_circuit_open_total, openai_latency_seconds histogram. Make metric names configurable or follow the default names above.
  2. Integration plan:

    • The agent should search the repo for existing OpenAI usage and either:
      • Replace new OpenAI(...) or direct HTTP calls with the resilient wrapper import, or
      • Add a small adapter file (e.g., src/lib/openaiAdapter.ts) that imports the resilient client and re-exports the same surface API to minimize changes.
    • Add feature flag: use an env var FEATURE_FLAG_RESILIENT_OPENAI (default true). When false, the wrapper should delegate directly to the original client for easier rollback.
  3. Tests:

    • Unit tests with Jest (or the repo's test framework) covering:
      • Retry behavior (retries on transient errors and honors Retry-After on 429).
      • Timeout behavior (Aborts request after timeout).
      • Circuit-breaker transitions (closed -> open after configured failures, short-circuited responses, and probe behavior).
      • Concurrency limiter (requests beyond limit are queued or rejected per chosen strategy).
    • Add a light integration test that mocks the OpenAI responses and asserts the wrapper does not throw/unhandled-rejection and produces metrics.
  4. Documentation & config:

    • Add README or a short doc at docs/resilient-openai.md describing env vars and behavior, and how to enable/disable via the feature flag.
    • Expose configuration via a src/config/openai.ts that reads env vars with sensible defaults.
  5. Rollout / migration plan (include in PR description and a short section in docs):

    • Create the change behind the FEATURE_FLAG_RESILIENT_OPENAI feature flag.
    • Deploy to staging first and run smoke tests.
    • Gradually enable in production (start with a single instance or small percentage if you have feature-flagging infra; otherwise enable in a canary host), monitor metrics (errors/retries/circuit open), logs, and Sentry/Datadog.
    • If issues occur, flip feature flag off to rollback.
  6. Implementation notes / libraries suggestions (agent may choose alternatives if they fit better):

    • Circuit breaker: use opossum or implement a small custom circuit-breaker if preferred.
    • Concurrency limiter: use p-limit or bottleneck.
    • Retries/backoff: implement custom backoff with jitter so we can honor Retry-After semantics on 429.
    • Timeouts: use AbortController and ensure the underlying OpenAI client supports AbortSignal (wrap fetch/axios as needed).
    • Metrics: prom-client for Prometheus; if the repo already uses Datadog/other S...

This pull request was created from Copilot chat.

Copilot AI added 2 commits June 23, 2026 15:18
…ker, bulkhead, and metrics

- New src/config/openai.ts: reads all resilience settings from env vars
- New src/lib/resilientOpenAIClient.ts: per-request AbortController timeout,
  exponential backoff + jitter retries honoring Retry-After on 429, circuit
  breaker (closed→open→half-open), concurrency bulkhead, prom-client metrics,
  structured Winston logging, FEATURE_FLAG_RESILIENT_OPENAI bypass
- Update ai.service.ts: openAi property now uses ResilientOpenAIClient wrapper
- New src/lib/resilientOpenAIClient.spec.ts: 22 tests covering all features
- New docs/resilient-openai.md: env vars, behavior, rollout plan
- Add prom-client@15.1.3 dependency
- Clarify ResilientOpenAIError re-throw comment in retry loop
- Fix spelling: Honour -> Honor
- Add comment explaining Math.max(0, ms) defensive guard in sleep()
- Remove unused noop helper; inline it at the call site
- Fix doc table spelling: honoured -> honored
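
The retry delay described in these commit notes (exponential backoff with full jitter, Retry-After honored on 429, and a `Math.max(0, ms)` guard) could look roughly like the sketch below. The function name `computeDelayMs`, its parameters, and the handling of both the seconds and HTTP-date forms of Retry-After are assumptions, not the PR's actual code.

```typescript
// Returns how long to wait before retry number `attempt` (0-based).
function computeDelayMs(
  attempt: number,
  baseMs: number,
  retryAfter?: string,
  random: () => number = Math.random,
  nowMs: number = Date.now(),
): number {
  if (retryAfter !== undefined && retryAfter.trim() !== '') {
    // Retry-After may be a number of seconds...
    const seconds = Number(retryAfter);
    if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000);
    // ...or an HTTP-date; wait until that moment.
    const dateMs = Date.parse(retryAfter);
    if (!Number.isNaN(dateMs)) return Math.max(0, dateMs - nowMs);
  }
  // Full jitter: uniform random delay in [0, base * 2^attempt).
  const ceiling = baseMs * 2 ** attempt;
  return Math.floor(random() * ceiling);
}
```

With a base of 500 ms, the upper bound grows as 500, 1000, 2000 ms across attempts 0, 1, 2, and the random draw spreads concurrent retries across that range. The `Math.max(0, ...)` calls keep a negative or already-past Retry-After value from producing a negative delay.
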
Copilot AI changed the title from "[WIP] Implement resilient OpenAI client wrapper for TypeScript" to "feat: resilient OpenAI client — timeout, retries, circuit breaker, bulkhead, metrics" Jun 23, 2026
Copilot AI requested a review from sfreeman422 June 23, 2026 15:22
sfreeman422 marked this pull request as ready for review June 23, 2026 15:23
Copilot AI review requested due to automatic review settings June 23, 2026 15:23

Copilot AI left a comment

Pull request overview

Introduces a ResilientOpenAIClient wrapper in the backend to harden OpenAI calls against transient/sustained failures (timeouts, retries, circuit breaker, concurrency bulkhead) and adds Prometheus instrumentation plus unit tests. The wrapper is integrated into AIService via a narrow OpenAIClientLike interface to keep call sites unchanged.

Changes:

  • Added ResilientOpenAIClient with timeout, retry/backoff, circuit breaker, concurrency limiting, logging, and prom-client metrics.
  • Added env-driven configuration (packages/backend/src/config/openai.ts) and unit tests for the wrapper.
  • Updated AIService to instantiate the resilient wrapper around the OpenAI SDK client; added prom-client dependency.
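
The concurrency limiting and graceful-degradation behavior summarized above can be sketched as a bulkhead that rejects, rather than queues, calls beyond the limit. The names `ResilientErrorSketch`, `Bulkhead`, and `withFallback` are hypothetical stand-ins; only the three error codes come from the PR description.

```typescript
type ErrorCode = 'CIRCUIT_OPEN' | 'TIMEOUT' | 'CONCURRENCY_REJECTED';

// Hypothetical typed error that callers can detect with instanceof.
class ResilientErrorSketch extends Error {
  readonly code: ErrorCode;

  constructor(code: ErrorCode, message: string) {
    super(message);
    this.name = 'ResilientErrorSketch';
    this.code = code;
  }
}

// Rejects immediately once `limit` calls are in flight (no queueing).
class Bulkhead {
  private inFlight = 0;
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.inFlight >= this.limit) {
      throw new ResilientErrorSketch('CONCURRENCY_REJECTED', `limit of ${this.limit} reached`);
    }
    this.inFlight += 1;
    try {
      return await task();
    } finally {
      this.inFlight -= 1; // free the slot on success or failure
    }
  }
}

// Caller-side degradation: typed errors become a fallback value, anything else propagates.
async function withFallback<T>(call: () => Promise<T>, fallback: T): Promise<T> {
  try {
    return await call();
  } catch (err) {
    if (err instanceof ResilientErrorSketch) return fallback;
    throw err;
  }
}
```

Rejecting instead of queueing keeps memory and latency bounded during an outage, at the cost of surfacing more degraded responses under load.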

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| packages/backend/src/lib/resilientOpenAIClient.ts | New resilient wrapper implementation (timeouts, retries, circuit breaker, bulkhead, metrics/logging). |
| packages/backend/src/lib/resilientOpenAIClient.spec.ts | New Vitest unit tests covering retry/timeout/circuit/bulkhead/metrics behaviors. |
| packages/backend/src/config/openai.ts | New env-var configuration loader with defaults for resilience knobs and feature flag. |
| packages/backend/src/ai/ai.service.ts | Switches openAi to the resilient wrapper while keeping .responses.create usage unchanged. |
| packages/backend/package.json | Adds prom-client dependency for metrics. |
| package-lock.json | Locks new dependency graph for prom-client (and transitive deps). |
| docs/resilient-openai.md | Documents env vars, behavior, and rollout/rollback plan for the resilient client. |

Comment thread packages/backend/src/lib/resilientOpenAIClient.ts Outdated
Comment thread packages/backend/src/lib/resilientOpenAIClient.ts Outdated
Comment thread packages/backend/src/config/openai.ts
Comment thread packages/backend/src/lib/resilientOpenAIClient.ts
sfreeman422 and others added 3 commits June 23, 2026 11:29
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@sfreeman422

Collaborator

@copilot fix failing checks

Copilot AI commented Jun 23, 2026

Author

> @copilot fix failing checks

Fixed in the latest commit. Two CI jobs were failing:

  1. Lint (Back End) — Replaced all as type assertions in resilientOpenAIClient.ts with narrowing:

    • METRIC_NAMES: switched from as const to an explicit readonly literal type annotation
    • Reflect.get(error as object, ...) → removed unnecessary casts (TypeScript already narrows unknown to object after the typeof guards)
    • Reflect.get(headers as object, ...) as string | undefined → extracted retryAfterRaw and narrowed with a typeof === 'string' check
    • ResilientOpenAIError in the spec was imported but only used in string comparisons; updated assertions to toBeInstanceOf(ResilientOpenAIError) so the import is a live value reference
  2. Format Check (All) — Ran Prettier on the 3 flagged files (resilientOpenAIClient.ts, resilientOpenAIClient.spec.ts, docs/resilient-openai.md).
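
The assertion-free narrowing described in the lint fix can be sketched as follows. The error shape assumed here (a `headers` property holding a plain object with a lowercase `retry-after` key) and the function name `readRetryAfter` are illustrative assumptions; the SDK's real error objects may expose headers differently.

```typescript
// Extracts a Retry-After value from an unknown error without any `as` casts.
function readRetryAfter(error: unknown): string | undefined {
  // typeof + null check narrows `unknown` to `object`, which Reflect.get accepts.
  if (typeof error !== 'object' || error === null) return undefined;

  const headers: unknown = Reflect.get(error, 'headers');
  if (typeof headers !== 'object' || headers === null) return undefined;

  // Keep the raw value as `unknown` and narrow it with a typeof check.
  const retryAfterRaw: unknown = Reflect.get(headers, 'retry-after');
  return typeof retryAfterRaw === 'string' ? retryAfterRaw : undefined;
}
```

Annotating the intermediate values as `unknown` forces each step through a runtime check, so the function returns `undefined` for any input that does not match the assumed shape instead of trusting a cast.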

sfreeman422 merged commit 180c970 into master Jun 23, 2026
10 checks passed