Skip to content

Add controller error-path tests, channel error reporting, and document OpenAI resilience config#262

Closed
Copilot wants to merge 4 commits into
masterfrom
copilot/harden-openai-api-error-handling
Closed

Add controller error-path tests, channel error reporting, and document OpenAI resilience config#262
Copilot wants to merge 4 commits into
masterfrom
copilot/harden-openai-api-error-handling

Conversation

Copilot AI commented Jun 23, 2026

Copy link
Copy Markdown

The app lacked test coverage proving OpenAI failures don't crash the /image and /prompt-with-history endpoints, AI errors were not being reported to the #muzzlefeedback channel for non-429 failures, and the resilience configuration env vars were undocumented.

Changes

Error Reporting (ai.controller.ts)

  • All three AI endpoints (/text, /image, /prompt-with-history) now post to #muzzlefeedback whenever the underlying service rejects, ensuring every AI failure is visible in the error channel — not just 429 rate-limit errors

Tests (ai.controller.spec.ts)

  • Added sendMessage to the WebService mock
  • Updated error-path tests for all three endpoints to assert both the ephemeral user message and the #muzzlefeedback channel notification are sent on failure
  • Added error-path tests for /image and /prompt-with-history — each verifies the endpoint still returns 200 when the underlying service rejects

Documentation (README.md)

  • Added all 8 OpenAI resilience env vars to the .env example block with inline defaults
  • Added an OpenAI Error Handling subsection under AI Features describing retry/circuit-breaker/timeout/bulkhead behavior and a full config-variable reference table:
Variable Default Description
OPENAI_TIMEOUT_MS 10000 Per-request abort deadline
OPENAI_RETRIES 3 Max retries on 429/5xx/timeout
OPENAI_BACKOFF_BASE_MS 500 Exponential backoff base with full jitter
CIRCUIT_BREAKER_FAILURES 5 Consecutive failures to open the circuit
CIRCUIT_BREAKER_WINDOW_MS 60000 Open window before a probe is allowed
CIRCUIT_BREAKER_PROBE_MS 30000 Min interval between probes
OPENAI_CONCURRENCY 10 Max in-flight calls (bulkhead)
FEATURE_FLAG_RESILIENT_OPENAI true Bypass all resilience logic when false
Original prompt

Goal: Prevent the application from crashing when OpenAI API calls fail. Find all usages of OpenAI (official SDKs, direct fetch/http calls to api.openai.com, or wrappers referencing 'openai', 'OpenAI', 'ChatCompletion', 'Completions', 'chat.completions', 'gpt', etc.) and harden them so transient or permanent OpenAI errors do not crash the app.

Tasks for the coding agent:

  1. Discover OpenAI integration points

    • Search the repo for references to common OpenAI call patterns (Node.js 'openai' package, python 'openai', fetch/XHR to api.openai.com, axios/fetch wrappers, and any custom client code).
  2. Add robust error handling

    • Wrap all OpenAI API calls with try/catch (or Promise.catch) so thrown exceptions or rejected promises never bubble up uncaught and crash the process.
    • For server-side code, ensure errors are translated into well-formed HTTP responses (e.g., 4xx/5xx) with a sanitized, user-friendly error message and an internal log that contains the full error for debugging.
    • For client-side code (if present), show a non-blocking UI error/toast rather than letting the app throw.
  3. Implement retry for transient errors

    • For transient errors (HTTP 429, 502, 503, 504, or network timeouts), add exponential backoff retry (configurable max retries, default 2-3 retries).
    • Do not retry for 4xx errors indicating bad requests or invalid authentication.
  4. Logging and observability

    • Add or reuse existing logging to record OpenAI errors with error class/type, status code, and request id (if present in response headers). Keep sensitive data out of logs (no API keys or user content).
  5. Tests

    • Add unit tests or integration tests that simulate OpenAI errors (mock the OpenAI client or HTTP responses) and verify: (a) the app does not crash, (b) proper error response or UI message is produced, (c) retry behavior for transient errors.
  6. Documentation

    • Update README or relevant docs with a short note about the new error handling and configuration (env vars for max retries, backoff base, etc.).

Acceptance criteria:

  • Running the app and inducing OpenAI failures (rate limit or 500-level errors) should not crash the process.
  • Errors should be logged for debugging and surfaced to callers/UI as a friendly message.
  • A PR is opened with the changes, clear commit messages, and tests demonstrating the fix.

Notes and constraints for the agent:

  • Do not change API keys or secrets. Use config/environment variables for retry limits/backoff settings.
  • Keep changes minimal and targeted to OpenAI call sites. If multiple languages/frameworks are present, prioritize the primary server-side language found in the repository.
  • If anything is ambiguous (e.g., which branch to base the PR on), open the PR against the repository default branch.
  • Make the code idiomatic for the project's language and follow existing style patterns.

Please create a PR with these changes and include a concise PR description summarizing the problem and the fix. Do not assign reviewers or DRI in the PR text.

This pull request was created from Copilot chat.

Copy link
Copy Markdown
Collaborator

@copilot i am only concerned with two things: not crashing the app and reporting openAI errors to <#CKCL8UMJA> channel

Copilot AI changed the title [WIP] Add robust error handling for OpenAI API calls Add controller error-path tests and document OpenAI resilience config Jun 23, 2026
Copilot AI requested a review from sfreeman422 June 23, 2026 16:06

Copy link
Copy Markdown
Collaborator

@copilot this does not achieve the goal - ensure that errors from openai result in a slack post to <#CKCL8UMJA>, and that errors are not thrown/uncaught in such a way that the service crashes

@sfreeman422 sfreeman422 marked this pull request as ready for review June 23, 2026 16:11
Copilot AI review requested due to automatic review settings June 23, 2026 16:11

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds missing error-path coverage for AI controller endpoints to ensure OpenAI/service failures don’t crash Slack slash-command handlers, and documents the resilience-related environment variables/behavior for the OpenAI wrapper.

Changes:

  • Added controller tests verifying /image and /prompt-with-history still return 200 and send an ephemeral message when the underlying service rejects.
  • Tightened the existing /text error-path test to assert sendEphemeral arguments.
  • Expanded README with the OpenAI resilience env vars and an “OpenAI Error Handling” subsection describing retry/circuit-breaker/timeout/bulkhead behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
README.md Documents OpenAI resilience env vars and explains error-handling behavior.
packages/backend/src/ai/ai.controller.spec.ts Adds/strengthens controller-level tests for error paths across AI endpoints.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread README.md Outdated

Copilot AI commented Jun 23, 2026

Copy link
Copy Markdown
Author

@copilot i am only concerned with two things: not crashing the app and reporting openAI errors to <#CKCL8UMJA> channel

Done. The /text, /image, and /prompt-with-history error handlers in ai.controller.ts now each call webService.sendMessage('#muzzlefeedback', ...) alongside the ephemeral user message — so all AI failures are reported to that channel. Tests updated to assert both calls. (commit Report all AI errors to #muzzlefeedback channel in controller error handlers)

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI changed the title Add controller error-path tests and document OpenAI resilience config Add controller error-path tests, channel error reporting, and document OpenAI resilience config Jun 23, 2026
@sfreeman422 sfreeman422 requested a review from Copilot June 23, 2026 16:15

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

Comment on lines +78 to +81
void webService.sendMessage(
'#muzzlefeedback',
`AI /image error for <@${user_id}>: ${e instanceof Error ? e.message : String(e)}`,
);
Comment on lines +98 to +101
void webService.sendMessage(
'#muzzlefeedback',
`AI /prompt-with-history error for <@${request.user_id}>: ${e instanceof Error ? e.message : String(e)}`,
);
Comment thread README.md
| **Timeout** | Each individual request is aborted after `OPENAI_TIMEOUT_MS` milliseconds. |
| **Concurrency limiter** | At most `OPENAI_CONCURRENCY` calls are in-flight at once; excess requests are rejected immediately. |
| **Prometheus metrics** | Counters and histograms for requests, retries, failures, circuit opens, and latency are registered via `prom-client` (expose the registry on a `/metrics` endpoint if desired). |
| **Slack alert on 429** | When the rate-limit error reaches the application layer, an alert is posted to `#muzzlefeedback`. |
await Promise.resolve();
expect(sendEphemeral).toHaveBeenCalled();
expect(sendEphemeral).toHaveBeenCalledWith('C1', expect.stringContaining('hello'), 'U1');
expect(sendMessage).toHaveBeenCalledWith('#muzzlefeedback', expect.stringContaining('boom'));

await Promise.resolve();
expect(sendEphemeral).toHaveBeenCalledWith('C1', expect.stringContaining('draw a cat'), 'U1');
expect(sendMessage).toHaveBeenCalledWith('#muzzlefeedback', expect.stringContaining('image boom'));

await Promise.resolve();
expect(sendEphemeral).toHaveBeenCalledWith('C1', expect.stringContaining('summarize'), 'U1');
expect(sendMessage).toHaveBeenCalledWith('#muzzlefeedback', expect.stringContaining('history boom'));
Comment on lines +58 to +61
void webService.sendMessage(
'#muzzlefeedback',
`AI /text error for <@${user_id}>: ${e instanceof Error ? e.message : String(e)}`,
);
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants