Skip to content

RR1-T102 Fixes and hardening and systemd support for cli#16

Open
ucswift wants to merge 1 commit into
masterfrom
develop
Open

RR1-T102 Fixes and hardening and systemd support for cli#16
ucswift wants to merge 1 commit into
masterfrom
develop

Conversation

@ucswift

@ucswift ucswift commented Jun 26, 2026

Copy link
Copy Markdown
Member

Summary

This PR hardens the relay's long-lived LiveKit voice modes (radio, record, dispatch) and adds proper systemd support for the CLI.

Graceful shutdown under systemd / Docker

  • Adds SIGTERM handling in the console host, so systemctl stop and docker stop trigger the same clean teardown path as Ctrl+C instead of killing the process mid-work.

Resilience layer for LiveKit voice modes

  • Introduces a retry-with-exponential-backoff loop and a circuit breaker in RelayServiceBase that wraps the run for LiveKit modes. Faulted runs are automatically restarted after a jittered, capped delay; after a configurable number of consecutive quick failures the circuit opens and the service faults.
  • A run lasting longer than the "healthy run" threshold resets the failure counter, so only rapid repeated failures trip the breaker.
  • Backoff and circuit-breaker thresholds are configurable via new ResilienceOptions.

Hard disconnect detection and reconnection

  • All three voice modes now listen for LiveKit connection state changes. A transient SDK "reconnecting" only degrades the status; a hard disconnect throws so the resilience layer rejoins the room.
  • Recorder and radio modes use linked cancellation tokens / task completion sources so a hard disconnect promptly unwinds the loop and triggers reconnect.

Sentry telemetry for mode lifecycle

  • Adds a new telemetry abstraction (IRelayModeTelemetry) with Sentry-backed and no-op implementations. Mode start, retry, fault, and stop events are reported to Sentry when a DSN is configured.

Tests

  • Adds unit tests covering backoff calculation, circuit-breaker tripping, transient-failure recovery, and the disabled-resilience path.

@Resgrid-Bot

Resgrid-Bot commented Jun 26, 2026

Copy link
Copy Markdown

Code Review Completed! 🔥

The code review was successfully completed based on your current configurations.

Kody Guide: Usage and Configuration
Interacting with Kody
  • Request a Review: Ask Kody to review your PR manually by adding a comment with the @kody start-review command at the root of your PR.

  • Validate Business Logic: Ask Kody to validate your code against business rules by adding a comment with the @kody -v business-logic command.

  • Provide Feedback: Help Kody learn and improve by reacting to its comments with a 👍 for helpful suggestions or a 👎 if improvements are needed.

Current Kody Configuration
Review Options

The following review options are enabled or disabled:

Options Enabled
Bug
Performance
Security
Business Logic

Access your configuration settings here.

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Adds LiveKit-mode retry/backoff and telemetry plumbing to relay services, hard-disconnect handling in the voice modes, new resilience tests, and Linux systemd deployment updates with SIGTERM-driven console shutdown.

Changes

LiveKit resilience and shutdown

Layer / File(s) Summary
Resilience contracts
Resgrid.Relay.Engine/Configuration/RelayHostOptions.cs, Resgrid.Relay.Engine/Telemetry/IRelayModeTelemetry.cs, Resgrid.Relay.Engine/Telemetry/NullRelayModeTelemetry.cs, Resgrid.Relay.Engine/Telemetry/RelayModeTelemetry.cs
RelayHostOptions gains Resilience; the relay mode telemetry contract, no-op implementation, and DSN-based factory are added.
Sentry telemetry
Resgrid.Relay.Engine/Telemetry/SentryRelayModeTelemetry.cs
SentryRelayModeTelemetry initializes Sentry from the provided options and records mode lifecycle, retry, fault, and disposal events.
Relay service resilience
Resgrid.Relay.Engine/Services/RelayServiceBase.cs, Resgrid.Relay.Engine/Services/DispatchRelayService.cs, Resgrid.Relay.Engine/Services/RadioRelayService.cs, Resgrid.Relay.Engine/Services/RecordRelayService.cs
RelayServiceBase now selects telemetry by mode, wraps ExecuteAsync in retry/backoff/circuit-breaker logic, and the relay services now report LiveKit mode explicitly.
LiveKit disconnect handling
Resgrid.Relay.Engine/Voice/DispatchVoiceMode.cs, Resgrid.Relay.Engine/Voice/RadioMode.cs, Resgrid.Relay.Engine/Voice/RecordMode.cs
The three voice modes now watch session.ConnectionChanged, treat reconnecting as degraded, and throw on non-reconnecting disconnects so the resilience layer can restart them.
Test coverage and internals
Resgrid.Relay.Engine/Properties/AssemblyInfo.cs, Resgrid.Audio.Voice.Tests/Resgrid.Audio.Voice.Tests.csproj, Resgrid.Audio.Voice.Tests/RelayResilienceTests.cs
The engine is exposed to the test assembly, the test project references it, and new NUnit coverage exercises backoff and LiveKit resilience behavior.
Systemd shutdown support
Resgrid.Audio.Relay.Console/Program.cs, deploy/systemd/README.md, deploy/systemd/relay.env.example, deploy/systemd/resgrid-relay.service
The console app now cancels on SIGTERM, and the systemd README, environment example, and unit file are populated for Linux deployment.

Sequence Diagram(s)

sequenceDiagram
  participant RelayServiceBase
  participant IRelayModeTelemetry
  participant DispatchVoiceMode
  participant session
  RelayServiceBase->>IRelayModeTelemetry: ModeStarting(mode)
  RelayServiceBase->>DispatchVoiceMode: ExecuteAsync(cancellationToken)
  session-->>DispatchVoiceMode: ConnectionChanged(reconnecting)
  DispatchVoiceMode->>DispatchVoiceMode: status.LiveKit = Degraded
  session-->>DispatchVoiceMode: ConnectionChanged(hard disconnect)
  DispatchVoiceMode-->>RelayServiceBase: InvalidOperationException
  RelayServiceBase->>IRelayModeTelemetry: ModeRetrying(mode, ex, attempt, nextDelay)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • Resgrid/Relay#13: Modifies the same LiveKit voice mode entrypoints, and this PR extends those flows with hard-disconnect handling and resilience retries.

Poem

I hop through retries, soft and bright,
with SIGTERM moon and Sentry light.
If LiveKit blinks, I leap and mend,
then roll my carrots to the end. 🐇

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 4.88% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title matches the main changes: CLI hardening, resilience fixes, and added systemd support.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch develop

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (1)
Resgrid.Relay.Engine/Services/RelayServiceBase.cs (1)

28-31: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Virtual member IsLiveKitMode is invoked from the base constructor.

Reading IsLiveKitMode (Line 30) during base construction runs the derived override before the derived constructor body executes. It's safe today since all overrides return a constant, but it's a fragile pattern (CA2214) if an override later depends on instance state.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Resgrid.Relay.Engine/Services/RelayServiceBase.cs` around lines 28 - 31, The
base constructor in RelayServiceBase is calling the virtual IsLiveKitMode
property, which can invoke derived overrides before the derived constructor has
finished. Remove the virtual dispatch from the constructor by moving the
decision into a non-virtual initializer, a constructor parameter, or a
static/helper method used by RelayServiceBase, so the telemetry selection no
longer depends on override behavior during construction.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@Resgrid.Relay.Engine/Telemetry/RelayModeTelemetry.cs`:
- Around line 13-16: The Create method in RelayModeTelemetry currently returns a
new SentryRelayModeTelemetry per service, which causes each instance to own
SentrySdk.Init and disposal. Refactor SentryRelayModeTelemetry so Sentry SDK
initialization, flush, and disposal are managed by a single shared process-wide
instance instead of per-service ownership, and update Create to reuse that
shared telemetry instance for all LiveKit services.

In `@Resgrid.Relay.Engine/Voice/DispatchVoiceMode.cs`:
- Around line 56-72: `DispatchVoiceMode` leaves `status.LiveKit` stuck at
Degraded after a transient reconnect because the `onConnectionChanged` handler
returns immediately when `change.Connected` is true. Update the
`ConnectionChanged` logic so the same handler that sets Degraded on the
"reconnecting" reason also restores `status.LiveKit` to Connected once the
session reports a healthy connected state, using the existing `disconnect`,
`status`, and `onConnectionChanged` flow.

In `@Resgrid.Relay.Engine/Voice/RadioMode.cs`:
- Around line 86-99: The connection-state handler in RadioMode’s
onConnectionChanged leaves status.LiveKit stuck at Degraded after an
auto-reconnect because the change.Connected path returns early. Update the
reconnect flow so that when the SDK reports a successful reconnect, the status
is restored to Connected (or equivalent recovery state) before returning, and
keep the degraded transition only for the reconnecting case; use the
onConnectionChanged handler and the status.LiveKit updates as the key
touchpoints.

In `@Resgrid.Relay.Engine/Voice/RecordMode.cs`:
- Around line 55-68: `RecordMode` leaves `status.LiveKit` stuck in `Degraded`
after an SDK reconnect because the `VoiceConnectionStateChange` handler returns
immediately on `change.Connected == true` and never restores the state. Update
the reconnect recovery path in the `handler` inside `RecordMode` so that when
the connection becomes connected again it sets `status.LiveKit` back to
`ConnectionState.Connected` before returning, while still preserving the
existing `reconnecting` degraded handling and hard-disconnect restart behavior.
- Around line 84-86: The disconnect path in RecordMode should not throw during a
concurrent shutdown; mirror the RadioMode cancellation guard. In the logic that
awaits Task.WhenAny for VoiceModeRuntime.WaitForCancellationAsync and
disconnect.Task, only throw the InvalidOperationException when
cancellationToken.IsCancellationRequested is false, so a Ctrl+C/SIGTERM shutdown
can complete cleanly without a spurious fault.

---

Nitpick comments:
In `@Resgrid.Relay.Engine/Services/RelayServiceBase.cs`:
- Around line 28-31: The base constructor in RelayServiceBase is calling the
virtual IsLiveKitMode property, which can invoke derived overrides before the
derived constructor has finished. Remove the virtual dispatch from the
constructor by moving the decision into a non-virtual initializer, a constructor
parameter, or a static/helper method used by RelayServiceBase, so the telemetry
selection no longer depends on override behavior during construction.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 802dd7a2-6b94-4576-a289-3fe7aede0078

📥 Commits

Reviewing files that changed from the base of the PR and between 4edaa95 and 7bb8622.

📒 Files selected for processing (19)
  • Resgrid.Audio.Relay.Console/Program.cs
  • Resgrid.Audio.Voice.Tests/RelayResilienceTests.cs
  • Resgrid.Audio.Voice.Tests/Resgrid.Audio.Voice.Tests.csproj
  • Resgrid.Relay.Engine/Configuration/RelayHostOptions.cs
  • Resgrid.Relay.Engine/Properties/AssemblyInfo.cs
  • Resgrid.Relay.Engine/Services/DispatchRelayService.cs
  • Resgrid.Relay.Engine/Services/RadioRelayService.cs
  • Resgrid.Relay.Engine/Services/RecordRelayService.cs
  • Resgrid.Relay.Engine/Services/RelayServiceBase.cs
  • Resgrid.Relay.Engine/Telemetry/IRelayModeTelemetry.cs
  • Resgrid.Relay.Engine/Telemetry/NullRelayModeTelemetry.cs
  • Resgrid.Relay.Engine/Telemetry/RelayModeTelemetry.cs
  • Resgrid.Relay.Engine/Telemetry/SentryRelayModeTelemetry.cs
  • Resgrid.Relay.Engine/Voice/DispatchVoiceMode.cs
  • Resgrid.Relay.Engine/Voice/RadioMode.cs
  • Resgrid.Relay.Engine/Voice/RecordMode.cs
  • deploy/systemd/README.md
  • deploy/systemd/relay.env.example
  • deploy/systemd/resgrid-relay.service

Comment on lines +13 to +16
public static IRelayModeTelemetry Create(RelayTelemetryOptions telemetry, ILogger logger) =>
string.IsNullOrWhiteSpace(telemetry?.Sentry?.Dsn)
? NullRelayModeTelemetry.Instance
: new SentryRelayModeTelemetry(telemetry.Sentry, telemetry.Environment, logger);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In Sentry .NET 6.4.1, is SentrySdk.Init(...)process-global, and what happens if it is called multiple times and each returnedIDisposable is later disposed independently in the same process?

💡 Result:

In Sentry.NET 6.4.1, SentrySdk.Init is effectively process-global [1][2]. The Sentry.NET SDK is designed to be initialized only once during the lifetime of an application [2]. When SentrySdk.Init is called multiple times, the following behaviors occur: 1. Initialization Conflict: Calling Init multiple times is not officially supported and can lead to runtime errors, silent failures, or unexpected behavior [3][4][5]. For example, issues have been reported where double-initialization triggers exceptions (e.g., ArgumentException regarding duplicate dictionary keys in middleware) or IO errors if features like caching are enabled [4][5]. 2. Disposable Behavior: The IDisposable returned by SentrySdk.Init is intended to be used to shut down the SDK and flush queued events [6][7][8]. If you call Init multiple times, each call typically returns a new IDisposable. Disposing of these independently in the same process can cause the SDK to be "disabled" or "shut down" prematurely [9][10]. Once the underlying Hub is disabled by a disposal, subsequent attempts to capture data using the SDK may fail or throw ObjectDisposedException [10]. In most modern.NET applications, you do not need to manually dispose of the IDisposable returned by SentrySdk.Init, as the SDK includes integrations (like AppDomainProcessExitIntegration) that handle flushing and shutdown automatically when the application exits [6][7][9]. Manual disposal is recommended only in specific edge cases, such as when those automatic integrations are disabled, or you need to perform unmonitored work after shutting down Sentry [6][9].

Citations:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Map the relevant files first.
git ls-files 'Resgrid.Relay.Engine/Telemetry/*' 'Resgrid.Relay.Engine/Services/*'

echo '--- outline: RelayModeTelemetry.cs ---'
ast-grep outline Resgrid.Relay.Engine/Telemetry/RelayModeTelemetry.cs --view expanded || true

echo '--- outline: SentryRelayModeTelemetry.cs ---'
ast-grep outline Resgrid.Relay.Engine/Telemetry/SentryRelayModeTelemetry.cs --view expanded || true

echo '--- outline: RelayServiceBase.cs ---'
ast-grep outline Resgrid.Relay.Engine/Services/RelayServiceBase.cs --view expanded || true

echo '--- relevant search ---'
rg -n "SentrySdk\.Init|Dispose\(|new SentryRelayModeTelemetry|RelayModeTelemetry\.Create|class SentryRelayModeTelemetry|class NullRelayModeTelemetry" Resgrid.Relay.Engine/Telemetry Resgrid.Relay.Engine/Services

Repository: Resgrid/Relay

Length of output: 1987


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Read only the small relevant slices.
sed -n '1,220p' Resgrid.Relay.Engine/Telemetry/SentryRelayModeTelemetry.cs
printf '\n---\n'
sed -n '1,180p' Resgrid.Relay.Engine/Telemetry/RelayModeTelemetry.cs
printf '\n---\n'
sed -n '1,220p' Resgrid.Relay.Engine/Services/RelayServiceBase.cs

Repository: Resgrid/Relay

Length of output: 13140


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Read only the small relevant slices.
sed -n '1,220p' Resgrid.Relay.Engine/Telemetry/SentryRelayModeTelemetry.cs
printf '\n---\n'
sed -n '1,180p' Resgrid.Relay.Engine/Telemetry/RelayModeTelemetry.cs
printf '\n---\n'
sed -n '1,220p' Resgrid.Relay.Engine/Services/RelayServiceBase.cs

Repository: Resgrid/Relay

Length of output: 13140


Share one Sentry SDK lifecycle for LiveKit services. The LiveKit path creates a new SentryRelayModeTelemetry per service, but that type calls SentrySdk.Init and later disposes the returned handle. Since Sentry’s SDK is process-wide, one service can shut down telemetry for another. Move init/flush/dispose to a single shared instance instead of per-service ownership.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Resgrid.Relay.Engine/Telemetry/RelayModeTelemetry.cs` around lines 13 - 16,
The Create method in RelayModeTelemetry currently returns a new
SentryRelayModeTelemetry per service, which causes each instance to own
SentrySdk.Init and disposal. Refactor SentryRelayModeTelemetry so Sentry SDK
initialization, flush, and disposal are managed by a single shared process-wide
instance instead of per-service ownership, and update Create to reuse that
shared telemetry instance for all LiveKit services.

Comment on lines +56 to +72
// Signals a hard LiveKit disconnect so the run throws and the resilience layer
// rejoins; a transient SDK "reconnecting" only degrades status.
var disconnect = new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously);
EventHandler<VoiceConnectionStateChange> onConnectionChanged = (_, change) =>
{
if (change.Connected)
return;
if (string.Equals(change.Reason, "reconnecting", StringComparison.OrdinalIgnoreCase))
{
if (status != null)
status.LiveKit = ConnectionState.Degraded;
return;
}
disconnect.TrySetResult();
};
session.ConnectionChanged += onConnectionChanged;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

LiveKit .NET SDK ConnectionChanged event Reconnected state fires with Connected true

💡 Result:

In the LiveKit ecosystem, the LiveKit.NET SDK (often used via the Livekit.Rtc package for real-time media) uses a ConnectionState enumeration to track the status of a room connection [1][2]. Regarding your observation that the connection appears to transition back to a "Connected" state (or indicates Connected = true) during a reconnection event, this is standard behavior for LiveKit SDKs [3]. 1. ConnectionState Enumeration: The SDK typically reports states such as Disconnected, Connecting, Reconnecting, and Connected [3][4][5]. 2. Reconnection Logic: When a connection is interrupted (e.g., due to network instability), the SDK enters a Reconnecting state [6][3]. If the SDK successfully performs a "quick reconnect" to resume the existing session, it transitions the state back to Connected [3]. 3. Why this happens: The Reconnected state you are referring to is often not a separate connection state but a conceptual outcome where the client successfully resumed the existing session without requiring a full re-join [6][3]. Because the session was successfully resumed, the official ConnectionState is Connected [3]. If you are monitoring ConnectionStateChanged events, you should expect to see the state transition from Connected to Reconnecting when the link drops, and then back to Connected once the connection is restored [3][5]. This indicates that the client has successfully regained its active session [6][3].

Citations:


🏁 Script executed:

# Inspect the target file and related LiveKit status updates
set -e
git ls-files 'Resgrid.Relay.Engine/Voice/DispatchVoiceMode.cs' 'Resgrid.Relay.Engine/Voice/*.cs' | sed -n '1,40p'
echo '---'
wc -l Resgrid.Relay.Engine/Voice/DispatchVoiceMode.cs
echo '---'
cat -n Resgrid.Relay.Engine/Voice/DispatchVoiceMode.cs | sed -n '1,220p'
echo '---'
rg -n "LiveKit\s*=\s*ConnectionState\.(Connected|Degraded|Disconnected|Connecting)|ConnectionChanged|reconnecting|Publish|status\." Resgrid.Relay.Engine/Voice -n

Repository: Resgrid/Relay

Length of output: 14040


🏁 Script executed:

# Search for broader status transitions and any reconnect handling elsewhere
set -e
rg -n "ConnectionState\.Connected|ConnectionState\.Degraded|ConnectionState\.Disconnected|ConnectionState\.Connecting|reconnecting|ConnectionChanged" Resgrid.Relay.Engine -g '*.cs'

Repository: Resgrid/Relay

Length of output: 4823


🏁 Script executed:

# Inspect the target file and related LiveKit status updates
set -e
git ls-files 'Resgrid.Relay.Engine/Voice/DispatchVoiceMode.cs' 'Resgrid.Relay.Engine/Voice/*.cs' | sed -n '1,40p'
echo '---'
wc -l Resgrid.Relay.Engine/Voice/DispatchVoiceMode.cs
echo '---'
cat -n Resgrid.Relay.Engine/Voice/DispatchVoiceMode.cs | sed -n '1,220p'
echo '---'
rg -n "LiveKit\s*=\s*ConnectionState\.(Connected|Degraded|Disconnected|Connecting)|ConnectionChanged|reconnecting|Publish|status\." Resgrid.Relay.Engine/Voice -n

Repository: Resgrid/Relay

Length of output: 14040


🏁 Script executed:

# Search for broader status transitions and any reconnect handling elsewhere
set -e
rg -n "ConnectionState\.Connected|ConnectionState\.Degraded|ConnectionState\.Disconnected|ConnectionState\.Connecting|reconnecting|ConnectionChanged" Resgrid.Relay.Engine -g '*.cs'

Repository: Resgrid/Relay

Length of output: 4823


Restore status.LiveKit on reconnect.

reconnecting correctly sets the pill to Degraded, but the early return on change.Connected never clears it. If no new calls are published after auto-reconnect, the status stays degraded even though LiveKit is healthy.

♻️ Reset to Connected on recovery
 				if (change.Connected)
-					return;
+				{
+					if (status != null)
+						status.LiveKit = ConnectionState.Connected;
+					return;
+				}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Signals a hard LiveKit disconnect so the run throws and the resilience layer
// rejoins; a transient SDK "reconnecting" only degrades status.
var disconnect = new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously);
EventHandler<VoiceConnectionStateChange> onConnectionChanged = (_, change) =>
{
if (change.Connected)
return;
if (string.Equals(change.Reason, "reconnecting", StringComparison.OrdinalIgnoreCase))
{
if (status != null)
status.LiveKit = ConnectionState.Degraded;
return;
}
disconnect.TrySetResult();
};
session.ConnectionChanged += onConnectionChanged;
// Signals a hard LiveKit disconnect so the run throws and the resilience layer
// rejoins; a transient SDK "reconnecting" only degrades status.
var disconnect = new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously);
EventHandler<VoiceConnectionStateChange> onConnectionChanged = (_, change) =>
{
if (change.Connected)
{
if (status != null)
status.LiveKit = ConnectionState.Connected;
return;
}
if (string.Equals(change.Reason, "reconnecting", StringComparison.OrdinalIgnoreCase))
{
if (status != null)
status.LiveKit = ConnectionState.Degraded;
return;
}
disconnect.TrySetResult();
};
session.ConnectionChanged += onConnectionChanged;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Resgrid.Relay.Engine/Voice/DispatchVoiceMode.cs` around lines 56 - 72,
`DispatchVoiceMode` leaves `status.LiveKit` stuck at Degraded after a transient
reconnect because the `onConnectionChanged` handler returns immediately when
`change.Connected` is true. Update the `ConnectionChanged` logic so the same
handler that sets Degraded on the "reconnecting" reason also restores
`status.LiveKit` to Connected once the session reports a healthy connected
state, using the existing `disconnect`, `status`, and `onConnectionChanged`
flow.

Comment on lines +86 to +99
EventHandler<VoiceConnectionStateChange> onConnectionChanged = (_, change) =>
{
if (change.Connected)
return;
if (string.Equals(change.Reason, "reconnecting", StringComparison.OrdinalIgnoreCase))
{
if (status != null)
status.LiveKit = ConnectionState.Degraded;
return;
}
disconnected = true;
try { disconnectCts.Cancel(); }
catch (ObjectDisposedException) { }
};

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

status.LiveKit stays Degraded after a successful auto-reconnect.

On reconnecting the handler degrades status, but when the SDK reconnects (change.Connected == true) it returns early (Line 88-89) without restoring Connected. Unlike dispatch mode there's no later success path that resets it, so the radio bridge will report Degraded permanently after the first transient reconnect.

♻️ Reset to Connected on recovery
 				if (change.Connected)
-					return;
+				{
+					if (status != null)
+						status.LiveKit = ConnectionState.Connected;
+					return;
+				}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
EventHandler<VoiceConnectionStateChange> onConnectionChanged = (_, change) =>
{
if (change.Connected)
return;
if (string.Equals(change.Reason, "reconnecting", StringComparison.OrdinalIgnoreCase))
{
if (status != null)
status.LiveKit = ConnectionState.Degraded;
return;
}
disconnected = true;
try { disconnectCts.Cancel(); }
catch (ObjectDisposedException) { }
};
EventHandler<VoiceConnectionStateChange> onConnectionChanged = (_, change) =>
{
if (change.Connected)
{
if (status != null)
status.LiveKit = ConnectionState.Connected;
return;
}
if (string.Equals(change.Reason, "reconnecting", StringComparison.OrdinalIgnoreCase))
{
if (status != null)
status.LiveKit = ConnectionState.Degraded;
return;
}
disconnected = true;
try { disconnectCts.Cancel(); }
catch (ObjectDisposedException) { }
};
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Resgrid.Relay.Engine/Voice/RadioMode.cs` around lines 86 - 99, The
connection-state handler in RadioMode’s onConnectionChanged leaves
status.LiveKit stuck at Degraded after an auto-reconnect because the
change.Connected path returns early. Update the reconnect flow so that when the
SDK reports a successful reconnect, the status is restored to Connected (or
equivalent recovery state) before returning, and keep the degraded transition
only for the reconnecting case; use the onConnectionChanged handler and the
status.LiveKit updates as the key touchpoints.

Comment on lines +55 to +68
EventHandler<VoiceConnectionStateChange> handler = (_, change) =>
{
if (change.Connected)
return;
// "reconnecting" is the SDK auto-recovering — degrade but keep running.
if (string.Equals(change.Reason, "reconnecting", StringComparison.OrdinalIgnoreCase))
{
if (status != null)
status.LiveKit = ConnectionState.Degraded;
return;
}
// Any other Connected=false is a hard disconnect — restart the recorder.
disconnect.TrySetResult(change.Reason);
};

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

status.LiveKit not restored to Connected after reconnect.

Same pattern as the other voice modes: a reconnecting transition sets Degraded (Line 62-63) but the change.Connected == true recovery path returns early (Line 57-58) without resetting, and record mode has no later success path to clear it.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Resgrid.Relay.Engine/Voice/RecordMode.cs` around lines 55 - 68, `RecordMode`
leaves `status.LiveKit` stuck in `Degraded` after an SDK reconnect because the
`VoiceConnectionStateChange` handler returns immediately on `change.Connected ==
true` and never restores the state. Update the reconnect recovery path in the
`handler` inside `RecordMode` so that when the connection becomes connected
again it sets `status.LiveKit` back to `ConnectionState.Connected` before
returning, while still preserving the existing `reconnecting` degraded handling
and hard-disconnect restart behavior.

Comment on lines +84 to +86
var completed = await Task.WhenAny(VoiceModeRuntime.WaitForCancellationAsync(cancellationToken), disconnect.Task).ConfigureAwait(false);
if (completed == disconnect.Task)
throw new InvalidOperationException($"LiveKit session disconnected ({disconnect.Task.Result}); restarting recorder");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Guard the disconnect throw against concurrent user shutdown.

RadioMode only throws when !cancellationToken.IsCancellationRequested, but here a server-side disconnect that races with a Ctrl+C/SIGTERM shutdown can make WhenAny resolve to disconnect.Task, throwing InvalidOperationException during a clean shutdown and surfacing a spurious fault/Degraded status in the resilience layer. Mirror the radio-mode guard.

♻️ Prefer cancellation over disconnect
 				var completed = await Task.WhenAny(VoiceModeRuntime.WaitForCancellationAsync(cancellationToken), disconnect.Task).ConfigureAwait(false);
-				if (completed == disconnect.Task)
+				if (completed == disconnect.Task && !cancellationToken.IsCancellationRequested)
 					throw new InvalidOperationException($"LiveKit session disconnected ({disconnect.Task.Result}); restarting recorder");
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
var completed = await Task.WhenAny(VoiceModeRuntime.WaitForCancellationAsync(cancellationToken), disconnect.Task).ConfigureAwait(false);
if (completed == disconnect.Task)
throw new InvalidOperationException($"LiveKit session disconnected ({disconnect.Task.Result}); restarting recorder");
var completed = await Task.WhenAny(VoiceModeRuntime.WaitForCancellationAsync(cancellationToken), disconnect.Task).ConfigureAwait(false);
if (completed == disconnect.Task && !cancellationToken.IsCancellationRequested)
throw new InvalidOperationException($"LiveKit session disconnected ({disconnect.Task.Result}); restarting recorder");
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Resgrid.Relay.Engine/Voice/RecordMode.cs` around lines 84 - 86, The
disconnect path in RecordMode should not throw during a concurrent shutdown;
mirror the RadioMode cancellation guard. In the logic that awaits Task.WhenAny
for VoiceModeRuntime.WaitForCancellationAsync and disconnect.Task, only throw
the InvalidOperationException when cancellationToken.IsCancellationRequested is
false, so a Ctrl+C/SIGTERM shutdown can complete cleanly without a spurious
fault.

Comment on lines +93 to +97
using var sigTermRegistration = PosixSignalRegistration.Create(PosixSignal.SIGTERM, context =>
{
context.Cancel = true;
cancellationTokenSource.Cancel();
});

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kody code-review Bug critical

PlatformNotSupportedException thrown at startup on Windows because PosixSignalRegistration.Create(PosixSignal.SIGTERM) runs without a platform guard and sits outside the try block in Main, causing an unhandled exception since the Engine targets net10.0-windows. Guard the registration with OperatingSystem.IsLinux() or OperatingSystem.IsMacOS() so the SIGTERM handler only registers on supported platforms.

PosixSignalRegistration sigTermRegistration = null;
if (OperatingSystem.IsLinux() || OperatingSystem.IsMacOS())
{
	sigTermRegistration = PosixSignalRegistration.Create(PosixSignal.SIGTERM, context =>
	{
		context.Cancel = true;
		cancellationTokenSource.Cancel();
	});
}
Prompt for LLM

File Resgrid.Audio.Relay.Console/Program.cs:

Line 93 to 97:

WHAT: PosixSignalRegistration.Create(PosixSignal.SIGTERM) is called without a platform guard and sits outside any try/catch, but SIGTERM is not a supported PosixSignal on Windows — .NET throws PlatformNotSupportedException. WHY: The Engine targets net10.0-windows, so running the CLI on Windows now crashes at startup (the throw is before the try block, so Main receives an unhandled exception). HOW: Guard the registration with OperatingSystem.IsLinux() (or wrap in try/catch PlatformNotSupportedException) so the SIGTERM handler only registers on platforms that support it.

Suggested Code:

PosixSignalRegistration sigTermRegistration = null;
if (OperatingSystem.IsLinux() || OperatingSystem.IsMacOS())
{
	sigTermRegistration = PosixSignalRegistration.Create(PosixSignal.SIGTERM, context =>
	{
		context.Cancel = true;
		cancellationTokenSource.Cancel();
	});
}

Talk to Kody by mentioning @kody

Was this suggestion helpful? React with 👍 or 👎 to help Kody learn from this interaction.

Comment on lines +71 to 75
session.ConnectionChanged += onConnectionChanged;

using var tts = new ResgridTtsClient(options.Tts, logger);

// TTS reachability is unverified until the first synthesis actually reaches the service:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kody code-review Bug high

Event handler leak on the error path: the ConnectionChanged subscription at line 71 sits outside the try block starting at line 92, so an exception during setup (new ResgridTtsClient or callsApi.GetActiveCallsAsync) bypasses the finally at line 177, leaving the handler attached and publisher.DisposeAsync() uncalled. Move the session.ConnectionChanged += onConnectionChanged subscription inside the existing try block so the finally always detaches and disposes.

using var tts = new ResgridTtsClient(options.Tts, logger);
// ...
var pollSeconds = Math.Max(5, options.DispatchVoice.PollSeconds);
logger.Information(...);

try
{
session.ConnectionChanged += onConnectionChanged;
while (!cancellationToken.IsCancellationRequested)
{
Prompt for LLM

File Resgrid.Relay.Engine/Voice/DispatchVoiceMode.cs:

Line 71 to 75:

WHAT: The `ConnectionChanged` handler is subscribed at line 71 (outside the `try` block that starts at line 92), so an exception during setup (lines 73-90 — `new ResgridTtsClient` or `callsApi.GetActiveCallsAsync`, a network call) skips the `finally` at line 177, leaving the handler attached and the publisher undisposed. WHY: on a transient API failure during backlog prime, the event handler leaks onto the session and `publisher.DisposeAsync()` is never called on that error path. HOW: move the `session.ConnectionChanged += onConnectionChanged` subscription inside the existing `try` block, or extend the `try` to start immediately after the subscription so the `finally` always detaches and disposes.

Suggested Code:

			using var tts = new ResgridTtsClient(options.Tts, logger);
			// ...
			var pollSeconds = Math.Max(5, options.DispatchVoice.PollSeconds);
			logger.Information(...);

			try
			{
			session.ConnectionChanged += onConnectionChanged;
			while (!cancellationToken.IsCancellationRequested)
			{

Talk to Kody by mentioning @kody

Was this suggestion helpful? React with 👍 or 👎 to help Kody learn from this interaction.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants