Skip to content

chore: hotfix #3361 to v1.1.x#3362

Open
julienrbrt wants to merge 4 commits into
v1.1.xfrom
julien/hotfix-ws
Open

chore: hotfix #3361 to v1.1.x#3362
julienrbrt wants to merge 4 commits into
v1.1.xfrom
julien/hotfix-ws

Conversation

@julienrbrt

Copy link
Copy Markdown
Member

Overview

Bring #3361 to a v1.1.x branch as main is already v1.2.x because of the key rotation PR.

@julienrbrt julienrbrt changed the base branch from main to v1.1.x June 19, 2026 11:29
@github-actions

github-actions Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed⏩ skipped✅ passed✅ passedJun 19, 2026, 2:41 PM

@claude

claude Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Claude finished @julienrbrt's task in 2m 31s —— View job


Code Review

  • Gather context
  • Review changed files
  • Provide feedback

This hotfix backports PR #3361 to v1.1.x, adding an HTTP polling fallback and transparent WS reconnect when the DA WebSocket connection is unavailable. The approach is solid. A few observations below.


Summary

The change adds two new behaviours:

  1. pkg/da/jsonrpc/Client: when the initial WS dial fails, the client falls back to HTTP and spawns a background goroutine (retryWSLoop) that attempts to re-establish WS every 30 s. On success it swaps the transport function pointers in-place (tryUpgradeWS).

  2. block/internal/da/Subscriber: a new SupportsSubscribe() interface method lets the subscriber route to pollLoop (HTTP polling) instead of followLoop (WS subscription) when the transport doesn't support channels. pollLoop re-checks SupportsSubscribe() on every tick and transitions to followLoop once the transport is upgraded.


Observations

pkg/da/jsonrpc/client.go

  1. retryCtx is derived from context.Background(), not the caller's ctx (client.go:108). This is intentional — the startup context passed to NewWSClient may be short-lived — but it means the goroutine is only ever stopped by Close(). If a caller drops the *Client without calling Close(), the goroutine leaks silently. This pattern is fine as long as callers reliably call Close() (and they appear to, but it's worth a comment). Fix this →

  2. tryUpgradeWS swaps Blob.Internal and Header.Internal under c.mu (client.go:158-159). This is safe because those struct fields are only mutated here. However, c.Blob and c.Header are public exported fields. If any caller caches a pointer to c.Blob.Internal.Submit (or similar) before the swap, they would call the stale HTTP function pointer after upgrade. This appears safe given current usage patterns, but is a subtle invariant.

  3. tryUpgradeWS calls wsClient.Close() inside c.mu.Lock() (client.go:153). This calls wsClient.mu.Lock() on a different object, so no deadlock, but it's worth noting that wsClient.Close() calls wsClient.closer() (the HTTP-to-WS connection), which closes network I/O while c.mu is held. This should be quick, but holding a lock across I/O is generally something to watch.

  4. No immediate retry on startup. retryWSLoop waits for the first tick (30 s) before trying. If the WS endpoint becomes available 5 s after startup, the node polls via HTTP for the full 30 s before upgrading. This is acceptable but could be improved with an initial attempt. Fix this →

block/internal/da/subscriber.go

  1. wg.Add(2) moved after cancel assignment, before goroutine launch — the ordering fix (moving s.lifecycleMu.Unlock() before wg.Add(2)) looks correct and avoids a race where Stop() could observe wg before both goroutines are counted.

  2. pollLoopfollowLoop transition: when SupportsSubscribe() becomes true, pollLoop does s.wg.Add(1) and starts followLoop, then returns (triggering its own wg.Done()). Net goroutine count stays correct. However, there is a brief window where pollDAHeight and followLoop could both be running concurrently (if the ticker fires just as the transition check passes). This is benign — updateHighest is CAS-based and both paths call signalCatchup safely — but worth knowing.

  3. pollDAHeight sets seenSubscriptionEvent = true on any new height. This correctly unblocks catchupLoop's head-reached detection for the polling path.

  4. pollLoop interval fallback: uses daBlockTime or defaults to 2 s if zero. The comment on Celestia's ~12 s block time means polling at 2 s could issue ~6 redundant RPC calls per DA block in the worst case. Consider defaulting to something closer to DA block time (e.g. 6–12 s) if no block time is configured. Not a blocker for a hotfix.

block/internal/da/interface.go / tracing.go

  1. The new SupportsSubscribe() method is correctly threaded through all implementations: client, tracedClient, MockClient, DummyDA. No gaps found.

test/testda/dummy.go

  1. DummyDA.SupportsSubscribe() always returns true (dummy.go:246). This means integration/e2e tests never exercise the polling path. Acceptable for a hotfix, but a follow-up test covering SupportsSubscribe() = false → pollLoop would increase confidence.

Verdict

The implementation is correct and addresses the original bug. The concurrency model (atomics + mutex, CAS in updateHighest, WaitGroup bookkeeping) is sound. The main risks are:

  • Goroutine leak if Close() is not called (low risk given existing patterns).
  • The 30 s cold start delay before first WS retry attempt (minor UX issue).

Both are non-blocking for a hotfix. Ready to merge once CI passes.

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 64e7ff28-e256-4413-9901-869401e7c86e

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch julien/hotfix-ws

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Caution

Review failed

An error occurred during the review process. Please try again later.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch julien/hotfix-ws

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov

codecov Bot commented Jun 19, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 9.47368% with 86 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.66%. Comparing base (d208cd0) to head (3906e02).

Files with missing lines Patch % Lines
pkg/da/jsonrpc/client.go 10.41% 41 Missing and 2 partials ⚠️
block/internal/da/subscriber.go 7.14% 38 Missing and 1 partial ⚠️
block/internal/da/client.go 33.33% 2 Missing ⚠️
block/internal/da/tracing.go 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           v1.1.x    #3362      +/-   ##
==========================================
- Coverage   60.98%   60.66%   -0.33%     
==========================================
  Files         127      127              
  Lines       13879    13969      +90     
==========================================
+ Hits         8464     8474      +10     
- Misses       4491     4567      +76     
- Partials      924      928       +4     
Flag Coverage Δ
combined 60.66% <9.47%> (-0.33%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant