fix: suppress net.ErrClosed on concurrent Close#564
Conversation
07d6409 to
b76fd5d
Compare
Close and CloseNow returned a wrapped net.ErrClosed when the internal read loop or write path won the casClosing race before the user-facing call. The defer that suppresses net.ErrClosed was only registered on the CAS winner path. Separate user-facing close tracking from internal close coordination. A userClosed atomic tracks whether the user has called Close/CloseNow before. Combined with isClosed, this gives three distinct behaviors: - First user call, close in progress or peer closed: nil. - Concurrent user calls while close is in progress: nil, returns instantly without blocking. - User call after close has fully completed: net.ErrClosed.
b76fd5d to
e5aa423
Compare
|
/coder-agents-review |
|
Review posted | Chat Review history
deep-review v0.5.0 | Round 1 | Last posted: Round 1, 8 findings (1 P2, 3 P3, 2 Nit, 2 Note), COMMENT. Review Finding inventoryFindings
Contested and acknowledged(none) Round logRound 1Panel. 1 P2, 3 P3, 2 Nit, 2 Note new. 1 pre-existing Note dropped. Reviewed against d099e16..e5aa423. Panel: Bisky, Hisoka, Mafu-san, Mafuuu, Pariston, Ging-Go, Gon, Leorio, Takumi, Meruem, Killua, Chopper. About deep-reviewCRF = Coder Review Finding (P0-P4, Nit, Note)
|
There was a problem hiding this comment.
Clean diagnosis, proportional fix. The userClosed atomic correctly separates user intent from internal close coordination, and both Close and CloseNow are updated symmetrically. The commit message is the standout: it names the condition, the mechanism, and the fix, then enumerates all three behavioral states.
The dominant finding is the missing regression test. Nine of twelve reviewers flagged it independently. The race this PR fixes (internal read loop wins casClosing before the user calls Close) has no test coverage, and no existing test distinguishes the old behavior (wrapped net.ErrClosed) from the new behavior (nil). A bug fix without a test for the bug is incomplete work.
Two doc comments need attention: Close's comment claims repeated calls are "no-ops" (they return net.ErrClosed post-close), and CloseNow's comment is silent on the topic entirely.
Severity breakdown: 1 P2, 3 P3, 2 Nit, 2 Note.
"I tried to build a simpler alternative using only isClosed() as the early guard, but it returns net.ErrClosed on a first user call when the internal close already completed, which is wrong from the user's perspective: their first Close should succeed regardless of internal state. The userClosed separation is the minimal mechanism that preserves that invariant." (Pariston)
close.go:92
P3 [CRF-1] Doc comment claims "Additional calls to Close are no-ops" but the new behavior returns net.ErrClosed after the connection is fully closed. The three behaviors in the commit message (first call nil, concurrent calls nil, post-close net.ErrClosed) are the actual contract; the doc should match. (Netero, Pariston)
🤖
close.go:130
P3 [CRF-4] CloseNow's doc comment says nothing about repeated or concurrent call behavior. The three behaviors in the commit message apply to CloseNow identically, but someone writing defer c.CloseNow() that might race with another goroutine has no way to know from the godoc what happens. (Leorio)
🤖
🤖 This review was automatically generated with Coder Agents.
| if err != nil { | ||
| return err | ||
| } | ||
| if c.userClosed.Swap(true) && c.isClosed() { |
There was a problem hiding this comment.
P2 [CRF-3] No regression test for the race condition this PR fixes. The bug: when the internal read loop (read.go:361) or write path (write.go:365) wins casClosing() before the user calls Close, the old code returned a wrapped net.ErrClosed. No existing test exercises this interleaving.
The existing
TestConcurrentClosePingexercises concurrent ping+close but not any close+close path.TestConnClosePropagationexercises close propagation but not the return value contract for the caller that loses the casClosing race. (Mafuuu)
I verified this by reverting the PR changes and running all close-related tests 10 times with the race detector on the old code. All passed. (Pariston)
A test that reproduces this: set up a connection with CloseRead, trigger the peer to send a close frame so the local read loop wins casClosing, then call Close from user code and assert nil return. (Bisky P1, Hisoka P2, Mafu-san P2, Mafuuu P2, Meruem P2, Chopper P2, Pariston P3, Leorio P3, Takumi P3)
🤖
| } | ||
| }() | ||
|
|
||
| if c.casClosing() { |
There was a problem hiding this comment.
Nit [CRF-5] The PR description claims concurrent user calls "return instantly without blocking." In practice, a concurrent caller reaches casClosing() -> true, enters waitGoroutines(), and blocks on c.closeReadDone and c.closed with a 15-second timer. The blocking is correct behavior (caller gets confirmation that close completed), but the description overstates it. (Pariston, Mafuuu, Takumi)
🤖
| closing atomic.Bool | ||
| closeMu sync.Mutex // Protects following. | ||
| closed chan struct{} | ||
| userClosed atomic.Bool // Set by Close/CloseNow on first user call. |
There was a problem hiding this comment.
P3 [CRF-6] userClosed reads as completion alongside closed, but it tracks initiation. The struct now has three close fields in near-identical tenses (userClosed, closing, closed) where the first means "called" and the last means "done." Also, CloseRead's goroutine (read.go:82) calls the public c.Close(), setting userClosed from internal library code. closeRequested would disambiguate stage and scope without reading the body. (Gon P3, Hisoka Note, Chopper Note)
🤖
| closing atomic.Bool | ||
| closeMu sync.Mutex // Protects following. | ||
| closed chan struct{} | ||
| userClosed atomic.Bool // Set by Close/CloseNow on first user call. |
There was a problem hiding this comment.
Nit [CRF-7] Comment "on first user call" restates what the name and type already show. The user is in the name, first call is derivable from atomic.Bool set once. Trim to // Set by Close/CloseNow. or // Tracks whether the user has called Close/CloseNow. (Gon, Leorio)
🤖
| closing atomic.Bool | ||
| closeMu sync.Mutex // Protects following. | ||
| closed chan struct{} | ||
| userClosed atomic.Bool // Set by Close/CloseNow on first user call. |
There was a problem hiding this comment.
Note [CRF-8] Close state is now tracked across three independent boolean-ish variables (userClosed, closing, closed channel) forming an implicit state machine with ~5 reachable states and undocumented transitions. The current approach works correctly and is proportionate for a bugfix, but the next person who modifies close coordination must understand all three in combination. (Meruem)
🤖
| if err != nil { | ||
| return err | ||
| } | ||
| if c.userClosed.Swap(true) && c.isClosed() { |
There was a problem hiding this comment.
Note [CRF-9] The userClosed.Swap(true) && c.isClosed() check is not atomic (the connection could finish closing between the swap and the channel probe), but the gap is benign. A second user call that falls through instead of returning net.ErrClosed immediately reaches casClosing() -> waitGoroutines(), the defer suppresses net.ErrClosed, and the caller gets nil. The system converges to the correct state on both sides of the gap. (Takumi)
🤖
Close and CloseNow returned a wrapped
net.ErrClosedwhen the internal read loop or write path won thecasClosingrace before the user-facing call. The defer that suppressesnet.ErrClosedwas only registered on the CAS winner path, so the first user call to Close could surface "use of closed network connection" even though the close completed successfully.This separates user-facing close tracking (
userClosedatomic) from internal close coordination (casClosing). Combined withisClosed, this gives three distinct behaviors:nil.nil, returns instantly without blocking.net.ErrClosed.Refs coder/internal#1541