Fix fah-client-bastet:#429 upgraded clients not visible to remote clients #210
Open
Justaphf wants to merge 1 commit into
Root cause: the outgoing websocket connection lost its only strong owner in `a271d5d`. The connection object is destroyed before the handshake completes, and the failure is silent because every async callback is a `WeakCall` and `Account::connect()` sets `STATE_CONNECTED` in a fire-and-forget manner.
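The lifetime problem can be reproduced in miniature. The sketch below is an illustration only, not code from this PR: it uses `std::shared_ptr` and `std::weak_ptr` as stand-ins for the library's `SmartPointer`, and `Conn`, `Loop`, `Socket`, `weakCall` and `simulate` are hypothetical names. It assumes a weak callback simply does nothing when its target is gone, which is the behaviour attributed to `WeakCall` here.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical stand-in for the connection object (not the real class).
struct Conn {
  bool connected = false;
};

// Hypothetical event loop whose callbacks hold their target only weakly.
struct Loop {
  std::vector<std::function<void()>> pending;

  // Queue a callback that runs only if `target` is still alive when the
  // loop gets to it; otherwise it is skipped without any error.
  void weakCall(const std::shared_ptr<Conn> &target,
                std::function<void(Conn &)> cb) {
    std::weak_ptr<Conn> weak = target;
    pending.push_back([weak, cb] {
      if (auto strong = weak.lock()) cb(*strong);
    });
  }

  void run() {
    for (auto &fn : pending) fn();
    pending.clear();
  }
};

// Hypothetical websocket with a weak `connection` member and an optional
// strong `outConn` owner for the outgoing case.
struct Socket {
  std::weak_ptr<Conn> connection;
  std::shared_ptr<Conn> outConn;
  int onConnectCalls = 0;

  void connect(Loop &loop, bool retain) {
    auto conn = std::make_shared<Conn>();  // plays the role of send()
    connection = conn;                     // weak: does not keep conn alive
    if (retain) outConn = conn;            // strong: keeps conn alive
    loop.weakCall(conn, [this](Conn &c) {
      c.connected = true;
      onConnectCalls++;
    });
  }  // the local strong pointer `conn` is destroyed here
};

// Returns how many times the connect callback fired.
int simulate(bool retain) {
  Loop loop;
  Socket ws;
  ws.connect(loop, retain);
  // Without a strong owner the weak member has already expired.
  assert(ws.connection.expired() == !retain);
  loop.run();
  return ws.onConnectCalls;
}
```

Under these assumptions `simulate(false)` reports zero callbacks and raises no error, matching the silent failure, while `simulate(true)` reports one.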
Summary
The change that made `Websocket::connection` a weak reference (commit `a271d5d`, "Fixed Websocket circular reference") is correct for incoming websockets, which are owned by `Event::Server` (`connections` holds each `Conn` strongly). But outgoing websockets, those created via `Websocket::connect()` → `HTTP::Client::send()`, had no other strong owner. The strong `SmartPointer<Conn>` returned by `send()` is assigned into the now weak `connection` member and the temporary is then destroyed, dropping the strong count to zero. The `ConnOut` is deallocated immediately, typically before the handshake (sometimes before DNS) completes.

Because every async continuation on that connection is a `WeakCall`, none of them fire once the object is gone: no `onConnect`, no `onResponse`, no `onClose`. The failure is completely silent: nothing appears in the logs.

Impact
This is the root cause of Folding@home's "remote machines can't see each other" regression (FoldingAtHome/fah-client-bastet#429). The client's outgoing websocket to the node server dies before it can register, so an affected client never appears as a remote machine to any other machine on the same account, while its local web UI and all ordinary HTTPS API traffic keep working (those connections are owned elsewhere: incoming Conns by the server, API calls by `PendingRequest`). Downgrading restores visibility instantly, which matches a pure client-side object-lifetime bug with no server state involved. The problem is platform-independent.

Fix
Give the outgoing case an owner: a strong `outConn` member on `Websocket`, set in `connect()` and released on shutdown and on connect failure. Incoming websockets are untouched and continue to rely on the server for ownership, so the original circular-reference fix is preserved.

Key points:

- `connect()` now retains the connection: `outConn = client.send(req)`, with `connection = outConn` kept as the existing weak member.
- `outConn`/`connection` are released before `onClose()` so a reconnect triggered from an `onClose` override cannot be clobbered. In `shutdown()`, the connection is promoted to a local strong pointer, the members are cleared, and only then is `close()` called, so re-entrant callbacks from `Conn::close()` observe a consistent "no connection" state while the local keeps the object alive through the call.
- `id` is now set in the client-side path (`id = outConn->getID()`), mirroring what `upgrade()` already does on the server side. Previously `getID()` returned `~0` for outgoing websockets because the id was only set on the incoming path.

Representative change (`Websocket.cpp`, `connect()` tail):

Testing
Built and deployed against the node server; an affected client now registers and appears as a remote machine on all other machines on the same account, across mixed-OS nodes. With the fix in place a healthy connect logs `Logging into node account` at startup (absent on a broken build), and a failed connect now reaches the close path and retries cleanly instead of failing silently.

A dedicated regression test (a localhost websocket round-trip that fails fast if the outgoing connection is dropped after `connect()`) is in review and will follow as a separate PR, together with a fix for line-ending normalization in the test harness needed for it to pass on native Windows.

Use of AI Disclosure
I worked on this with the help of Fabel 5. The PR description above is from that source. I have tested this on Windows 11 Pro and Ubuntu 24.04.4 (it has been up and running on all of my "production" machines for 3 weeks). The issue is resolved on both OSes, with no regressions that I can see. Before submitting I rebased on latest master and confirmed no issues on fresh builds on both OSes. There is no test coverage for WebSockets at all; I'm working on that in a follow-up PR, along with fixes for the test harness for native Windows builds. I do not have a Mac to test on, so Mac testing has not been done.