Fix fah-client-bastet:#429 upgraded clients not visible to remote clients #210

Open
Justaphf wants to merge 1 commit into CauldronDevelopmentLLC:master from Justaphf:feature/#429

@Justaphf

Contributor

Summary

The change that made Websocket::connection a weak reference (commit a271d5d, "Fixed Websocket circular reference") is correct for incoming websockets, which are owned by Event::Server (connections holds each Conn strongly). But outgoing websockets, those created via Websocket::connect() → HTTP::Client::send(), had no other strong owner. The strong SmartPointer<Conn> returned by send() is assigned into the now-weak connection member and the temporary is then destroyed, dropping the strong count to zero. The ConnOut is deallocated immediately, typically before the handshake (sometimes before DNS) completes.
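The lifetime problem can be reproduced in miniature with standard smart pointers. The following is only a sketch: it uses std::shared_ptr and std::weak_ptr as stand-ins for cbang's SmartPointer, and Conn, send(), WeakOnlySocket and OwningSocket are hypothetical names invented for illustration, not the library's types.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the outgoing connection object.
struct Conn { int id = 7; };

// Hypothetical stand-in for HTTP::Client::send(): the returned pointer is
// the only strong reference to the new connection.
std::shared_ptr<Conn> send() { return std::make_shared<Conn>(); }

// Broken pattern: only a weak member tracks the connection.
struct WeakOnlySocket {
  std::weak_ptr<Conn> connection;

  // The temporary returned by send() is destroyed at the end of the
  // statement, so the strong count drops to zero immediately.
  void connect() { connection = send(); }

  bool alive() const { return !connection.expired(); }
};

// Fixed pattern: a strong member owns the outgoing connection.
struct OwningSocket {
  std::shared_ptr<Conn> outConn;   // strong owner
  std::weak_ptr<Conn> connection;  // existing weak member, unchanged

  void connect() {
    outConn = send();
    connection = outConn;
  }

  bool alive() const { return !connection.expired(); }
};
```

Calling connect() on a WeakOnlySocket leaves alive() false straight away, whereas an OwningSocket keeps its connection alive for as long as outConn is held.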

Because every async continuation on that connection is a WeakCall, none of them fire once the object is gone: no onConnect, no onResponse, no onClose. The failure is completely silent — nothing in the logs.
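Why the failure is silent can be illustrated with a weak callback. This is a rough analogue only, assuming a WeakCall behaves like "invoke the method if the target still exists, otherwise do nothing". Handler and makeWeakCall are hypothetical names, and the code is not cbang's WeakCall implementation.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical callback target.
struct Handler {
  int connects = 0;
  void onConnect() { ++connects; }
};

// Hypothetical analogue of a weak call: the continuation holds only a weak
// reference and invokes the method only if the target is still alive.
std::function<bool()> makeWeakCall(const std::shared_ptr<Handler> &target) {
  std::weak_ptr<Handler> weak = target;
  return [weak]() {
    if (auto strong = weak.lock()) {
      strong->onConnect();
      return true;
    }
    return false; // target is gone: skipped without any error or log output
  };
}
```

Once the last strong reference to the target is released, the callback returns without doing anything, which is the same shape as onConnect, onResponse and onClose never firing.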

Impact

This is the root cause of Folding@home's "remote machines can't see each other" regression (FoldingAtHome/fah-client-bastet#429). The client's outgoing websocket to the node server dies before it can register, so an affected client never appears as a remote machine to any other machine on the same account, while its local web UI and all ordinary HTTPS API traffic keep working (those connections are owned elsewhere — incoming Conns by the server, API calls by PendingRequest). Downgrading restores visibility instantly, which matches a pure client-side object-lifetime bug with no server state involved. The problem is platform-independent.

Fix

Give the outgoing case an owner: a strong outConn member on Websocket, set in connect() and released on shutdown and on connect failure. Incoming websockets are untouched and continue to rely on the server for ownership, so the original circular-reference fix is preserved.

Key points:

  • connect() now retains the connection: outConn = client.send(req) with connection = outConn kept as the existing weak member.
  • Release order matters in both teardown paths. In the failure branch, outConn/connection are released before onClose() so a reconnect triggered from an onClose override cannot be clobbered. In shutdown(), the connection is promoted to a local strong pointer, the members are cleared, and only then is close() called — so re-entrant callbacks from Conn::close() observe a consistent "no connection" state while the local keeps the object alive through the call.
  • Also caches the connection id in the client-side path (id = outConn->getID()), mirroring what upgrade() already does on the server side. Previously getID() returned ~0 for outgoing websockets because the id was only set on the incoming path.

Representative change (Websocket.cpp, connect() tail):

  // Hold a strong reference to the outgoing connection; other refs are weak
  outConn = client.send(req);
  connection = outConn;
  // Cache ID like upgrade() does, since the weak ref may not outlive us
  id = outConn->getID();
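The shutdown() release order described in the key points can be sketched as follows. This is a simplified model under stated assumptions, not the actual Websocket.cpp code: std::shared_ptr stands in for SmartPointer, and the Conn and Socket types here, along with the onClose hook and hasConnection(), are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical connection whose close() may call back re-entrantly.
struct Conn {
  std::function<void()> onClose;
  bool closed = false;

  void close() {
    closed = true;
    if (onClose) onClose();
  }
};

// Hypothetical websocket holding a strong owner and a weak member.
struct Socket {
  std::shared_ptr<Conn> outConn;
  std::weak_ptr<Conn> connection;

  bool hasConnection() const { return !connection.expired(); }

  void connect() {
    outConn = std::make_shared<Conn>();
    connection = outConn;
  }

  void shutdown() {
    // 1. Promote to a local strong pointer so the object survives close().
    std::shared_ptr<Conn> conn = connection.lock();
    // 2. Clear the members first, so callbacks see "no connection".
    outConn.reset();
    connection.reset();
    // 3. Only then close; the local keeps the object alive through the call.
    if (conn) conn->close();
  }
};
```

With this ordering, a callback fired from inside close() sees hasConnection() return false, and the Conn is destroyed only when the local pointer goes out of scope at the end of shutdown().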

Testing

Built and deployed against the node server; an affected client now registers and appears as a remote machine on all other machines on the same account, across mixed-OS nodes. With the fix in place a healthy connect logs "Logging into node account" at startup (absent on a broken build), and a failed connect now reaches the close path and retries cleanly instead of failing silently.

A dedicated regression test (a localhost websocket round-trip that fails fast if the outgoing connection is dropped after connect()) is in review and will follow as a separate PR, together with a fix for line-ending normalization in the test harness needed for it to pass on native Windows.

Use of AI Disclosure

I worked on this with the help of Fabel 5. The PR description above is from that source. I have tested this on Windows 11 Pro and Ubuntu 24.04.4 (it has been up and running on all of my "production" machines for 3 weeks now). The issue is resolved on both OSes, with no regressions that I can see. Before submitting, I rebased on the latest master and confirmed no issues on fresh builds on both OSes. There is no test coverage for WebSockets at all; I'm working on that in a follow-up PR, along with fixes for the test harness for native Windows builds. I do not have a Mac, so Mac testing has not been done.

Root cause was that the outgoing websocket connection lost its only strong owner in `a271d5d`. The weak reference dies before the handshake, silently, because every async callback is a WeakCall and `Account::connect()` sets `STATE_CONNECTED` in a fire-and-forget manner.