Skip to content

feat(platform-wallet)!: shared ThreadRegistry for coordinator lifecycle + shutdown UAF/data-loss fixes#3954

Open
Claudius-Maginificent wants to merge 330 commits into
v4.1-devfrom
feat/platform-wallet-shutdown-join
Open

feat(platform-wallet)!: shared ThreadRegistry for coordinator lifecycle + shutdown UAF/data-loss fixes#3954
Claudius-Maginificent wants to merge 330 commits into
v4.1-devfrom
feat/platform-wallet-shutdown-join

Conversation

@Claudius-Maginificent

@Claudius-Maginificent Claudius-Maginificent commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Why this PR exists

What was done

  • rs-dash-async: ThreadRegistry, atomic flag guards, ShutdownReport / WorkerStatus.
  • rs-platform-wallet: CoordinatorLifecycle, registry-driven event-adapter task, shutdown() with weighted join + detached-orphan reporting; ShieldedShutdownIncomplete error + FFI code 19.

Testing

cargo fmt --all --check; cargo clippy --workspace --all-targets -- -D warnings; cargo test -p platform-wallet -p dash-async (dash-async 38; platform-wallet 322 with --all-features; 47 shutdown/registry tests incl. shutdown_joins_all_workers, shutdown_waits_for_in_flight_pass_to_drain, restart_in_place). The 4 Swift files need a separate iOS build check.

Breaking changes

! — adds an FFI error code (19) and a shutdown error variant.

Checklist

  • Self-review performed
  • Tests added/updated

🤖 Co-authored by Claudius the Magnificent AI Agent

lklimek and others added 30 commits May 22, 2026 17:08
…entity-key writes (CMT-005/004)

platform_addrs::apply rejects an entry whose wallet_id differs from the
flush scope with the typed WalletIdMismatch before the INSERT (the native
FK now also backstops it, but the typed error pinpoints the mismatch).

identity_keys::apply rejects an upsert whose entry (identity_id, key_id,
wallet_id) disagrees with its map key / flush scope (IdentityKeyEntryMismatch),
so the typed columns and the serialized blob can never describe different
rows on disk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…peek errors (CMT-007/009/010)

Remove the delete-wallet subcommand (CLI surface only; the library
delete_wallet API stays). peek_schema_version now returns Result and
distinguishes a fresh DB (no history table -> Ok(None)) from a transient
open/query failure (Err), so `migrate` can no longer print a wrong
`applied: N` from a swallowed error. run_backup defers refuse-to-overwrite
to backup_to's typed BackupDestinationExists instead of a duplicate
is_file() guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…i-account, overflow guards)

New tests/sqlite_hardening_3625.rs proves: native FK rejects an orphan
child and an identity_keys row with no identities parent; mixed-wallet
platform-address write fails typed (CMT-005); identity-key entry/key-vs
-blob mismatch is rejected (CMT-004); multi-account UTXOs bucket to their
real account and an undeclared address defaults to 0 (CMT-003/011); the
birth_height and sync_height overflow cases error rather than truncate
(CMT-012/014); a compaction-marker-only wallet survives load (CMT-002).

Update existing tests for the rewritten schema: TC-027 derived-address
insert carries account_index; the error-classification exhaustiveness
table covers the three new variants; TC-072 asserts the delete-wallet
subcommand is gone; FK test header drops the trigger wording. A conn.rs
unit test exercises the foreign-key read-back assertion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ope conn doc (QA-001/004)

TC-048 now explicitly asserts the UTXO row survives the transaction
delete (COUNT==1) with wallet_id/value/account_index preserved, in
addition to spent_in_txid being NULL — so a future change turning the
single-column trigger into a cascading DELETE fails loudly instead of
passing implicitly.

Scope the conn.rs choke-point doc to library connection paths and note
the CLI's read-only peek_schema_version probe opens directly (no
mutations, and open_conn is pub(crate) so the separate bin target can't
reach it).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ency

CMT-003's pure-SQL account_index lookup removed the last reference to
key-wallet-manager; cargo machete (macOS CI) flagged it as unused. No
`.rs` file references key_wallet_manager under any cfg/feature, and the
optional dep carried no extra feature activation, so removal is clean —
not papered over with a machete ignore. Drop the dependency line and its
`dep:key-wallet-manager` wiring from the sqlite feature.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cOS path symmetry

`backup_to` canonicalizes its returned path; on macOS the temp dir lives
under `/var` (symlink to `/private/var`), so the canonical `written`
path no longer prefixes the un-canonicalized `out_dir` and the
`starts_with` assertion failed (Linux has no such symlink, hence
green locally). Canonicalize the expected dir before the check,
mirroring TC-032. TC-031 is the only affected assertion — TC-032 was
already symmetric, and TC-035 only feeds the canonical path back into
restore_from without comparing it to a temp path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…L/SHM perms (CMT-001/003/004)

CMT-001 (BLOCKING data-loss): restore_from() unlinked <dest>-wal /
-shm BEFORE staging + integrity / schema / max-version checks. A failed
gate left the main DB intact (NamedTempFile drop) but the sidecars were
gone — un-checkpointed pages lost. Move the unlink to immediately
BEFORE tmp.persist() so both succeed atomically or neither runs.

CMT-003 (HIGH security): the create_new() call in run_to() created
backup files at the umask-default mode before apply_secure_permissions
chmodded them to 0o600. Set OpenOptionsExt::mode(0o600) under cfg(unix)
so the file is never world/group readable, even briefly. Keep the
post-open chmod as defense-in-depth (Windows no-op, re-tightens).

CMT-004 (HIGH security): apply_secure_permissions only chmodded the
main DB; WAL/SHM siblings inherited the process umask. Extend the same
helper to chmod <path>-wal / <path>-shm to 0o600 when present (silent
skip when absent, Windows no-op). One function, idempotent — matches
the user preference for a single API surface.

Regressions:
- tests/sqlite_restore_staged_validation::rejected_restore_leaves_wal_shm_siblings_intact
- tests/sqlite_permissions::wal_and_shm_sidecars_are_chmodded_0o600

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…ore_from (CMT-010)

restore_from() was streaming the entire source DB into a NamedTempFile
in the destination's parent dir BEFORE running the schema-history /
max-version gate. A caller passing a valid-but-huge incompatible DB
(CLI / FFI / UI import) filled the wallet partition before being
rejected.

Add cheap source-side schema-history presence + MAX(version) checks
against the source handle that was already opened for integrity_check
— failing fast saves the disk. Keep the staged-copy gate as the
TOCTOU-safe final check: a concurrent swap during the restore window
still gets rejected because the second check binds to the bytes
actually persisted.

Factored shared logic into migrations::max_supported_version() +
migrations::assert_schema_version_supported() so the open()-path gate
(CMT-005, next commit) reuses the same code.

Regression: tests/sqlite_restore_staged_validation::forward_version_rejected_before_staging

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
… failure (CMT-002 data-loss)

delete_wallet_inner drained the per-wallet buffered changeset BEFORE
the existence check / auto-backup / DELETE transaction, but never
restored it on any pre-commit failure. The operator's delete correctly
reported the backup or SQL error, yet the wallet's pending writes
vanished from the buffer even though no DELETE actually committed.

Mirror flush_inner's take/restore discipline: hold the drained
changeset in a Cell<Option<...>>, on success consume it AFTER
tx.commit(), on any error path restore it into the buffer via
self.buffer.restore(wallet_id, cs). If the restore itself fails (e.g.
poisoned buffer mutex), log the lost changeset at error level — the
delete error still surfaces (it's the primary cause).

Regression: tests/sqlite_delete_wallet::delete_wallet_restores_buffer_on_backup_failure
points auto_backup_dir at an unwritable path, primes a buffered
changeset, asserts (a) delete returns AutoBackupDirUnwritable and
(b) the buffered writes survive — a follow-up flush lands them.

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…ic with restore_from (CMT-005)

SqlitePersister::open ran the migration probe (count_pending) and
refinery::run, but never compared the on-disk MAX(version) against the
highest embedded migration. An older binary opening a DB produced by a
newer release saw pending_count == 0 and silently proceeded to decode
forward-schema blob columns — the migration version IS the entire
compatibility boundary for this crate (per blob.rs:4-6).

Add the same SchemaVersionUnsupported gate restore_from enforces,
calling the shared migrations::assert_schema_version_supported helper
introduced in the CMT-010 commit. Sites are now symmetric.

Regression: tests/sqlite_open_version_gate::open_rejects_forward_schema_version
forges a version=1_000_000 row, asserts the second open fails with
SchemaVersionUnsupported.

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…ent stores (CMT-008)

delete_wallet_inner previously released and re-acquired the connection
Mutex three times (existence-check / auto-backup / DELETE-tx). A
parallel store(wallet_id, cs) could slip in between any pair of those
scopes — in Immediate mode the racing flush landed AFTER the DELETE
commit (wallet reappears), in Manual mode a buffered store survived
the delete and was persisted by the next commit_writes() against a
freshly-recreated parent.

Approach (a): acquire self.conn()? ONCE at the top of
delete_wallet_inner and hold it across the full pipeline so Immediate
stores block on flush_inner's own self.conn(). For Manual mode (where
store doesn't take conn at all), do a post-commit second drain of the
buffer to discard any racing-buffered changeset — the wallet is being
removed, those writes are intentionally void. run_auto_backup already
accepted &Connection so no signature changes were required.

Regression: tests/sqlite_delete_wallet::concurrent_store_does_not_resurrect_deleted_wallet
spawns a worker hammering store() while the main thread deletes,
commits any remaining writes, asserts every PER_WALLET_TABLES row
count for that wallet_id is zero.

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…e / typed AssetLockEntryMismatch (CMT-006/007)

CMT-006: blob::decode used bincode's NoLimit config, so an attacker-
controlled length prefix inside a Vec/String/map could drive the
allocator BEFORE the trailing-byte gate ever fired — a crafted backup
that passed restore_from's integrity check could OOM the host on the
next load(). Switch encode + decode to .with_limit::<16 MiB>() (well
above any legitimate per-row payload) and surface the cap-exceeded
case as a NEW typed variant `WalletStorageError::BlobTooLarge {
len_bytes, limit_bytes }` so operators can distinguish hostile/
corrupt oversize from a structural decode failure. Generic decode
failures (trailing bytes, schema mismatch) still surface as the
existing `BlobDecode` per user preference for precise variants.

CMT-007: asset_locks::decode_row decoded the typed outpoint from
op_bytes but then constructed TrackedAssetLock using the BLOB's
out_point + account_index. A torn write or restored corruption that
diverged blob-vs-typed would silently mis-bucket the lock under the
wrong account. Add a cross-check symmetric with IdentityKeyEntryMismatch
under a NEW typed variant `WalletStorageError::AssetLockEntryMismatch {
typed_outpoint, blob_outpoint, typed_account_index, blob_account_index }`.

is_transient() + error_kind_str() updated for both new variants — the
match remains wildcard-free per the existing #[non_exhaustive]-rejection
note in error.rs.

Regressions:
- src/sqlite/schema/blob.rs::tests::decode_rejects_oversize_blob_with_blob_too_large
- tests/sqlite_hardening_3625::asset_lock_typed_vs_blob_mismatch_rejected

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…inner blob (CMT-009)

IdentityKeyWire::into_entry decoded public_key_bincode with
bincode::decode_from_slice but bound the consumed count to `_`. A
corrupt or forward-incompatible inner payload whose IdentityPublicKey
prefix is valid would deserialize successfully even with garbage past
the typed length — leaving identity_keys weaker than every other blob
path. The outer blob::decode already enforces the same gate.

Capture consumed and surface BlobDecode("unexpected trailing bytes in
identity_keys.public_key_bincode") on mismatch (using the existing
blob_decode helper per user preference — trailing-bytes is the
canonical BlobDecode case, not a typed-variant case).

Regression: src/sqlite/schema/identity_keys::tests::into_entry_rejects_trailing_bytes_in_public_key_bincode

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…te (CMT-011)

list_active was byte-for-byte identical to load_state — same SQL,
same decode loop. Per the docstring's own admission, every persisted
status is considered active (consumed locks leave via
AssetLockChangeSet::removed), so the missing WHERE clause was
intentional but made the two functions semantically indistinguishable.

Option (a) per user preference: drop list_active, migrate the one
caller (tests/sqlite_persist_roundtrip.rs:377) to load_state, refresh
load_state's docstring to subsume the consumed-locks comment, and
verify zero remaining references with grep.

Also patches the wildcard-free match in tests/sqlite_error_classification.rs
to cover the new AssetLockEntryMismatch / BlobTooLarge variants added
in the prior CMT-006/007 commit (the exhaustiveness gate is what
this test exists to enforce).

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…MT-012)

The smoke test previously accepted any non-success via
assert!(!out.status.success(), ...) so a runtime failure unrelated to
unknown-subcommand handling (panic during clap setup, OOM, signal
kill) would still pass. Assert the explicit clap usage code Some(2)
before checking stderr content — costs nothing, tightens the contract.

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
- sqlite_delete_wallet.rs: drop a useless std::fs::Permissions::from(...).
- sqlite_permissions.rs: silence clippy::field_reassign_with_default
  (consistent with the existing test files' top-level allow).

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…atch_only

load_from_persistor now rebuilds every persisted wallet watch-only from
its keyless AccountRegistrationEntry manifest (Wallet::new_watch_only)
and applies the keyless core-state projection on top. No seed material
is touched on the load path: signing keys are derived on demand later
through the MnemonicResolverHandle sign entrypoints, which carry the
fail-closed wrong-seed gate themselves.

Drops the SeedProvider port + WalletSecret/SecretPhrase/SecretSeed
payloads (and the storage CredentialStoreSeedProvider adapter that fed
them) — load no longer needs the abstraction. WrongSeedForDatabase
stays on PlatformWalletError for the sign-path gate. RT suite reshapes
to RT-WO (watch-only round-trip) + RT-Corrupt (per-row decode skip with
SkipReason::CorruptPersistedRow{kind: CorruptKind::MissingManifest}) +
RT-Z (no key material in any LoadOutcome / SkipReason surface).

apply_persisted_core_state and its F2/F3/F4 fixes are unchanged.

AR-7 residual risk on the load path is eliminated (no Wallet of a
signing type is constructed during load, so its Debug-leak surface is
gone from this path).

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…stor

platform_wallet_manager_load_from_persistor is now a 2-arg call
(manager_handle, out_outcome). The Swift host never passed a real
resolver at load time anyway — load is watch-only, signing keys are
derived later on demand through the same MnemonicResolverHandle vtable
the per-call sign entrypoints already use (next commit lands the
wallet_id gate there).

Drops the MnemonicResolverHandle → platform_wallet::SeedProvider
adapter (rehydration_seed_provider.rs); no consumer left.

LoadOutcomeFFI.SkippedWalletFFI.reason_code reshapes to the new
CorruptKind family (100/101/102) — ABI-bump for #3692 since the seed-
availability codes (0/1/2) it previously carried are gone with the
seedless load path.

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…edless FFI

platform_wallet_manager_load_from_persistor is now (handle,
out_outcome) — the resolver argument is gone with the seedless-load
rework. Pass nil for out_outcome (Swift doesn't surface skipped
wallets yet; corrupt rows are logged on the Rust side).

Doc string refreshed to reflect Wallet::new_watch_only as the
underlying load primitive and the on-demand-signing + wrong-seed-gate
contract on the resolver-fed sign entrypoints.

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
… sign-gate split

The wrong-seed detection moves off the load path and onto the resolver-fed
FFI sign entrypoints. That gate + its coverage now ships in PR #3735
(security patch against v3.1-dev), not here. Drop the dangling reference
to the never-existed `sign_wrong_seed_gate.rs` file and point readers at
PR #3735 instead.

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
… staging (ATOM-004 / A-1)

The pre-A-1 path eagerly created an empty SQLite file at `dest` via
`OpenOptions::create_new`, so any later failure (chmod, Backup::new,
run_to_completion) stranded a partial/empty .db at the caller-supplied
path. For auto-backup callers that's retention-dir pollution; for the
CLI's manual `backup_to(file_path)` it's junk at a user path with no
cleanup.

Stage the page-step copy into a `NamedTempFile` in `dest`'s parent
(same pattern `restore_from` already uses). chmod 600 the temp BEFORE
persist so the destination inherits owner-only mode via the atomic
rename — no umask-default window. The temp drops naturally on any
error before persist.

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…od temp before persist (ATOM-005/010 / A-2 / A-5)

A-2: the try-lock arm probed the destination lock, dropped the handle
immediately, then ran stages 4-7 (WAL/SHM unlink, stage, validate,
persist) unlocked. A concurrent opener between probe and persist could
race the rename and end up holding a fd against the unlinked old
inode. Hold the exclusive lock guard across the entire restore body;
emit a structured `tracing::warn` on filesystems where flock is
unsupported instead of silently bypassing the safety net.

A-5: chmod the temp file to 0o600 BEFORE persist so the destination
inherits owner-only mode via atomic rename. Pre-A-5 the chmod ran
post-persist — a rare chmod failure left the new DB live while the
function returned Err, giving the caller a mixed result.

N-6: lifts the atomicity-guarantee paragraph from inline comments
into the public rustdoc of `restore_from`.

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…per-wallet errors (N-1)

The old `commit_writes` iterated dirty wallets with `?` — the first
per-wallet failure aborted the loop. Wallets ordered after stayed
buffered with no caller-visible signal of which ones, leading to
silent data left in memory after the operator thinks the commit ran.

Approach (a) from the spec: return `Result<CommitReport, _>` carrying
`succeeded` / `failed` / `still_pending` per-wallet outcomes. The
trait `PlatformWalletPersistence` is untouched — `commit_writes` is a
`SqlitePersister`-specific method (not in the trait, no FFI consumer)
so the change is local to in-crate callers.

A `LockPoisoned` short-circuit on the connection mutex still bails
early, shoveling the remaining wallets into `still_pending` rather
than attempting flushes that would all hit the same dead mutex.

Includes regression test ATOM-006 that primes 3 dirty wallets, forces
A to fail fatally, and verifies B + C still flush and the report
captures the partition correctly.

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…Manual mode (N-2 / ATOM-007)

A `SqlitePersister` dropped in `FlushMode::Manual` silently destroyed
every dirty wallet's buffered changeset — the trait has no
`commit_writes` and the buffer dies with the persister. The footgun
surfaced nowhere in the operator logs.

Add `impl Drop` that emits `tracing::error!` with `dirty_wallets` and
`total_fields` structured fields when dropped Manual-mode with a
non-empty buffer. Do NOT auto-flush — flushing can fail, Drop can't
propagate, and a swallow on flush would be a worse failure mode than
the loud log. Immediate-mode persisters skip the check entirely
(every `store` is durable, so the buffer is always empty at drop).

Regression tests assert both the Manual-mode log emits AND the
Immediate-mode path stays silent.

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…SQLite codes (ATOM-008 / A-4)

Pre-A-4 only `DatabaseBusy` / `DatabaseLocked` were classified as
transient. Operational disk-full (`SQLITE_FULL`), I/O failures
(`SQLITE_IOERR`), and memory pressure (`SQLITE_NOMEM`) are also
recoverable environmental conditions — the operator clears disk /
the kernel finishes its blip / memory frees up — and treating them as
fatal drops buffered state for a problem that resolves itself.

Add `DiskFull`, `SystemIoFailure`, `OutOfMemory` to the transient
set. Extend `error_kind_str` with matching tags
(`sqlite_disk_full` / `sqlite_io_failure` / `sqlite_out_of_memory`)
so operators can grep production logs. Outer match on
`WalletStorageError` stays wildcard-free; inner `ErrorCode` match
uses `_` because the upstream type is `#[non_exhaustive]`. Updates
the classification test with one sample per new code.

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…ors (ATOM-011 / A-6)

Pre-A-6 the loop did `remove_file(&path)?` so the 4th of 7 files
failing left 1-3 gone, 4 errored, 5-7 untouched, and the caller saw an
io::Error with no record of partial progress. The `INTENTIONAL(CODE-007)`
comment acknowledged the limitation; we now address it.

Add `PruneReport::failed_removals: Vec<(PathBuf, io::Error)>` and
collect per-file failures instead of bailing. Catastrophic errors
(`read_dir` itself fails) still surface as `Err`. CLI surfaces failed
removals to stderr and exits non-zero so scripts catch partial
success.

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…pen (ATOM-013 / A-8)

Pre-A-8 `open()` ran migrations against the live DB without first
asserting integrity. Bit-rot or escaped-WAL corruption gets migrated,
and the pre-migration auto-backup snapshots the corrupt state —
making rollback useless.

Add a `PRAGMA integrity_check` between `apply_pragmas` and the
migration run, gated on `had_schema_history` (a freshly-created file
has no btree pages to verify). Surfaces as the typed
`IntegrityCheckFailed { report }` (variant already existed for the
restore_from path) before any schema mutation lands. Promotes
`run_integrity_check` to `pub(crate)` so the open path shares the
same helper as restore_from.

Includes regression test that opens a fresh DB, mutates btree pages
past the SQLite header, closes, and re-opens — assertion that the
typed `IntegrityCheckFailed` surfaces.

Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…lic quiesce against in-flight clear (thepastaclaw 3480216821/3480287466)

Two remaining thepastaclaw findings on the 5d569459 per-key clearing
latch addressed together.

## (1) ClearingGuard set-membership → refcount

Backing store changes from `Mutex<BTreeSet<K>>` to
`Mutex<BTreeMap<K, NonZeroUsize>>`. `hold_clearing` now increments a
per-key holder count (or inserts `NonZeroUsize::new(1)`); the guard's
`Drop` decrements and removes the entry only when the count reaches
zero.

Why: nested or concurrent holders for the same key previously released
the latch on the FIRST drop (set semantics — `insert` of an existing
key was a no-op, `remove` unconditionally removed). Two concurrent
`shielded_clear` invocations on the same handle (reachable through
`HandleStorage::with_item`'s read lock) would silently lapse one
clear's protection when the other's guard dropped. Refcount composes
the holders so the latch stays raised until the LAST guard drops.

The previous test `hold_clearing_nested_guards_share_the_latch`
documented the lapse via comment but never asserted the lapse —
thepastaclaw correctly flagged that it would pass under any
drop-only implementation. Renamed to
`hold_clearing_inner_drop_does_not_lapse_outer_protection` and rewritten
to assert non-vacuously: with both guards alive a start is refused;
after `drop(inner)` while `outer` is still alive, `is_clearing` MUST
still return true AND a fresh `start_clean` MUST still be refused; only
after `drop(outer)` does the latch fully release and a start succeed.
Pre-refactor this test fails on both the `is_clearing` assertion and
the post-inner-drop `is_running` assertion.

`Drop` guards against an underflow (debug_assert) if the entry is
missing — defensive against out-of-band removal that no current call
site does, but cheap to encode.

## (2) Public `quiesce()` gated against the clearing latch

`CoordinatorLifecycle::quiesce` constructs a local `AtomicFlagGuard` on
the same `quiescing` atomic that `clear_shielded_inner` holds raised
continuously via `hold_quiescing_gate`. The guard's `Drop` would
unconditionally lower the gate — lapsing the clear's "no new pass"
barrier and letting a direct `sync_now`/`sync_wallet` slip past
`begin_pass` to re-persist into the wiping store.

Mirror the `spawn_periodic_loop` fix from 432309b: bail at the
start of `quiesce` if `registry.is_clearing(self.worker)` returns
true. Return `WorkerStatus::NotRunning` — the in-flight clear is
already cancel-joining the worker under its own gate, so there is no
new work for `quiesce` to do. Today only `ShieldedSyncManager::quiesce`
is reachable via `shielded_sync_arc`, but the carry-forward concern is
that the trap is one accidental host call away.

Regression test `shielded_quiesce_during_clear_preserves_quiescing_gate`
raises both guards `clear_shielded_inner` raises (latch + gate), calls
`shielded_sync_manager.quiesce()`, asserts the call returns
`NotRunning` AND the gate stays raised. Pre-fix this fails on the gate
assertion.

Verified post-fix:
- cargo fmt --all --check clean.
- cargo clippy --workspace --features shielded --all-targets
  -- -D warnings clean.
- dash-async lib: 35 passed (renamed test stays in place).
- platform-wallet --features shielded: 321 passed (was 320; +1 quiesce
  regression).
- platform-wallet --lib default: 217 passed (unchanged).

Closes the deferred thepastaclaw 3480216821 (refcount suggestion),
3480216826 (test-doesn't-exercise nitpick — the renamed test now does),
and 3480287466 (prior-3 carry-forward, public quiesce vs. held gate).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BcZFae6ABdoMEvCBQZnaJF

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Carried forward from prior review (still valid at bcf9388): prior-2 (blocking — Swift clearLocalState still wipes SwiftData on ErrorShutdownIncomplete), prior-3 (suggestion — public quiesce() drops AtomicFlagGuard on the shared atomic), prior-4 (suggestion — ClearingGuard uses set membership), prior-5 (nitpick — nested-guards test doesn't exercise the documented lapse). prior-1 is partially addressed: the new is_clearing short-circuit + SEC-002 regression test close the common race, but the check and the gate-write are still two distinct critical sections, leaving a narrow residual TOCTOU window — carried forward as a suggestion. No new defects identified in the latest delta (rs-platform-wallet-storage / secrets / keyless rehydration look careful: fail-closed topology check, transactional rollback, per-row skip taxonomy, structured FFI outcome).

🔴 1 blocking | 🟡 3 suggestion(s) | 💬 1 nitpick(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift`:
- [BLOCKING] packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift:608-616: [prior-2, carried forward] `clearLocalState` wipes SwiftData rows even when `clearShielded()` throws `ErrorShutdownIncomplete` — violates the documented FFI contract
  The Rust FFI contract for `ErrorShutdownIncomplete` on `platform_wallet_manager_shielded_clear` (rs-platform-wallet-ffi/src/error.rs around the `ShieldedShutdownIncomplete → ErrorShutdownIncomplete` mapping, plus shielded_sync.rs:454) is explicit: when the manager surfaces this code, the Rust shielded store was left intact because Clear aborted before touching it (drain non-clean / orphan-alive backstop). The host must retry the clear and must NOT commit its own persistence wipe; doing so desyncs host rows from the still-populated shared tree. ShieldedService.swift:608-616 catches the throw, logs via `SDKLogger.error`, then falls through unconditionally into the SwiftData delete block at 623-628 (`PersistentShieldedNote`, `PersistentShieldedOutgoingNote`, `PersistentShieldedSyncState`, `PersistentShieldedActivity`). The comment at 606-607 ("Best-effort — failure logs but doesn't abort the wipe") codifies the exact split-brain the FFI contract was designed to prevent. The Swift surface already distinguishes this code: `PlatformWalletResultCode.errorShutdownIncomplete` exists (PlatformWalletResult.swift:47) and decodes into `PlatformWalletError.shutdownIncomplete(detail)` (line 240), so the host can branch on it. Minimum fix: on the typed `shutdownIncomplete` case (or any thrown error from `clearShielded()`, since the Rust contract is fail-closed there), short-circuit the SwiftData wipe and surface the error to the user instead of proceeding.

In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:137-148: [prior-1, partial fix landed] Residual TOCTOU between `is_clearing` check and `reopen_quiescing_gate`
  The new `is_clearing(self.worker)` short-circuit at line 145 closes the common race (clear is already holding the latch when spawn arrives), and the SEC-002 regression test in mod.rs:1071-1108 covers that ordering. But the check (line 145) and the gate-lower (line 148) are not done under a single critical section: `is_clearing` takes `lock_clearing`, returns the boolean, releases the lock; then `reopen_quiescing_gate` writes `false` to the shared atomic. If `clear_shielded_inner` acquires the latch AND raises the quiescing gate between those two points, the reopen at 148 lowers a gate Clear believes it holds continuously across drain → liveness → wipe. In that window a direct `sync_now`/`sync_wallet` going through `begin_pass` (lines 306-341) observes `quiescing == false`, claims the `is_syncing` slot, runs the pass to completion, and persists into the store about to be wiped. `quiesce_under_held_gate`'s `debug_assert!` (line 230) catches the inconsistency only in debug builds; release silently proceeds with the drain. The window is microseconds-narrow, but the design invariant is continuous gate-hold. Structural fix: have the registry expose an atomic observe-and-act under one `lock_clearing()` scope (e.g. `with_clearing_check`) so the gate write is sequenced with the latch read, or move the gate write inside the registry slot critical section alongside the latch check.
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:207-214: [prior-3, carried forward] Public `quiesce()` drops an `AtomicFlagGuard` on the same shared atomic that `clear_shielded` holds raised
  `CoordinatorLifecycle::quiesce` constructs `_quiescing_gate = AtomicFlagGuard::new(&self.quiescing)` over the same `Arc<AtomicBool>` that `clear_shielded_inner` holds raised via `hold_quiescing_gate`. `AtomicFlagGuard::Drop` unconditionally writes `false`, so any concurrent caller routed through the public `quiesce()` while a clear is in flight will lower the clear-flow's continuously-held gate. `ShieldedSyncManager::quiesce` remains `pub` and reachable through the manager's shielded accessor — one accidental call away from re-opening the same lapse Clear was hardened to prevent. Structural fix: refcount/owner-aware re-raise (so concurrent owners compose), or gate the public `quiesce` against the clearing latch and assert no other holder is active before its guard drop is allowed to fire.

In `packages/rs-dash-async/src/registry.rs`:
- [SUGGESTION] packages/rs-dash-async/src/registry.rs:596-613: [prior-4, carried forward] `ClearingGuard` uses set-membership — overlapping guards release the latch on the first drop
  `hold_clearing` does `lock_clearing().insert(key)` into a `BTreeSet<K>`, and `ClearingGuard::Drop` (lines 1099-1103) unconditionally removes the key. With two live guards for the same key (a nested or overlapping clear from a future caller, a test, or an accidental shielded re-clear before the first guard drops), the first drop removes the key while the second guard is still alive — silently lapsing the latch protection the still-live guard is supposed to provide. The new `is_clearing()` check (line 620) inherits the same accounting weakness because it reads the same set membership. The test at lines 1859-1883 acknowledges this in its comment but does not fix the type-level mismatch between the RAII guard's compositional appearance and its set-based semantics. Minimum fix: refcount per key (`BTreeMap<K, NonZeroUsize>`), or add a `debug_assert!` that the insert was novel so accidental nesting fails loudly in debug.

Note: GitHub refused to serve this PR's diff via gh pr diff because it exceeds the 20,000-line limit, so this exact-SHA review is posted body-only. Existing inline threads are reused rather than duplicated.

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Latest delta resolves prior-4 and prior-5: ClearingGuard now refcounts via BTreeMap<K, NonZeroUsize> so nested/concurrent guards compose, and the new test hold_clearing_inner_drop_does_not_lapse_outer_protection is non-vacuous and asserts the lapse window. prior-2 (Swift ShieldedService wipe-on-error) remains unchanged and blocking. prior-3 is partially addressed by the new is_clearing bail in public quiesce(), but the same check-then-act TOCTOU pattern from prior-1 now exists at both spawn_periodic_loop (lines 145-148) and quiesce (lines 217-224); collapsed into a single structural finding.

🔴 1 blocking | 🟡 1 suggestion(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift`:
- [BLOCKING] packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift:608-616: [prior-2, carried forward] clearLocalState wipes SwiftData rows even when clearShielded() throws — violates the FFI contract
  Unchanged at 6b655aab. The Rust FFI contract for `platform_wallet_manager_shielded_clear` surfaces `ErrorShutdownIncomplete` specifically when the Rust shielded store was left intact (Clear aborted before touching it via the drain/orphan-alive backstop). The host MUST NOT commit its own persistence wipe in that case, or SwiftData rows desync from the still-populated shared tree — the exact split-brain this stack is designed to prevent.

  Lines 608-616 catch the throw, log via `SDKLogger.error`, then fall through unconditionally into the SwiftData delete block at 623-628 (`PersistentShieldedNote`, `PersistentShieldedOutgoingNote`, `PersistentShieldedSyncState`, `PersistentShieldedActivity`). The comment at 606-607 ("Best-effort — failure logs but doesn't abort the wipe") codifies the wrong invariant. The Swift surface already distinguishes this code path: `PlatformWalletResultCode.errorShutdownIncomplete` decodes into `PlatformWalletError.shutdownIncomplete(detail)`, so the host can branch on it. Minimum fix: on any thrown error from `clearShielded()` (since the Rust contract is fail-closed there), short-circuit the SwiftData wipe and surface the error to the user.

In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:145-224: [prior-1 + residual prior-3] Check-then-act TOCTOU around the shared `quiescing` atomic at both `spawn_periodic_loop` and `quiesce` call sites
  Same structural pattern at two call sites — neither performs the `is_clearing` check atomically with the subsequent write to the shared `quiescing` atomic.

  1. `spawn_periodic_loop` (145-148): `is_clearing(self.worker)` at 145 takes `lock_clearing`, returns the boolean, releases the lock; `reopen_quiescing_gate()` at 148 then writes `false` to the shared atomic. If `clear_shielded_inner` acquires the clearing latch AND raises the gate between those two operations, this `reopen_quiescing_gate()` lowers a gate Clear believes it holds continuously across drain → liveness → wipe — opening a window for a direct `sync_now`/`sync_wallet` to pass `begin_pass` and persist into the store about to be wiped (even though `start_thread` itself will subsequently be refused by the latch).

  2. `quiesce` (217-224): the new fix to bail when `is_clearing` is true closes the dominant case (clear already holds the latch), but introduces the same race shape — `is_clearing` at 217 → unconditional `self.quiescing.store(true, ...)` + `AtomicFlagGuard::new(&self.quiescing)` at 223-224, whose `Drop` (`atomic.rs:25-29`) unconditionally writes `false`. If `clear_shielded_inner` acquires the latch + `hold_quiescing_gate` in the sub-microsecond window between 217 and 223, the racing quiesce installs an `AtomicFlagGuard` over the same atomic the in-flight clear is holding raised; on the quiesce's return the guard drops and lowers the gate, lapsing Clear's continuously-held barrier. The new regression test `shielded_quiesce_during_clear_preserves_quiescing_gate` only exercises the ordering where the clearing latch is acquired BEFORE the quiesce arrives.

  Structural fix (resolves both call sites): expose an atomic observe-and-act on the registry under one `lock_clearing()` scope (e.g. a `with_clearing_check(F)` callback that runs F only while the clearing-set lock is held), or move the gate write inside the registry slot's critical section alongside the latch check. Alternatively, refcount the `quiescing` atomic so concurrent owners compose like `ClearingGuard` now does post-prior-4.

Note: Posted body-only because normal inline poster failed: command failed 1: python3 scripts/review_poster.py dashpay/platform 3954 6b655aa --dry-run.

…builds

PR #3954 review thread #5 (comment 3481707091) asked for runtime visibility
when the registry's reap path runs under `panic = "abort"`: an EpilogueGuard
or AtomicFlagGuard panic during teardown aborts the process instead of
unwinding, so Drop never releases the orphan-liveness gate and the registry's
join-and-orphan-liveness invariant can stay held forever in the worst case.

Stable Rust has NO runtime API to query the active panic strategy — only
`#[cfg(panic = "...")]` at compile time. iOS release builds intentionally
choose `panic = "abort"`, so a hard gate (compile-time error or refusal to
construct) is wrong. A compile-time-gated, one-shot `tracing::warn!` from
`ThreadRegistry::with_reap_backstop` is a strict improvement over the
doc-caveat-only state today: operators get a single audit-trail line at
startup pointing them at the registry.rs / atomic.rs caveats.

The warn is gated by a process-wide `std::sync::Once` so repeated registry
construction (tests, restart paths, multi-instance hosts) fires it once.
The static is hoisted to module scope (also `#[cfg(panic = "abort")]`) so
the in-crate regression test can probe `Once::is_completed()` directly
without subscribing to tracing from a `#[test]`.

Doc caveats on `EpilogueGuard` (registry.rs) and `AtomicFlagGuard`
(atomic.rs) are extended with a one-liner pointing operators at the new
constructor-emitted warn.

Tests:
- `with_reap_backstop_emits_panic_abort_warn_under_abort_builds`
  (only compiled under abort) — proves the Once trips on first call and
  stays one-shot across a second construction. Build-cfg-pinned: if the
  cfg block in `with_reap_backstop` is ever deleted, this test disappears
  with it and CI loses the signal.
- `with_reap_backstop_no_warn_under_unwind` (only compiled under unwind) —
  sentinel proving the no-op cfg branch compiles and `with_reap_backstop`
  keeps behaving like a plain constructor on the default profile.

Verification:
- `cargo test -p dash-async --all-targets` — 36 passed (was 35).
- `cargo clippy -p dash-async --all-targets -- -D warnings` — clean.
- `cargo fmt --all -- --check` — clean.
- `cargo check -p platform-wallet -p platform-wallet-ffi --all-targets` — clean.
- `RUSTFLAGS="-C panic=abort" cargo build -p dash-async --lib` — clean.
  (`--tests` requires nightly `-Zpanic_abort_tests`; lib path covers the
  new cfg branch.)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BcZFae6ABdoMEvCBQZnaJF

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Carried-forward prior findings dominate this incremental review. prior-2 (Swift host wipes SwiftData rows on a failed Rust shielded clear) is STILL VALID and remains blocking — the catch at ShieldedService.swift:611-615 still logs-and-falls-through into the SwiftData delete block, contradicting the Rust FFI's fail-closed ErrorShutdownIncomplete contract. prior-1+prior-3 (TOCTOU between is_clearing and the shared quiescing atomic at both spawn_periodic_loop:145-148 and quiesce:217-224) is STILL VALID as a suggestion. The latest delta (e46249f) only adds a one-shot panic=abort startup warn on with_reap_backstop, doc back-references on AtomicFlagGuard/EpilogueGuard, and two cfg-pinned tests — verified correct, no new defects.

🔴 1 blocking | 🟡 1 suggestion(s)

Findings not posted inline (2)

These findings could not be anchored to the current diff, but they are still part of this review.

  • [BLOCKING] packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift:608-616: [prior-2, carried forward] clearLocalState wipes SwiftData rows even when clearShielded() throws — violates the fail-closed FFI contract — Unchanged at e46249f. The Rust FFI for platform_wallet_manager_shielded_clear surfaces ErrorShutdownIncomplete specifically when the Rust shielded store was left intact (Clear aborted before touching it via the drain / orphan-alive backstop). The host must not commit its own persistence wipe...
  • [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:145-224: [prior-1+prior-3, carried forward] Check-then-act TOCTOU between is_clearing and the shared quiescing atomic at both spawn_periodic_loop and quiesce — Unchanged at e46249f. Same structural pattern at two call sites — neither performs the is_clearing check atomically with the subsequent write to the shared quiescing atomic. 1. spawn_periodic_loop (145-148): is_clearing(self.worker) at 145 takes lock_clearing, returns the boolean, rele...
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift`:
- [BLOCKING] packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift:608-616: [prior-2, carried forward] clearLocalState wipes SwiftData rows even when clearShielded() throws — violates the fail-closed FFI contract
  Unchanged at e46249fe. The Rust FFI for `platform_wallet_manager_shielded_clear` surfaces `ErrorShutdownIncomplete` specifically when the Rust shielded store was left intact (Clear aborted before touching it via the drain / orphan-alive backstop). The host must not commit its own persistence wipe in that case, or the SwiftData rows desync from the still-populated shared tree — the exact split-brain this stack is engineered to prevent.

  Lines 608-616 catch the throw, log via `SDKLogger.error`, then fall through unconditionally into the SwiftData delete block at 623-628 (`PersistentShieldedNote`, `PersistentShieldedOutgoingNote`, `PersistentShieldedSyncState`, `PersistentShieldedActivity`). The comment at 606-607 ("Best-effort — failure logs but doesn't abort the wipe") codifies the wrong invariant for a fail-closed FFI. The Swift surface already distinguishes this code path: `PlatformWalletResultCode.errorShutdownIncomplete` decodes into `PlatformWalletError.shutdownIncomplete(detail)`, so the host can branch on it.

  Minimum fix: on any thrown error from `clearShielded()`, short-circuit the SwiftData wipe, set `lastError` to a retry-actionable message, and return so the user can retry Clear once Rust quiesces.

In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:145-224: [prior-1+prior-3, carried forward] Check-then-act TOCTOU between `is_clearing` and the shared `quiescing` atomic at both `spawn_periodic_loop` and `quiesce`
  Unchanged at e46249fe. Same structural pattern at two call sites — neither performs the `is_clearing` check atomically with the subsequent write to the shared `quiescing` atomic.

  1. `spawn_periodic_loop` (145-148): `is_clearing(self.worker)` at 145 takes `lock_clearing`, returns the boolean, releases the lock; `reopen_quiescing_gate()` at 148 then writes `false` to the shared atomic. If `clear_shielded_inner` acquires the clearing latch AND raises the gate between those two operations, this `reopen_quiescing_gate()` lowers a gate Clear believes it holds continuously across drain → liveness → wipe — opening a window for a direct `sync_now`/`sync_wallet` to pass `begin_pass` and persist into the store about to be wiped (even though `start_thread` itself will subsequently be refused by the latch).

  2. `quiesce` (217-224): the new `is_clearing` early-return closes the dominant case (clear already holds the latch), but introduces the same race shape — `is_clearing` at 217 → unconditional `self.quiescing.store(true, ...)` + `AtomicFlagGuard::new(&self.quiescing)` at 223-224, whose Drop (`atomic.rs:25-29`) unconditionally writes `false`. If `clear_shielded_inner` acquires the latch + `hold_quiescing_gate` in the sub-microsecond window between 217 and 223, the racing quiesce installs an `AtomicFlagGuard` over the same atomic the in-flight clear is holding raised; on the quiesce's return the guard drops and lowers the gate, lapsing Clear's continuously-held barrier. The new regression test `shielded_quiesce_during_clear_preserves_quiescing_gate` only exercises the ordering where the clearing latch is acquired BEFORE the quiesce arrives.

  Structural fix (resolves both call sites): expose an atomic observe-and-act on the registry under one `lock_clearing()` scope (e.g. a `with_clearing_check(F)` callback that runs F only while the clearing-set lock is held), or move the gate write inside the registry slot's critical section alongside the latch check. Alternatively, refcount the `quiescing` atomic so concurrent owners compose like `ClearingGuard` now does post-prior-4.

lklimek added 2 commits June 26, 2026 14:21
… clearLocalState (thepastaclaw 4580054386 prior-2)

Surface PlatformWalletError.shutdownIncomplete (and any other clearShielded
throw) to lastError and return from clearLocalState before reaching the
SwiftData delete block. The Rust FFI is fail-closed: a thrown error means
the Rust shielded store is still populated (drain timeout, orphan-alive
backstop, or similar), so wiping the SwiftData mirror unconditionally
produced a split-brain — UI showed empty while Rust state lived on. The
docstring at 606-607 also codified the wrong invariant ("Best-effort")
and is reworded to describe the fail-closed semantics.
…is_clearing↔gate TOCTOU (thepastaclaw 4580054386 prior-1+prior-3)

Replace the single-holder AtomicBool `quiescing` flag with an AtomicUsize
refcount, wrapped by a new RefcountedFlagGuard that composes safely across
the two paths that raise it: `quiesce()`'s local guard and the Clear flow's
`hold_quiescing_gate`. The prior design had a TOCTOU between
`is_clearing()` (lock-release) and the subsequent `store(true)` +
AtomicFlagGuard install — a clear that acquired the latch and raised the
gate in that ns-window had its barrier silently lowered on the racing
quiesce's return, because AtomicFlagGuard::Drop unconditionally writes
false. The refcount eliminates that lower-past-other-holder lapse: each
party owns its own ref, Drop decrements, the gate is "raised" iff count > 0.
Pair the drain_hook's fetch_add with a matching fetch_sub in
cancel_join_and_drain so the hook's one-shot raise doesn't leak across
coordinator restarts (this removes the need for reopen_quiescing_gate,
which has been deleted). Add a regression test that fails against the
bool implementation.

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Both prior findings are FIXED at f424caf: the Swift fail-closed path now early-returns before the SwiftData wipe when clearShielded() throws (ShieldedService.swift:618), and the is_clearing↔quiescing TOCTOU is closed by switching to a refcounted Arc with composable RefcountedFlagGuard. One new blocking issue remains: the drain_hook's unconditional fetch_add is paired with a fetch_sub that only fires when registry.quiesce() returns a non-NotRunning status AND when cancel_join_and_drain runs to completion. Both agents independently flag this asymmetry; codex via the production-reachable timeout cancellation path that clear_shielded_inner wraps around quiesce_under_held_gate, claude via the registry's Superseded / inner-NotRunning return sites. Either path leaks a +1 into quiescing, leaving the shielded coordinator permanently gated so every future begin_pass bails.

🔴 1 blocking

Findings not posted inline (1)

These findings could not be anchored to the current diff, but they are still part of this review.

  • [BLOCKING] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:275-292: drain_hook's fetch_add is not cancellation-safe and not symmetric with all of registry.quiesce's return paths — quiescing can leak +1 and permanently gate the shielded coordinatordrain_hook unconditionally does quiescing.fetch_add(+1) (line 128) and is paired with a single conditional fetch_sub(-1) here, gated on status != WorkerStatus::NotRunning. That pairing assumes a strict invariant — "the registry fires the hook iff it eventually returns a non-NotRunning sta...
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:275-292: drain_hook's fetch_add is not cancellation-safe and not symmetric with all of registry.quiesce's return paths — quiescing can leak +1 and permanently gate the shielded coordinator
  `drain_hook` unconditionally does `quiescing.fetch_add(+1)` (line 128) and is paired with a single conditional `fetch_sub(-1)` here, gated on `status != WorkerStatus::NotRunning`. That pairing assumes a strict invariant — "the registry fires the hook iff it eventually returns a non-NotRunning status, and the wallet-side future always runs to completion past line 285." Neither half of that invariant holds:

  1. **Cancellation / timeout path (production-reachable).** `clear_shielded_inner` (manager/mod.rs:412-419) wraps `quiesce_under_held_gate()` in `tokio::time::timeout`. A wedged shielded background pass can keep the inner future suspended past line 702 of `registry.quiesce` (drain hook has already been awaited and `fetch_add(+1)` has run) while it polls the join loop. When the outer timeout elapses, the future is dropped, so the matching `fetch_sub` at line 286 never runs. The `_clearing_gate` in `clear_shielded_inner` then drops, releasing only its own contribution. Net result: `quiescing` stays at +1 for the lifetime of this `CoordinatorLifecycle`, so every subsequent `begin_pass` bails on the `quiescing.load() > 0` check at line 378. The shielded coordinator is silently and permanently quiesced — manual `sync_now`, `sync_wallet`, and the background loop all stop doing work — even though the host application has handled the timeout and moved on.

  2. **Superseded / inner-NotRunning return paths inside `registry.quiesce`.** The early exit at registry.rs:693 (cancel and handle both absent → drain hook never fires) is safe under the current condition. The two NotRunning returns *inside* the join loop are not: `Step::Superseded` (registry.rs:760/777) and the inner `Step::NotRunning` (registry.rs:763/777) both happen *after* the hook has already fired at line 702, but return `WorkerStatus::NotRunning`, so the conditional `fetch_sub` is skipped — same leak. Today these branches are unreachable in the wallet because the `closing` / `hold_clearing` latches prevent same-key restart racing `quiesce`, but the comment at lines 277-284 documents an invariant stricter than what `registry.quiesce` actually guarantees. The PR description explicitly anticipates rs-dapi adopting `ThreadRegistry`; any future consumer that calls `quiesce()` without an external restart-prevention mutex inherits a silent gate leak.

  The fundamental issue is the asymmetric ownership of the refcount: the registry's `drain_hook` raises the gate, but the wallet has to release it from outside the registry, with no signal for "did the hook actually fire on this call." The cleanest fixes are (a) remove the `drain_hook` contribution to `quiescing` entirely — the public `quiesce()` already raises the gate manually before calling `cancel_join_and_drain` (line 236), and `quiesce_under_held_gate` requires the caller to hold one, so the hook's raise is redundant; or (b) restructure so the hook's contribution is held by a cancellation-safe guard the registry hands back to the caller via `quiesce`'s return value, so Drop on any path (including future cancellation) releases the hook's ref. Either way, the conditional decrement here is unsafe and the comment overstates the registry's contract.

…meout cancellation (thepastaclaw 4580546708)

The prior commit (f424caf) introduced an asymmetric refcount on
`quiescing`: drain_hook did fetch_add inside an async block, while the
matching fetch_sub lived in cancel_join_and_drain past the registry
await. Under clear_shielded_inner's tokio::time::timeout wrap, a timeout
firing mid-drain dropped the cancel_join_and_drain future after the hook
synchronously ran fetch_add but before fetch_sub could execute — leaking
the ref permanently and silently disabling the shielded coordinator for
the rest of the process (every begin_pass bails on the `> 0` check).

The hook was redundant: CoordinatorLifecycle::quiesce already raises its
own ref via RefcountedFlagGuard::raise before cancel_join_and_drain, and
quiesce_under_held_gate requires the caller to hold one. Both
RefcountedFlagGuards are cancellation-safe (Drop runs on every exit path
including a future dropped at an outer await). The lifecycle's own
docstring already documented that it raises its own gate independently
because registry.quiesce can early-return without firing the hook. The
system-wide shutdown path (registry.shutdown iterating registry.quiesce)
no longer gets a gate-raise; acceptable because that path is the destroy
contract — direct sync_now during a destroy drain is not a real concern.

Add regression test that fails against the leak and passes with the fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BcZFae6ABdoMEvCBQZnaJF

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Latest delta correctly resolves prior-1 by removing the registry-side drain_hook (worker_config now returns drain: None) and relying on the lifecycle's cancellation-safe RefcountedFlagGuard. The new regression test (quiesce_cancellation_during_drain_does_not_leak_quiescing_refcount) is non-vacuous. New issue introduced by the same delta: PlatformWalletManager::shutdown() goes through registry.shutdown() -> registry.quiesce(key), never through CoordinatorLifecycle::quiesce(), so the per-coordinator quiescing gate is no longer raised at all during manager shutdown. The pre-fix drain hook used to provide that barrier; both reviewers independently flag its loss. The mod.rs:455 docstring is now stale relative to the implementation, and the prior race window where a concurrent host-issued sync_now/sync_wallet can slip past begin_pass during destroy is reopened.

🔴 1 blocking

Carried-forward prior findings

  • Prior f424caf blocker is FIXED: FIXED at commit 36aa3a3 via the recommended fix (a) from the prior review: CoordinatorLifecycle::worker_config() now returns drain: None (coordinator_lifecycle.rs:111-117), so no registry-side fetch_add is ever paired with a wallet-side conditional fetch_sub. The gate is now raised exclusively by the lifecycle's own RefcountedFlagGuardquiesce()'s local guard at line 221 and the Clear flow's hold_quiescing_gate — both of whose Drop is cancellation-safe (runs on future-drop unwind). New non-vacuous regression test quiesce_cancellation_during_drain_does_not_leak_quiescing_refcount (lines 613-711) aborts a quiesce() future mid-drain while a background pass is parked in std::thread::sleep and asserts the refcount returns to 0 — it would time out against the prior drain-hook design. Both the cancellation/timeout leak path and the Superseded/inner-NotRunning paths from the prior finding are eliminated by removing the asymmetric ownership entirely. (Note: the removal also eliminates the gate during PlatformWalletManager::shutdown() — see the new blocking finding above.)

New findings in the latest delta

In packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:

  • [BLOCKING] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:99-117: Manager shutdown path no longer raises the quiescing gate — direct sync_now/sync_wallet can slip past begin_pass during FFI destroy
    Setting worker_config().drain = None correctly closes the cancellation-safety hole flagged in prior-1, but it also removes the only mechanism by which PlatformWalletManager::shutdown() raised the per-coordinator quiescing gate. shutdown() (manager/mod.rs:487-489) calls self.registry.shutdown().await, which iterates workers and calls self.quiesce(key) on the registry directly (registry.rs:834-866). It never goes through CoordinatorLifecycle::quiesce() (the only path that now raises the gate via the local RefcountedFlagGuard at line 221) or quiesce_under_held_gate() (the Clear flow's path). With drain: None, registry.quiesce runs no hook and the gate stays at 0 throughout the entire teardown.

Consequences:

  1. Production race during FFI destroy. The FFI bridge invokes PlatformWalletManager::shutdown(), which is meant to be the use-after-free barrier: cancel the periodic loops, drain any in-flight pass, then return so the host can free the persister and event-handler context. With the gate never raised, a concurrent host-issued sync_now / sync_wallet on another thread can pass begin_pass's quiescing > 0 check (line 357), claim is_syncing, run its body after the registry has cancelled/joined the background loop, and call persister.store(...) or fire a host completion callback after destroy returned and the host freed the callback context. The pre-fix drain hook's fetch_add ran inside registry.quiesce and held the gate up across each per-worker teardown, blocking exactly this race. That barrier is now gone for the shutdown path, even though it is still in place for lifecycle.quiesce() direct callers and the shielded Clear flow.

  2. Stale documented invariant. The mod.rs:455 docstring on shutdown() still claims "Each worker's drain raises its quiescing gate, cancels the loop, and joins its OS thread / task" and uses that to argue the use-after-free contract holds. The implementation no longer matches that contract: the docstring will mislead anyone (including future Claude) reasoning about destroy-time safety.

Fix options: (a) have PlatformWalletManager::shutdown() route per-coordinator teardown through each coordinator's lifecycle.quiesce() instead of (or in addition to) registry.shutdown(), so the lifecycle-owned RefcountedFlagGuard is in scope across the drain — its Drop is already cancellation-safe so this does not reintroduce prior-1's leak; (b) reintroduce a cancellation-safe drain hook (e.g. a hook that returns the RefcountedFlagGuard to be held by the registry across the same await frame as the join poll, dropped on every exit path including future cancellation), so registry-level teardown still raises the gate without the asymmetric fetch_add/fetch_sub of the original design. Either way, update the mod.rs:455 docstring so the documented destroy-time invariant matches what the code actually does.

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:
- [BLOCKING] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:99-117: Manager shutdown path no longer raises the quiescing gate — direct sync_now/sync_wallet can slip past begin_pass during FFI destroy
  Setting `worker_config().drain = None` correctly closes the cancellation-safety hole flagged in prior-1, but it also removes the only mechanism by which `PlatformWalletManager::shutdown()` raised the per-coordinator `quiescing` gate. `shutdown()` (manager/mod.rs:487-489) calls `self.registry.shutdown().await`, which iterates workers and calls `self.quiesce(key)` on the registry directly (registry.rs:834-866). It never goes through `CoordinatorLifecycle::quiesce()` (the only path that now raises the gate via the local `RefcountedFlagGuard` at line 221) or `quiesce_under_held_gate()` (the Clear flow's path). With `drain: None`, `registry.quiesce` runs no hook and the gate stays at 0 throughout the entire teardown.

Consequences:

1. Production race during FFI `destroy`. The FFI bridge invokes `PlatformWalletManager::shutdown()`, which is meant to be the use-after-free barrier: cancel the periodic loops, drain any in-flight pass, then return so the host can free the persister and event-handler context. With the gate never raised, a concurrent host-issued `sync_now` / `sync_wallet` on another thread can pass `begin_pass`'s `quiescing > 0` check (line 357), claim `is_syncing`, run its body after the registry has cancelled/joined the background loop, and call `persister.store(...)` or fire a host completion callback after `destroy` returned and the host freed the callback context. The pre-fix drain hook's `fetch_add` ran inside `registry.quiesce` and held the gate up across each per-worker teardown, blocking exactly this race. That barrier is now gone for the shutdown path, even though it is still in place for `lifecycle.quiesce()` direct callers and the shielded Clear flow.

2. Stale documented invariant. The mod.rs:455 docstring on `shutdown()` still claims "Each worker's drain raises its `quiescing` gate, cancels the loop, and joins its OS thread / task" and uses that to argue the use-after-free contract holds. The implementation no longer matches that contract: the docstring will mislead anyone (including future Claude) reasoning about destroy-time safety.

Fix options: (a) have `PlatformWalletManager::shutdown()` route per-coordinator teardown through each coordinator's `lifecycle.quiesce()` instead of (or in addition to) `registry.shutdown()`, so the lifecycle-owned `RefcountedFlagGuard` is in scope across the drain — its Drop is already cancellation-safe so this does not reintroduce prior-1's leak; (b) reintroduce a cancellation-safe drain hook (e.g. a hook that returns the `RefcountedFlagGuard` to be held by the registry across the same await frame as the join poll, dropped on every exit path including future cancellation), so registry-level teardown still raises the gate without the asymmetric fetch_add/fetch_sub of the original design. Either way, update the mod.rs:455 docstring so the documented destroy-time invariant matches what the code actually does.

Note: Posted as a body-only review because GitHub refuses gh pr diff for this PR with PullRequest.diff too_large (>20,000 lines), so the normal inline mapper could not run.

… in shutdown()

Dropping the registry drain_hook (worker_config().drain = None) closed
the cancellation-safety leak but removed the only mechanism that raised
the per-coordinator `quiescing` gate during PlatformWalletManager::
shutdown(). The race itself was unreachable via FFI (the handle-storage
RwLock serializes destroy's `remove` against every sync_now/sync_wallet
`with_item`), but the manager lost a self-contained invariant and the
mod.rs shutdown docstring became false.

shutdown() now quiesces the weight-0 coordinators first, concurrently,
each via CoordinatorLifecycle::quiesce() whose RefcountedFlagGuard holds
the gate raised across cancel -> join -> in-flight-pass drain. A
concurrent direct sync_now/sync_wallet reaching begin_pass observes
`quiescing > 0` and bails, so no new pass can claim is_syncing
mid-teardown. The guard's Drop returns the refcount to 0 on every exit
path (incl. await-cancellation), so no fetch_add/fetch_sub-across-await
leak is reintroduced. The weight-10 event adapter that sinks the stores
is torn down afterward via ThreadRegistry::shutdown(), preserving the
ordering that keeps the store sink alive past the coordinator loops.

Corrected the shutdown() docstring to describe the lifecycle-owned gate
(not a registry drain hook) and made the multi-thread-runtime precondition
explicit (ThreadRegistry::shutdown already asserts it; shutdown() now
fails fast). Added regression test
shutdown_raises_quiescing_gate_during_teardown asserting the gate is
raised mid-teardown, a concurrent sync_now does no work, and the refcount
returns to 0 (fails against the prior registry-only shutdown body).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01P47BQ3KtJ8YaS4o5gTBJdB
@Claudius-Maginificent

Copy link
Copy Markdown
Collaborator Author

Blocking finding addressed in 6c9ed0a890 — both the verification result and the structural fix.

On the asserted UAF (refuted at HEAD, but the point stands): the concurrent direct sync_now/sync_wallet vs. FFI destroy race is not actually reachable at this HEAD — the FFI HandleStorage RwLock already serializes them: every sync entrypoint runs under with_item (read lock held across block_on(sync_now)), while destroy takes the write lock via remove before block_on(shutdown()). They can never overlap. So no production UAF existed.

But the mechanical observations were correct and worth fixing: with drain: None, shutdown() no longer raised the per-coordinator quiescing gate, the manager leaned on the FFI lock rather than being self-contained, and the mod.rs docstring was stale. We took your option (a):

  • shutdown() now quiesces the weight-0 coordinators first, concurrently, each via CoordinatorLifecycle::quiesce() — whose RefcountedFlagGuard holds the gate raised across cancel → join → in-flight-pass drain. A concurrent direct sync reaching begin_pass now observes quiescing > 0 and bails, even without the FFI lock.
  • No registry drain hook reintroduced, so option (b)'s fetch_add/fetch_sub-across-await leak (the one HEAD just fixed) stays fixed. The guard's Drop returns the refcount to 0 on every exit path including await-cancellation.
  • Weight ordering preserved: the weight-10 event adapter that sinks the stores is torn down after the coordinators, via ThreadRegistry::shutdown().
  • mod.rs:455 docstring rewritten to describe the lifecycle-owned gate (not a registry drain hook) and the multi-thread-runtime precondition (which ThreadRegistry::shutdown already asserts; shutdown() now fails fast up front).
  • New non-vacuous regression test shutdown_raises_quiescing_gate_during_teardown — asserts the gate is raised mid-teardown, a concurrent sync_now does no work, and the refcount returns to 0. Verified it fails against the prior registry-only shutdown() body.

cargo fmt clean, cargo clippy -p platform-wallet --all-targets 0 warnings, cargo test -p platform-wallet 221 passed / 0 failed.

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The prior shutdown-gate blocker is fixed in the current head because manager shutdown now routes coordinator teardown through each lifecycle quiesce(). However, the new implementation releases those lifecycle gates before the registry shutdown latch is set, leaving a narrow but real post-quiesce race where direct syncs or restarts can enter after the coordinator barrier but before the rest of shutdown is closed.

🔴 1 blocking

Findings not posted inline (1)

These findings could not be anchored to the current diff, but they are still part of this review.

  • [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:516-525: Coordinator gates are released before shutdown is latched closedquiesce_coordinators().await raises each lifecycle gate only for the duration of that coordinator's quiesce(). As soon as line 516 returns, those guards have dropped, but ThreadRegistry::shutdown() has not yet set the registry closing latch until line 523 begins polling the registry futur...
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-platform-wallet/src/manager/mod.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:516-525: Coordinator gates are released before shutdown is latched closed
  `quiesce_coordinators().await` raises each lifecycle gate only for the duration of that coordinator's `quiesce()`. As soon as line 516 returns, those guards have dropped, but `ThreadRegistry::shutdown()` has not yet set the registry `closing` latch until line 523 begins polling the registry future. A concurrent Rust caller holding the manager or coordinator `Arc` can enter `sync_now()` / shielded `sync_wallet()` in that gap, see `quiescing == 0`, claim `is_syncing`, and continue persisting or firing callbacks while shutdown tears down the event adapter and returns. A concurrent `start()` can also install a fresh coordinator before `closing` is set; the registry may then correctly report that new worker's shutdown status, but the merge at lines 524-525 overwrites it with the stale status captured before the restart. Hold the coordinator teardown gates, or set an equivalent manager/registry closing barrier, across the entire registry shutdown phase before reporting that shutdown is clean.

…le shutdown teardown

Round-1's shutdown hardening (6c9ed0a) held each coordinator's
`quiescing` gate only for the duration of its own `quiesce()`. The local
`RefcountedFlagGuard`s dropped the moment `quiesce_coordinators()`
returned, so the gates were back at 0 while `registry.shutdown()` tore
down the weight-10 event adapter (the store sink) and reaped orphans.
`begin_pass` checks only the per-coordinator `quiescing` atomic, not the
registry `closing` latch, so a direct `sync_now`/`sync_wallet` in that
window passed the gate, claimed `is_syncing`, and could persist / fire a
host callback while the sink was being torn down — defeating the
self-contained barrier the round-1 change was meant to provide. (Still
unreachable via FFI — the HandleStorage RwLock serializes destroy's
write/remove against sync's read/with_item — but reachable by any
downstream Rust consumer of the public API.)

shutdown() now acquires a held `RefcountedFlagGuard` per coordinator via
`hold_coordinator_gates()` BEFORE quiescing and keeps them in scope across
`quiesce_coordinators()`, `registry.shutdown()`, and the report build —
dropped only at function end. The refcount (composing with each quiesce's
own transient raise) stays > 0 for the entire teardown and returns to 0
only once shutdown completes or unwinds (Drop is cancellation-safe). A
direct sync now bails at `begin_pass` throughout, including during the
event-adapter join.

Also hardened the per-worker report merge: overwrite a coordinator's slot
with its quiesce status ONLY when the registry classified it
`NotRunning`. A concurrent `start()` that slips a fresh worker in before
the registry latches closed is joined+reported live by
`registry.shutdown()`; that live status is now preserved instead of being
clobbered by the stale pre-restart status (which could mask a wedged
worker as clean).

Added `hold_quiescing_gate()` wrappers on the address + identity
coordinators (shielded already had one). New regression test
`shutdown_holds_gate_through_registry_teardown` parks a wedged orphan so
the registry reap blocks ~1s, then asserts the gate is still raised in
that phase, a direct sync bails on `quiescing > 0`, and the refcount
returns to 0 only after shutdown completes (fails against the
6c9ed0a body where the gate drops at quiesce_coordinators return).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01P47BQ3KtJ8YaS4o5gTBJdB
@Claudius-Maginificent

Copy link
Copy Markdown
Collaborator Author

Round-2 blocking finding fixed in 17e1c07901. Good catch — verified it was a real hole in the round-1 fix, not just a phantom.

Confirmed: round-1 held each coordinator's quiescing gate only for the duration of its own quiesce() (the RefcountedFlagGuard was local to lifecycle.quiesce). On quiesce_coordinators() return the gates were back at 0 while registry.shutdown() tore down the weight-10 event adapter + reaped orphans, and begin_pass checks only the per-coordinator quiescing atomic — so a direct sync_now/sync_wallet in that gap could persist while the store sink was torn down. (Unreachable via FFI — the HandleStorage RwLock serializes destroy's remove/write vs sync's with_item/read — but reachable by any downstream Rust consumer of the public API, which defeats the self-containment goal, so it's worth fixing.)

Fix: shutdown() now acquires a held RefcountedFlagGuard per coordinator via hold_coordinator_gates() before quiescing and keeps them in scope across quiesce_coordinators(), registry.shutdown(), and the report build — dropped only at function end. The refcount stays > 0 for the entire teardown (composing with each quiesce's own transient raise) and returns to 0 only when shutdown completes/unwinds (Drop is cancellation-safe). Confirmed holding the gate across registry.shutdown() doesn't deadlock — the registry never reads the per-coordinator quiescing atomic.

Restart/merge: also fixed. The per-worker merge now overwrites a coordinator's slot only when the registry classified it NotRunning, so a concurrent start()'s fresh, live worker status is preserved instead of being clobbered by the stale pre-restart status.

Test: new shutdown_holds_gate_through_registry_teardown parks a wedged orphan so the registry reap blocks ~1s, then asserts the gate is still raised in that phase, a direct sync_now() bails purely on quiescing > 0, and the refcount returns to 0 only after shutdown completes. Verified it fails against the 6c9ed0a890 body and passes after the fix.

cargo fmt clean, cargo clippy --package platform-wallet --all-targets 0 warnings, cargo test --package platform-wallet --lib 222 passed / 0 failed.

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Carried-forward prior findings: prior-1 is FIXED. The current head raises every coordinator gate before coordinator quiesce and holds those guards through registry shutdown and status merging, and I found no new in-scope latest-delta findings. Targeted test execution could not complete because the build needed to download Tenderdash proto sources and the network escalation retry was rejected.

Note: Posted body-only because GitHub refused the PR diff (PullRequest.diff too_large, over 20,000 lines), so inline mapping could not run. The verified output contained no findings.

… join [#3954, independent on v3.1-dev]

Introduce a shared `ThreadRegistry<WalletWorker>` lifecycle engine in
`dash-async` (weight-ordered graceful shutdown: lower weights drain first,
equal weights concurrently; reap-or-park orphan handling; cancellation-safe
`RefcountedFlagGuard`). Migrate the three periodic sync coordinators
(identity / platform-address / shielded) onto a shared `CoordinatorLifecycle`
that drives them through the registry and gates passes via an
`is_syncing` / `quiescing` handshake.

`PlatformWalletManager::shutdown()` now holds every coordinator's quiescing
gate across the whole teardown, quiesces the weight-0 coordinators
concurrently, then drains the weight-10 wallet-event adapter and any parked
orphans, returning a per-worker `ShutdownReport`. `clear_shielded` refuses
(leaving the commitment-tree store intact) when the in-flight pass does not
drain cleanly, surfaced as `ShieldedShutdownIncomplete` /
`ErrorShutdownIncomplete = 19`. FFI `destroy` reports the incomplete-shutdown
code so the Swift host defers freeing its callback context.

Reconstructed independently on a clean v3.1-dev base: only the graceful-
shutdown contribution is included; the drifted #3692 rehydration/storage
work merged into the source branch is excluded.

<sub>🤖 Co-authored by [Claudius the Magnificent](https://github.com/lklimek/claudius) AI Agent</sub>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lklimek lklimek force-pushed the feat/platform-wallet-shutdown-join branch from 17e1c07 to a89e0f9 Compare June 29, 2026 13:44

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

No carried-forward prior findings from the 17e1c07 review; the earlier shutdown-gate issue remains fixed. I verified one new in-scope latest-delta regression: the SQLite same-path open guard was removed while buffers remain per-persister. The destroy finding was dropped because the PR explicitly documents destroy as terminal and the Swift wrapper implements that contract by retaining callback contexts on incomplete shutdown.

🔴 2 blocking

Findings not posted inline (1)

These findings could not be anchored to the current diff, but they are still part of this review.

  • [BLOCKING] packages/rs-platform-wallet-storage/src/sqlite/persister.rs:205-208: Same-path SQLite opens now create divergent buffers — The latest delta removed the process-wide canonical-path registry and the AlreadyOpen error, so open() now returns a fresh SqlitePersister with its own Buffer::new() for a database path that may already be live in this process. SQLite can serialize committed transactions, but it cannot co...
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-platform-wallet-storage/src/sqlite/persister.rs`:
- [BLOCKING] packages/rs-platform-wallet-storage/src/sqlite/persister.rs:205-208: Same-path SQLite opens now create divergent buffers
  The latest delta removed the process-wide canonical-path registry and the `AlreadyOpen` error, so `open()` now returns a fresh `SqlitePersister` with its own `Buffer::new()` for a database path that may already be live in this process. SQLite can serialize committed transactions, but it cannot coordinate these Rust-side buffers: `store()` stages data in the handle-local buffer, `flush()` drains only that buffer, and `load()` reads through only that persister's connection. Two in-process managers using the same wallet DB can therefore miss each other's pending state or commit competing buffered changes in last-flush-wins order. Restore the same-path guard, or move the buffer into a shared path-scoped owner before allowing multiple live handles for one path.
- [BLOCKING] packages/rs-platform-wallet-storage/src/sqlite/persister.rs:205-208: Same-path SQLite opens now create divergent buffers
  The latest delta removed the process-wide canonical-path registry and the `AlreadyOpen` error, so `open()` now returns a fresh `SqlitePersister` with its own `Buffer::new()` for a database path that may already be live in this process. SQLite can serialize committed transactions, but it cannot coordinate these Rust-side buffers: `store()` stages data in the handle-local buffer, `flush()` drains only that buffer, and `load()` reads through only that persister's connection. Two in-process managers using the same wallet DB can therefore miss each other's pending state or commit competing buffered changes in last-flush-wins order. Restore the same-path guard, or move the buffer into a shared path-scoped owner before allowing multiple live handles for one path.

@Claudius-Maginificent

Copy link
Copy Markdown
Collaborator Author

Re: thepastaclaw's 2026-06-29 re-review (persister.rs:205-208, same-path opens → divergent per-persister buffers).

This is a storage-crate concern that belongs to #3968, not this PR. #3954 (rs-dash-async ThreadRegistry + manager lifecycle) has a zero-line diff against rs-platform-wallet-storage — it does not touch persister.rs or the SQLite open path. Under the de-stacking this reads as "removed" only because the storage hardening lives in the sibling PR, not here.

The concern itself is real and already addressed in #3968: the process-wide canonical-path registry + WalletStorageError::AlreadyOpen guard is implemented at persister.rs:78-110 (is_path_open() / open() canonicalizes the path and refuses a second in-process open of the same DB), so a same-path re-open returns AlreadyOpen instead of a divergent-buffer persister.

Dispositioning as out of scope for #3954; tracked and fixed in #3968.

🤖 Co-authored by Claudius the Magnificent AI Agent

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Thanks for the correction. I rechecked current head a89e0f97daefc9363005644a77c0b1b4fa000509 and agree the same-path SQLite open finding is out of scope for #3954: this PR's diff has no rs-platform-wallet-storage files, and the storage persister guard belongs to the sibling storage PR #3968 / its branch context rather than this ThreadRegistry + wallet lifecycle PR.

I am superseding my 2026-06-29 CHANGES_REQUESTED review for #3954. I do not see any remaining in-scope blockers on the current #3954 head from the prior shutdown-gate review trail.

Bring the PR branch up to date with the new base v4.1-dev
(retarget from previous base). Clean merge, no conflicts.

<sub>🤖 Co-authored by [Claudius the Magnificent](https://github.com/lklimek/claudius) AI Agent</sub>
@Claudius-Maginificent Claudius-Maginificent changed the base branch from v4.0-dev to v4.1-dev July 1, 2026 09:34
@github-actions github-actions Bot added this to the v4.1.0 milestone Jul 1, 2026
lklimek and others added 4 commits July 2, 2026 11:04
…d destroy is the join point

No FFI test exercised the `*_stop` symbols. Add a shielded_sync test
module that parks a shielded pass in flight (slow completion callback
holding `is_syncing`) and asserts:

- `platform_wallet_manager_shielded_sync_stop` returns while the pass is
  still parked and well under the park time — it signals cancellation and
  does not `block_on` the quiesce.
- `platform_wallet_manager_destroy` reports a clean `ShutdownReport` and
  only returns once the parked pass has drained — it is the real
  join/drain point, mirroring `identity_sync_stop`/`platform_address_sync_stop`.

Plus a fast no-op case: `stop` on a never-started coordinator is a prompt
`Success` and `destroy` on the idle manager reports clean.

The test needs a mock SDK, so re-enable `dash-sdk`'s `mocks` feature as a
dev-dependency (dev-only; the production cdylib/staticlib never pulls it in).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ator shutdown internals

Apply the present-state / no-ephemeral-ID comment rules to the shutdown
lifecycle code:

- registry.rs: move the `TODO(rs-dapi-adoption)` block out of `start_task`'s
  doc span and shrink it; strip `TC-NNN`/`GAP-NNN` (+ `F2`/`R4`) tags from
  test doc comments AND rename the test functions to plain descriptive
  snake_case; rewrite the "pre-refactor"/"no longer" backward narration to
  state the current invariant directly.
- coordinator_lifecycle.rs: rewrite the `worker_config` and `quiesce` docs
  and the drain-cancellation regression test to present-tense design
  rationale (drop "prior TOCTOU"/"leaked forever"/"pre-fix"/removed-drain-hook
  narration); trim the verbose test-setup rationale to why `pass_entered`
  is needed.
- mod.rs: strip `SEC-001`/`SEC-002`/`TC-002 (F2)` prefixes and the bare
  `F1:` label (keeping the claim); reframe the "against the prior body /
  pre-fix quiesce / prior behaviour" Non-vacuous docs and inline comments
  as hypothetical contrasts.

Preserved deliberately: the module safety-invariants header, the
slots->orphans lock-nesting note, and the LOAD-BEARING Dekker-handshake
SeqCst rationale. Left the forward-hypothetical `Non-vacuous:` comments in
mod.rs (they describe a broken alternative impl, not code history) — valuable
non-vacuousness rationale, not backward narration.

Comment- and test-name-only; no production logic, control flow, or public
API signatures changed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…se onto local shutdown work

Reconciliation merge (real merge commit, no rebase/force-push). origin was
rebased onto v4.1-dev, which advances the base to the Dash 4.0.0 release —
rust-dashcore 0.44.0 (rev 991c6ebe) and grovedb 5.0.0. This merge folds that
base bump into local's verified ThreadRegistry / coordinator-shutdown work.

The 4 conflicts all resolved to local's side — local is strictly ahead on every
one, origin carried zero unique content worth keeping:

- rs-dash-async/src/registry.rs (add/add): byte-identical to local's pre-sweep
  commit; local HEAD additionally has the comment-hygiene sweep.
- rs-platform-wallet/src/manager/coordinator_lifecycle.rs (add/add): same —
  byte-identical to local's pre-sweep commit.
- rs-platform-wallet/src/changeset/core_bridge.rs: local carries the
  drain-buffered-events-on-cancel work plus its two follow-up hardening commits;
  origin lacks all three.
- rs-platform-wallet/src/manager/mod.rs: all 13 conflict hunks are comment-only
  (local's sweep drops ephemeral review IDs and history phrasing). The
  event_manager field stays ungated: local's load_from_persistor reads it
  unconditionally, so origin's shielded-gating (correct only for origin's tree,
  which lacks that rework) would break the non-shielded build.

The ThreadRegistry/CoordinatorThreadStatus collapse is already present on both
sides; nothing to reapply. All shutdown-teardown functionality carried forward
with no functional loss (verified by empty diffs against pre-merge HEAD).

@thepastaclaw thepastaclaw left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The prior blocking finding (same-path SQLite opens creating divergent buffers) is FIXED at 6594bf3 — the process-wide open_path_registry, register_open_path/release_open_path, AlreadyOpen guard, and Drop-based release are all restored in packages/rs-platform-wallet-storage/src/sqlite/persister.rs (lines 87-111, 234). The latest delta (a89e0f9..6594bf3) is a documentation/test-name comment-hygiene sweep on registry.rs/coordinator_lifecycle.rs/manager/mod.rs plus a new FFI regression test proving platform_wallet_manager_shielded_sync_stop is cancel-only while platform_wallet_manager_destroy is the actual join point. No new in-scope blocking or suggestion-level defects identified across six parsed Claude-family reviewer outputs.

_Note: Normal review_poster dry-run failed (Traceback (most recent call last):
File "/Users/claw/.openclaw/workspace/scripts/review_poster.py", line 1163, in
result = post_review(
File "/Users/claw/.openclaw/workspace/scripts/review_poster.py", line 874, in post_review
diff_text = _gh("pr", "diff", str(pr_number), "--repo", repo)
File "/Users/claw/.openclaw/workspace/scripts/review_poster.py", line 96, in gh
raise RuntimeError(f"gh {' '.join(args)} failed: {result.stderr.strip()}")
RuntimeError: gh pr diff 3954 --repo dashpay/platform failed: could not find pull request diff: HTTP 406: Sorry, the diff exceeded the maximum number of lines (20000) (https://api.github.com/repos/dashpay/platform/pulls/3954)
PullRequest.diff too_large), so I posted the verified findings as a top-level exact-SHA review body.

Reviewed commit: 6594bf3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants