feat(platform-wallet)!: shared ThreadRegistry for coordinator lifecycle + shutdown UAF/data-loss fixes#3954
feat(platform-wallet)!: shared ThreadRegistry for coordinator lifecycle + shutdown UAF/data-loss fixes#3954Claudius-Maginificent wants to merge 330 commits into
Conversation
…entity-key writes (CMT-005/004) platform_addrs::apply rejects an entry whose wallet_id differs from the flush scope with the typed WalletIdMismatch before the INSERT (the native FK now also backstops it, but the typed error pinpoints the mismatch). identity_keys::apply rejects an upsert whose entry (identity_id, key_id, wallet_id) disagrees with its map key / flush scope (IdentityKeyEntryMismatch), so the typed columns and the serialized blob can never describe different rows on disk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…peek errors (CMT-007/009/010) Remove the delete-wallet subcommand (CLI surface only; the library delete_wallet API stays). peek_schema_version now returns Result and distinguishes a fresh DB (no history table -> Ok(None)) from a transient open/query failure (Err), so `migrate` can no longer print a wrong `applied: N` from a swallowed error. run_backup defers refuse-to-overwrite to backup_to's typed BackupDestinationExists instead of a duplicate is_file() guard. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…i-account, overflow guards) New tests/sqlite_hardening_3625.rs proves: native FK rejects an orphan child and an identity_keys row with no identities parent; mixed-wallet platform-address write fails typed (CMT-005); identity-key entry/key-vs -blob mismatch is rejected (CMT-004); multi-account UTXOs bucket to their real account and an undeclared address defaults to 0 (CMT-003/011); the birth_height and sync_height overflow cases error rather than truncate (CMT-012/014); a compaction-marker-only wallet survives load (CMT-002). Update existing tests for the rewritten schema: TC-027 derived-address insert carries account_index; the error-classification exhaustiveness table covers the three new variants; TC-072 asserts the delete-wallet subcommand is gone; FK test header drops the trigger wording. A conn.rs unit test exercises the foreign-key read-back assertion. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ope conn doc (QA-001/004) TC-048 now explicitly asserts the UTXO row survives the transaction delete (COUNT==1) with wallet_id/value/account_index preserved, in addition to spent_in_txid being NULL — so a future change turning the single-column trigger into a cascading DELETE fails loudly instead of passing implicitly. Scope the conn.rs choke-point doc to library connection paths and note the CLI's read-only peek_schema_version probe opens directly (no mutations, and open_conn is pub(crate) so the separate bin target can't reach it). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ency CMT-003's pure-SQL account_index lookup removed the last reference to key-wallet-manager; cargo machete (macOS CI) flagged it as unused. No `.rs` file references key_wallet_manager under any cfg/feature, and the optional dep carried no extra feature activation, so removal is clean — not papered over with a machete ignore. Drop the dependency line and its `dep:key-wallet-manager` wiring from the sqlite feature. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cOS path symmetry `backup_to` canonicalizes its returned path; on macOS the temp dir lives under `/var` (symlink to `/private/var`), so the canonical `written` path no longer prefixes the un-canonicalized `out_dir` and the `starts_with` assertion failed (Linux has no such symlink, hence green locally). Canonicalize the expected dir before the check, mirroring TC-032. TC-031 is the only affected assertion — TC-032 was already symmetric, and TC-035 only feeds the canonical path back into restore_from without comparing it to a temp path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…L/SHM perms (CMT-001/003/004) CMT-001 (BLOCKING data-loss): restore_from() unlinked <dest>-wal / -shm BEFORE staging + integrity / schema / max-version checks. A failed gate left the main DB intact (NamedTempFile drop) but the sidecars were gone — un-checkpointed pages lost. Move the unlink to immediately BEFORE tmp.persist() so both succeed atomically or neither runs. CMT-003 (HIGH security): the create_new() call in run_to() created backup files at the umask-default mode before apply_secure_permissions chmodded them to 0o600. Set OpenOptionsExt::mode(0o600) under cfg(unix) so the file is never world/group readable, even briefly. Keep the post-open chmod as defense-in-depth (Windows no-op, re-tightens). CMT-004 (HIGH security): apply_secure_permissions only chmodded the main DB; WAL/SHM siblings inherited the process umask. Extend the same helper to chmod <path>-wal / <path>-shm to 0o600 when present (silent skip when absent, Windows no-op). One function, idempotent — matches the user preference for a single API surface. Regressions: - tests/sqlite_restore_staged_validation::rejected_restore_leaves_wal_shm_siblings_intact - tests/sqlite_permissions::wal_and_shm_sidecars_are_chmodded_0o600 Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…ore_from (CMT-010) restore_from() was streaming the entire source DB into a NamedTempFile in the destination's parent dir BEFORE running the schema-history / max-version gate. A caller passing a valid-but-huge incompatible DB (CLI / FFI / UI import) filled the wallet partition before being rejected. Add cheap source-side schema-history presence + MAX(version) checks against the source handle that was already opened for integrity_check — failing fast saves the disk. Keep the staged-copy gate as the TOCTOU-safe final check: a concurrent swap during the restore window still gets rejected because the second check binds to the bytes actually persisted. Factored shared logic into migrations::max_supported_version() + migrations::assert_schema_version_supported() so the open()-path gate (CMT-005, next commit) reuses the same code. Regression: tests/sqlite_restore_staged_validation::forward_version_rejected_before_staging Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
… failure (CMT-002 data-loss) delete_wallet_inner drained the per-wallet buffered changeset BEFORE the existence check / auto-backup / DELETE transaction, but never restored it on any pre-commit failure. The operator's delete correctly reported the backup or SQL error, yet the wallet's pending writes vanished from the buffer even though no DELETE actually committed. Mirror flush_inner's take/restore discipline: hold the drained changeset in a Cell<Option<...>>, on success consume it AFTER tx.commit(), on any error path restore it into the buffer via self.buffer.restore(wallet_id, cs). If the restore itself fails (e.g. poisoned buffer mutex), log the lost changeset at error level — the delete error still surfaces (it's the primary cause). Regression: tests/sqlite_delete_wallet::delete_wallet_restores_buffer_on_backup_failure points auto_backup_dir at an unwritable path, primes a buffered changeset, asserts (a) delete returns AutoBackupDirUnwritable and (b) the buffered writes survive — a follow-up flush lands them. Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…ic with restore_from (CMT-005) SqlitePersister::open ran the migration probe (count_pending) and refinery::run, but never compared the on-disk MAX(version) against the highest embedded migration. An older binary opening a DB produced by a newer release saw pending_count == 0 and silently proceeded to decode forward-schema blob columns — the migration version IS the entire compatibility boundary for this crate (per blob.rs:4-6). Add the same SchemaVersionUnsupported gate restore_from enforces, calling the shared migrations::assert_schema_version_supported helper introduced in the CMT-010 commit. Sites are now symmetric. Regression: tests/sqlite_open_version_gate::open_rejects_forward_schema_version forges a version=1_000_000 row, asserts the second open fails with SchemaVersionUnsupported. Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…ent stores (CMT-008) delete_wallet_inner previously released and re-acquired the connection Mutex three times (existence-check / auto-backup / DELETE-tx). A parallel store(wallet_id, cs) could slip in between any pair of those scopes — in Immediate mode the racing flush landed AFTER the DELETE commit (wallet reappears), in Manual mode a buffered store survived the delete and was persisted by the next commit_writes() against a freshly-recreated parent. Approach (a): acquire self.conn()? ONCE at the top of delete_wallet_inner and hold it across the full pipeline so Immediate stores block on flush_inner's own self.conn(). For Manual mode (where store doesn't take conn at all), do a post-commit second drain of the buffer to discard any racing-buffered changeset — the wallet is being removed, those writes are intentionally void. run_auto_backup already accepted &Connection so no signature changes were required. Regression: tests/sqlite_delete_wallet::concurrent_store_does_not_resurrect_deleted_wallet spawns a worker hammering store() while the main thread deletes, commits any remaining writes, asserts every PER_WALLET_TABLES row count for that wallet_id is zero. Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…e / typed AssetLockEntryMismatch (CMT-006/007)
CMT-006: blob::decode used bincode's NoLimit config, so an attacker-
controlled length prefix inside a Vec/String/map could drive the
allocator BEFORE the trailing-byte gate ever fired — a crafted backup
that passed restore_from's integrity check could OOM the host on the
next load(). Switch encode + decode to .with_limit::<16 MiB>() (well
above any legitimate per-row payload) and surface the cap-exceeded
case as a NEW typed variant `WalletStorageError::BlobTooLarge {
len_bytes, limit_bytes }` so operators can distinguish hostile/
corrupt oversize from a structural decode failure. Generic decode
failures (trailing bytes, schema mismatch) still surface as the
existing `BlobDecode` per user preference for precise variants.
CMT-007: asset_locks::decode_row decoded the typed outpoint from
op_bytes but then constructed TrackedAssetLock using the BLOB's
out_point + account_index. A torn write or restored corruption that
diverged blob-vs-typed would silently mis-bucket the lock under the
wrong account. Add a cross-check symmetric with IdentityKeyEntryMismatch
under a NEW typed variant `WalletStorageError::AssetLockEntryMismatch {
typed_outpoint, blob_outpoint, typed_account_index, blob_account_index }`.
is_transient() + error_kind_str() updated for both new variants — the
match remains wildcard-free per the existing #[non_exhaustive]-rejection
note in error.rs.
Regressions:
- src/sqlite/schema/blob.rs::tests::decode_rejects_oversize_blob_with_blob_too_large
- tests/sqlite_hardening_3625::asset_lock_typed_vs_blob_mismatch_rejected
Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…inner blob (CMT-009)
IdentityKeyWire::into_entry decoded public_key_bincode with
bincode::decode_from_slice but bound the consumed count to `_`. A
corrupt or forward-incompatible inner payload whose IdentityPublicKey
prefix is valid would deserialize successfully even with garbage past
the typed length — leaving identity_keys weaker than every other blob
path. The outer blob::decode already enforces the same gate.
Capture consumed and surface BlobDecode("unexpected trailing bytes in
identity_keys.public_key_bincode") on mismatch (using the existing
blob_decode helper per user preference — trailing-bytes is the
canonical BlobDecode case, not a typed-variant case).
Regression: src/sqlite/schema/identity_keys::tests::into_entry_rejects_trailing_bytes_in_public_key_bincode
Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…te (CMT-011) list_active was byte-for-byte identical to load_state — same SQL, same decode loop. Per the docstring's own admission, every persisted status is considered active (consumed locks leave via AssetLockChangeSet::removed), so the missing WHERE clause was intentional but made the two functions semantically indistinguishable. Option (a) per user preference: drop list_active, migrate the one caller (tests/sqlite_persist_roundtrip.rs:377) to load_state, refresh load_state's docstring to subsume the consumed-locks comment, and verify zero remaining references with grep. Also patches the wildcard-free match in tests/sqlite_error_classification.rs to cover the new AssetLockEntryMismatch / BlobTooLarge variants added in the prior CMT-006/007 commit (the exhaustiveness gate is what this test exists to enforce). Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…MT-012) The smoke test previously accepted any non-success via assert!(!out.status.success(), ...) so a runtime failure unrelated to unknown-subcommand handling (panic during clap setup, OOM, signal kill) would still pass. Assert the explicit clap usage code Some(2) before checking stderr content — costs nothing, tightens the contract. Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
- sqlite_delete_wallet.rs: drop a useless std::fs::Permissions::from(...). - sqlite_permissions.rs: silence clippy::field_reassign_with_default (consistent with the existing test files' top-level allow). Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…atch_only
load_from_persistor now rebuilds every persisted wallet watch-only from
its keyless AccountRegistrationEntry manifest (Wallet::new_watch_only)
and applies the keyless core-state projection on top. No seed material
is touched on the load path: signing keys are derived on demand later
through the MnemonicResolverHandle sign entrypoints, which carry the
fail-closed wrong-seed gate themselves.
Drops the SeedProvider port + WalletSecret/SecretPhrase/SecretSeed
payloads (and the storage CredentialStoreSeedProvider adapter that fed
them) — load no longer needs the abstraction. WrongSeedForDatabase
stays on PlatformWalletError for the sign-path gate. RT suite reshapes
to RT-WO (watch-only round-trip) + RT-Corrupt (per-row decode skip with
SkipReason::CorruptPersistedRow{kind: CorruptKind::MissingManifest}) +
RT-Z (no key material in any LoadOutcome / SkipReason surface).
apply_persisted_core_state and its F2/F3/F4 fixes are unchanged.
AR-7 residual risk on the load path is eliminated (no Wallet of a
signing type is constructed during load, so its Debug-leak surface is
gone from this path).
Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…stor platform_wallet_manager_load_from_persistor is now a 2-arg call (manager_handle, out_outcome). The Swift host never passed a real resolver at load time anyway — load is watch-only, signing keys are derived later on demand through the same MnemonicResolverHandle vtable the per-call sign entrypoints already use (next commit lands the wallet_id gate there). Drops the MnemonicResolverHandle → platform_wallet::SeedProvider adapter (rehydration_seed_provider.rs); no consumer left. LoadOutcomeFFI.SkippedWalletFFI.reason_code reshapes to the new CorruptKind family (100/101/102) — ABI-bump for #3692 since the seed- availability codes (0/1/2) it previously carried are gone with the seedless load path. Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…edless FFI platform_wallet_manager_load_from_persistor is now (handle, out_outcome) — the resolver argument is gone with the seedless-load rework. Pass nil for out_outcome (Swift doesn't surface skipped wallets yet; corrupt rows are logged on the Rust side). Doc string refreshed to reflect Wallet::new_watch_only as the underlying load primitive and the on-demand-signing + wrong-seed-gate contract on the resolver-fed sign entrypoints. Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
… sign-gate split The wrong-seed detection moves off the load path and onto the resolver-fed FFI sign entrypoints. That gate + its coverage now ships in PR #3735 (security patch against v3.1-dev), not here. Drop the dangling reference to the never-existed `sign_wrong_seed_gate.rs` file and point readers at PR #3735 instead. Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…m-wallet-rehydration
… staging (ATOM-004 / A-1) The pre-A-1 path eagerly created an empty SQLite file at `dest` via `OpenOptions::create_new`, so any later failure (chmod, Backup::new, run_to_completion) stranded a partial/empty .db at the caller-supplied path. For auto-backup callers that's retention-dir pollution; for the CLI's manual `backup_to(file_path)` it's junk at a user path with no cleanup. Stage the page-step copy into a `NamedTempFile` in `dest`'s parent (same pattern `restore_from` already uses). chmod 600 the temp BEFORE persist so the destination inherits owner-only mode via the atomic rename — no umask-default window. The temp drops naturally on any error before persist. Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…od temp before persist (ATOM-005/010 / A-2 / A-5) A-2: the try-lock arm probed the destination lock, dropped the handle immediately, then ran stages 4-7 (WAL/SHM unlink, stage, validate, persist) unlocked. A concurrent opener between probe and persist could race the rename and end up holding a fd against the unlinked old inode. Hold the exclusive lock guard across the entire restore body; emit a structured `tracing::warn` on filesystems where flock is unsupported instead of silently bypassing the safety net. A-5: chmod the temp file to 0o600 BEFORE persist so the destination inherits owner-only mode via atomic rename. Pre-A-5 the chmod ran post-persist — a rare chmod failure left the new DB live while the function returned Err, giving the caller a mixed result. N-6: lifts the atomicity-guarantee paragraph from inline comments into the public rustdoc of `restore_from`. Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…per-wallet errors (N-1) The old `commit_writes` iterated dirty wallets with `?` — the first per-wallet failure aborted the loop. Wallets ordered after stayed buffered with no caller-visible signal of which ones, leading to silent data left in memory after the operator thinks the commit ran. Approach (a) from the spec: return `Result<CommitReport, _>` carrying `succeeded` / `failed` / `still_pending` per-wallet outcomes. The trait `PlatformWalletPersistence` is untouched — `commit_writes` is a `SqlitePersister`-specific method (not in the trait, no FFI consumer) so the change is local to in-crate callers. A `LockPoisoned` short-circuit on the connection mutex still bails early, shoveling the remaining wallets into `still_pending` rather than attempting flushes that would all hit the same dead mutex. Includes regression test ATOM-006 that primes 3 dirty wallets, forces A to fail fatally, and verifies B + C still flush and the report captures the partition correctly. Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…Manual mode (N-2 / ATOM-007) A `SqlitePersister` dropped in `FlushMode::Manual` silently destroyed every dirty wallet's buffered changeset — the trait has no `commit_writes` and the buffer dies with the persister. The footgun surfaced nowhere in the operator logs. Add `impl Drop` that emits `tracing::error!` with `dirty_wallets` and `total_fields` structured fields when dropped Manual-mode with a non-empty buffer. Do NOT auto-flush — flushing can fail, Drop can't propagate, and a swallow on flush would be a worse failure mode than the loud log. Immediate-mode persisters skip the check entirely (every `store` is durable, so the buffer is always empty at drop). Regression tests assert both the Manual-mode log emits AND the Immediate-mode path stays silent. Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…SQLite codes (ATOM-008 / A-4) Pre-A-4 only `DatabaseBusy` / `DatabaseLocked` were classified as transient. Operational disk-full (`SQLITE_FULL`), I/O failures (`SQLITE_IOERR`), and memory pressure (`SQLITE_NOMEM`) are also recoverable environmental conditions — the operator clears disk / the kernel finishes its blip / memory frees up — and treating them as fatal drops buffered state for a problem that resolves itself. Add `DiskFull`, `SystemIoFailure`, `OutOfMemory` to the transient set. Extend `error_kind_str` with matching tags (`sqlite_disk_full` / `sqlite_io_failure` / `sqlite_out_of_memory`) so operators can grep production logs. Outer match on `WalletStorageError` stays wildcard-free; inner `ErrorCode` match uses `_` because the upstream type is `#[non_exhaustive]`. Updates the classification test with one sample per new code. Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…ors (ATOM-011 / A-6) Pre-A-6 the loop did `remove_file(&path)?` so the 4th of 7 files failing left 1-3 gone, 4 errored, 5-7 untouched, and the caller saw an io::Error with no record of partial progress. The `INTENTIONAL(CODE-007)` comment acknowledged the limitation; we now address it. Add `PruneReport::failed_removals: Vec<(PathBuf, io::Error)>` and collect per-file failures instead of bailing. Catastrophic errors (`read_dir` itself fails) still surface as `Err`. CLI surfaces failed removals to stderr and exits non-zero so scripts catch partial success. Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…pen (ATOM-013 / A-8)
Pre-A-8 `open()` ran migrations against the live DB without first
asserting integrity. Bit-rot or escaped-WAL corruption gets migrated,
and the pre-migration auto-backup snapshots the corrupt state —
making rollback useless.
Add a `PRAGMA integrity_check` between `apply_pragmas` and the
migration run, gated on `had_schema_history` (a freshly-created file
has no btree pages to verify). Surfaces as the typed
`IntegrityCheckFailed { report }` (variant already existed for the
restore_from path) before any schema mutation lands. Promotes
`run_integrity_check` to `pub(crate)` so the open path shares the
same helper as restore_from.
Includes regression test that opens a fresh DB, mutates btree pages
past the SQLite header, closes, and re-opens — assertion that the
typed `IntegrityCheckFailed` surfaces.
Co-Authored-By: Claudius the Magnificent (1M context) <noreply@anthropic.com>
…lic quiesce against in-flight clear (thepastaclaw 3480216821/3480287466) Two remaining thepastaclaw findings on the 5d569459 per-key clearing latch addressed together. ## (1) ClearingGuard set-membership → refcount Backing store changes from `Mutex<BTreeSet<K>>` to `Mutex<BTreeMap<K, NonZeroUsize>>`. `hold_clearing` now increments a per-key holder count (or inserts `NonZeroUsize::new(1)`); the guard's `Drop` decrements and removes the entry only when the count reaches zero. Why: nested or concurrent holders for the same key previously released the latch on the FIRST drop (set semantics — `insert` of an existing key was a no-op, `remove` unconditionally removed). Two concurrent `shielded_clear` invocations on the same handle (reachable through `HandleStorage::with_item`'s read lock) would silently lapse one clear's protection when the other's guard dropped. Refcount composes the holders so the latch stays raised until the LAST guard drops. The previous test `hold_clearing_nested_guards_share_the_latch` documented the lapse via comment but never asserted the lapse — thepastaclaw correctly flagged that it would pass under any drop-only implementation. Renamed to `hold_clearing_inner_drop_does_not_lapse_outer_protection` and rewritten to assert non-vacuously: with both guards alive a start is refused; after `drop(inner)` while `outer` is still alive, `is_clearing` MUST still return true AND a fresh `start_clean` MUST still be refused; only after `drop(outer)` does the latch fully release and a start succeed. Pre-refactor this test fails on both the `is_clearing` assertion and the post-inner-drop `is_running` assertion. `Drop` guards against an underflow (debug_assert) if the entry is missing — defensive against out-of-band removal that no current call site does, but cheap to encode. ## (2) Public `quiesce()` gated against the clearing latch `CoordinatorLifecycle::quiesce` constructs a local `AtomicFlagGuard` on the same `quiescing` atomic that `clear_shielded_inner` holds raised continuously via `hold_quiescing_gate`. The guard's `Drop` would unconditionally lower the gate — lapsing the clear's "no new pass" barrier and letting a direct `sync_now`/`sync_wallet` slip past `begin_pass` to re-persist into the wiping store. Mirror the `spawn_periodic_loop` fix from 432309b: bail at the start of `quiesce` if `registry.is_clearing(self.worker)` returns true. Return `WorkerStatus::NotRunning` — the in-flight clear is already cancel-joining the worker under its own gate, so there is no new work for `quiesce` to do. Today only `ShieldedSyncManager::quiesce` is reachable via `shielded_sync_arc`, but the carry-forward concern is that the trap is one accidental host call away. Regression test `shielded_quiesce_during_clear_preserves_quiescing_gate` raises both guards `clear_shielded_inner` raises (latch + gate), calls `shielded_sync_manager.quiesce()`, asserts the call returns `NotRunning` AND the gate stays raised. Pre-fix this fails on the gate assertion. Verified post-fix: - cargo fmt --all --check clean. - cargo clippy --workspace --features shielded --all-targets -- -D warnings clean. - dash-async lib: 35 passed (renamed test stays in place). - platform-wallet --features shielded: 321 passed (was 320; +1 quiesce regression). - platform-wallet --lib default: 217 passed (unchanged). Closes the deferred thepastaclaw 3480216821 (refcount suggestion), 3480216826 (test-doesn't-exercise nitpick — the renamed test now does), and 3480287466 (prior-3 carry-forward, public quiesce vs. held gate). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01BcZFae6ABdoMEvCBQZnaJF
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Carried forward from prior review (still valid at bcf9388): prior-2 (blocking — Swift clearLocalState still wipes SwiftData on ErrorShutdownIncomplete), prior-3 (suggestion — public quiesce() drops AtomicFlagGuard on the shared atomic), prior-4 (suggestion — ClearingGuard uses set membership), prior-5 (nitpick — nested-guards test doesn't exercise the documented lapse). prior-1 is partially addressed: the new is_clearing short-circuit + SEC-002 regression test close the common race, but the check and the gate-write are still two distinct critical sections, leaving a narrow residual TOCTOU window — carried forward as a suggestion. No new defects identified in the latest delta (rs-platform-wallet-storage / secrets / keyless rehydration look careful: fail-closed topology check, transactional rollback, per-row skip taxonomy, structured FFI outcome).
🔴 1 blocking | 🟡 3 suggestion(s) | 💬 1 nitpick(s)
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift`:
- [BLOCKING] packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift:608-616: [prior-2, carried forward] `clearLocalState` wipes SwiftData rows even when `clearShielded()` throws `ErrorShutdownIncomplete` — violates the documented FFI contract
The Rust FFI contract for `ErrorShutdownIncomplete` on `platform_wallet_manager_shielded_clear` (rs-platform-wallet-ffi/src/error.rs around the `ShieldedShutdownIncomplete → ErrorShutdownIncomplete` mapping, plus shielded_sync.rs:454) is explicit: when the manager surfaces this code, the Rust shielded store was left intact because Clear aborted before touching it (drain non-clean / orphan-alive backstop). The host must retry the clear and must NOT commit its own persistence wipe; doing so desyncs host rows from the still-populated shared tree. ShieldedService.swift:608-616 catches the throw, logs via `SDKLogger.error`, then falls through unconditionally into the SwiftData delete block at 623-628 (`PersistentShieldedNote`, `PersistentShieldedOutgoingNote`, `PersistentShieldedSyncState`, `PersistentShieldedActivity`). The comment at 606-607 ("Best-effort — failure logs but doesn't abort the wipe") codifies the exact split-brain the FFI contract was designed to prevent. The Swift surface already distinguishes this code: `PlatformWalletResultCode.errorShutdownIncomplete` exists (PlatformWalletResult.swift:47) and decodes into `PlatformWalletError.shutdownIncomplete(detail)` (line 240), so the host can branch on it. Minimum fix: on the typed `shutdownIncomplete` case (or any thrown error from `clearShielded()`, since the Rust contract is fail-closed there), short-circuit the SwiftData wipe and surface the error to the user instead of proceeding.
In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:137-148: [prior-1, partial fix landed] Residual TOCTOU between `is_clearing` check and `reopen_quiescing_gate`
The new `is_clearing(self.worker)` short-circuit at line 145 closes the common race (clear is already holding the latch when spawn arrives), and the SEC-002 regression test in mod.rs:1071-1108 covers that ordering. But the check (line 145) and the gate-lower (line 148) are not done under a single critical section: `is_clearing` takes `lock_clearing`, returns the boolean, releases the lock; then `reopen_quiescing_gate` writes `false` to the shared atomic. If `clear_shielded_inner` acquires the latch AND raises the quiescing gate between those two points, the reopen at 148 lowers a gate Clear believes it holds continuously across drain → liveness → wipe. In that window a direct `sync_now`/`sync_wallet` going through `begin_pass` (lines 306-341) observes `quiescing == false`, claims the `is_syncing` slot, runs the pass to completion, and persists into the store about to be wiped. `quiesce_under_held_gate`'s `debug_assert!` (line 230) catches the inconsistency only in debug builds; release silently proceeds with the drain. The window is microseconds-narrow, but the design invariant is continuous gate-hold. Structural fix: have the registry expose an atomic observe-and-act under one `lock_clearing()` scope (e.g. `with_clearing_check`) so the gate write is sequenced with the latch read, or move the gate write inside the registry slot critical section alongside the latch check.
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:207-214: [prior-3, carried forward] Public `quiesce()` drops an `AtomicFlagGuard` on the same shared atomic that `clear_shielded` holds raised
`CoordinatorLifecycle::quiesce` constructs `_quiescing_gate = AtomicFlagGuard::new(&self.quiescing)` over the same `Arc<AtomicBool>` that `clear_shielded_inner` holds raised via `hold_quiescing_gate`. `AtomicFlagGuard::Drop` unconditionally writes `false`, so any concurrent caller routed through the public `quiesce()` while a clear is in flight will lower the clear-flow's continuously-held gate. `ShieldedSyncManager::quiesce` remains `pub` and reachable through the manager's shielded accessor — one accidental call away from re-opening the same lapse Clear was hardened to prevent. Structural fix: refcount/owner-aware re-raise (so concurrent owners compose), or gate the public `quiesce` against the clearing latch and assert no other holder is active before its guard drop is allowed to fire.
In `packages/rs-dash-async/src/registry.rs`:
- [SUGGESTION] packages/rs-dash-async/src/registry.rs:596-613: [prior-4, carried forward] `ClearingGuard` uses set-membership — overlapping guards release the latch on the first drop
`hold_clearing` does `lock_clearing().insert(key)` into a `BTreeSet<K>`, and `ClearingGuard::Drop` (lines 1099-1103) unconditionally removes the key. With two live guards for the same key (a nested or overlapping clear from a future caller, a test, or an accidental shielded re-clear before the first guard drops), the first drop removes the key while the second guard is still alive — silently lapsing the latch protection the still-live guard is supposed to provide. The new `is_clearing()` check (line 620) inherits the same accounting weakness because it reads the same set membership. The test at lines 1859-1883 acknowledges this in its comment but does not fix the type-level mismatch between the RAII guard's compositional appearance and its set-based semantics. Minimum fix: refcount per key (`BTreeMap<K, NonZeroUsize>`), or add a `debug_assert!` that the insert was novel so accidental nesting fails loudly in debug.
Note: GitHub refused to serve this PR's diff via gh pr diff because it exceeds the 20,000-line limit, so this exact-SHA review is posted body-only. Existing inline threads are reused rather than duplicated.
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Latest delta resolves prior-4 and prior-5: ClearingGuard now refcounts via BTreeMap<K, NonZeroUsize> so nested/concurrent guards compose, and the new test hold_clearing_inner_drop_does_not_lapse_outer_protection is non-vacuous and asserts the lapse window. prior-2 (Swift ShieldedService wipe-on-error) remains unchanged and blocking. prior-3 is partially addressed by the new is_clearing bail in public quiesce(), but the same check-then-act TOCTOU pattern from prior-1 now exists at both spawn_periodic_loop (lines 145-148) and quiesce (lines 217-224); collapsed into a single structural finding.
🔴 1 blocking | 🟡 1 suggestion(s)
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift`:
- [BLOCKING] packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift:608-616: [prior-2, carried forward] clearLocalState wipes SwiftData rows even when clearShielded() throws — violates the FFI contract
Unchanged at 6b655aab. The Rust FFI contract for `platform_wallet_manager_shielded_clear` surfaces `ErrorShutdownIncomplete` specifically when the Rust shielded store was left intact (Clear aborted before touching it via the drain/orphan-alive backstop). The host MUST NOT commit its own persistence wipe in that case, or SwiftData rows desync from the still-populated shared tree — the exact split-brain this stack is designed to prevent.
Lines 608-616 catch the throw, log via `SDKLogger.error`, then fall through unconditionally into the SwiftData delete block at 623-628 (`PersistentShieldedNote`, `PersistentShieldedOutgoingNote`, `PersistentShieldedSyncState`, `PersistentShieldedActivity`). The comment at 606-607 ("Best-effort — failure logs but doesn't abort the wipe") codifies the wrong invariant. The Swift surface already distinguishes this code path: `PlatformWalletResultCode.errorShutdownIncomplete` decodes into `PlatformWalletError.shutdownIncomplete(detail)`, so the host can branch on it. Minimum fix: on any thrown error from `clearShielded()` (since the Rust contract is fail-closed there), short-circuit the SwiftData wipe and surface the error to the user.
In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:145-224: [prior-1 + residual prior-3] Check-then-act TOCTOU around the shared `quiescing` atomic at both `spawn_periodic_loop` and `quiesce` call sites
Same structural pattern at two call sites — neither performs the `is_clearing` check atomically with the subsequent write to the shared `quiescing` atomic.
1. `spawn_periodic_loop` (145-148): `is_clearing(self.worker)` at 145 takes `lock_clearing`, returns the boolean, releases the lock; `reopen_quiescing_gate()` at 148 then writes `false` to the shared atomic. If `clear_shielded_inner` acquires the clearing latch AND raises the gate between those two operations, this `reopen_quiescing_gate()` lowers a gate Clear believes it holds continuously across drain → liveness → wipe — opening a window for a direct `sync_now`/`sync_wallet` to pass `begin_pass` and persist into the store about to be wiped (even though `start_thread` itself will subsequently be refused by the latch).
2. `quiesce` (217-224): the new fix to bail when `is_clearing` is true closes the dominant case (clear already holds the latch), but introduces the same race shape — `is_clearing` at 217 → unconditional `self.quiescing.store(true, ...)` + `AtomicFlagGuard::new(&self.quiescing)` at 223-224, whose `Drop` (`atomic.rs:25-29`) unconditionally writes `false`. If `clear_shielded_inner` acquires the latch + `hold_quiescing_gate` in the sub-microsecond window between 217 and 223, the racing quiesce installs an `AtomicFlagGuard` over the same atomic the in-flight clear is holding raised; on the quiesce's return the guard drops and lowers the gate, lapsing Clear's continuously-held barrier. The new regression test `shielded_quiesce_during_clear_preserves_quiescing_gate` only exercises the ordering where the clearing latch is acquired BEFORE the quiesce arrives.
Structural fix (resolves both call sites): expose an atomic observe-and-act on the registry under one `lock_clearing()` scope (e.g. a `with_clearing_check(F)` callback that runs F only while the clearing-set lock is held), or move the gate write inside the registry slot's critical section alongside the latch check. Alternatively, refcount the `quiescing` atomic so concurrent owners compose like `ClearingGuard` now does post-prior-4.
Note: Posted body-only because normal inline poster failed: command failed 1: python3 scripts/review_poster.py dashpay/platform 3954 6b655aa --dry-run.
…builds PR #3954 review thread #5 (comment 3481707091) asked for runtime visibility when the registry's reap path runs under `panic = "abort"`: an EpilogueGuard or AtomicFlagGuard panic during teardown aborts the process instead of unwinding, so Drop never releases the orphan-liveness gate and the registry's join-and-orphan-liveness invariant can stay held forever in the worst case. Stable Rust has NO runtime API to query the active panic strategy — only `#[cfg(panic = "...")]` at compile time. iOS release builds intentionally choose `panic = "abort"`, so a hard gate (compile-time error or refusal to construct) is wrong. A compile-time-gated, one-shot `tracing::warn!` from `ThreadRegistry::with_reap_backstop` is a strict improvement over the doc-caveat-only state today: operators get a single audit-trail line at startup pointing them at the registry.rs / atomic.rs caveats. The warn is gated by a process-wide `std::sync::Once` so repeated registry construction (tests, restart paths, multi-instance hosts) fires it once. The static is hoisted to module scope (also `#[cfg(panic = "abort")]`) so the in-crate regression test can probe `Once::is_completed()` directly without subscribing to tracing from a `#[test]`. Doc caveats on `EpilogueGuard` (registry.rs) and `AtomicFlagGuard` (atomic.rs) are extended with a one-liner pointing operators at the new constructor-emitted warn. Tests: - `with_reap_backstop_emits_panic_abort_warn_under_abort_builds` (only compiled under abort) — proves the Once trips on first call and stays one-shot across a second construction. Build-cfg-pinned: if the cfg block in `with_reap_backstop` is ever deleted, this test disappears with it and CI loses the signal. - `with_reap_backstop_no_warn_under_unwind` (only compiled under unwind) — sentinel proving the no-op cfg branch compiles and `with_reap_backstop` keeps behaving like a plain constructor on the default profile. Verification: - `cargo test -p dash-async --all-targets` — 36 passed (was 35). - `cargo clippy -p dash-async --all-targets -- -D warnings` — clean. - `cargo fmt --all -- --check` — clean. - `cargo check -p platform-wallet -p platform-wallet-ffi --all-targets` — clean. - `RUSTFLAGS="-C panic=abort" cargo build -p dash-async --lib` — clean. (`--tests` requires nightly `-Zpanic_abort_tests`; lib path covers the new cfg branch.) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01BcZFae6ABdoMEvCBQZnaJF
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Carried-forward prior findings dominate this incremental review. prior-2 (Swift host wipes SwiftData rows on a failed Rust shielded clear) is STILL VALID and remains blocking — the catch at ShieldedService.swift:611-615 still logs-and-falls-through into the SwiftData delete block, contradicting the Rust FFI's fail-closed ErrorShutdownIncomplete contract. prior-1+prior-3 (TOCTOU between is_clearing and the shared quiescing atomic at both spawn_periodic_loop:145-148 and quiesce:217-224) is STILL VALID as a suggestion. The latest delta (e46249f) only adds a one-shot panic=abort startup warn on with_reap_backstop, doc back-references on AtomicFlagGuard/EpilogueGuard, and two cfg-pinned tests — verified correct, no new defects.
🔴 1 blocking | 🟡 1 suggestion(s)
Findings not posted inline (2)
These findings could not be anchored to the current diff, but they are still part of this review.
- [BLOCKING]
packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift:608-616: [prior-2, carried forward] clearLocalState wipes SwiftData rows even when clearShielded() throws — violates the fail-closed FFI contract — Unchanged at e46249f. The Rust FFI forplatform_wallet_manager_shielded_clearsurfacesErrorShutdownIncompletespecifically when the Rust shielded store was left intact (Clear aborted before touching it via the drain / orphan-alive backstop). The host must not commit its own persistence wipe... - [SUGGESTION]
packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:145-224: [prior-1+prior-3, carried forward] Check-then-act TOCTOU betweenis_clearingand the sharedquiescingatomic at bothspawn_periodic_loopandquiesce— Unchanged at e46249f. Same structural pattern at two call sites — neither performs theis_clearingcheck atomically with the subsequent write to the sharedquiescingatomic. 1.spawn_periodic_loop(145-148):is_clearing(self.worker)at 145 takeslock_clearing, returns the boolean, rele...
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift`:
- [BLOCKING] packages/swift-sdk/SwiftExampleApp/SwiftExampleApp/Core/Services/ShieldedService.swift:608-616: [prior-2, carried forward] clearLocalState wipes SwiftData rows even when clearShielded() throws — violates the fail-closed FFI contract
Unchanged at e46249fe. The Rust FFI for `platform_wallet_manager_shielded_clear` surfaces `ErrorShutdownIncomplete` specifically when the Rust shielded store was left intact (Clear aborted before touching it via the drain / orphan-alive backstop). The host must not commit its own persistence wipe in that case, or the SwiftData rows desync from the still-populated shared tree — the exact split-brain this stack is engineered to prevent.
Lines 608-616 catch the throw, log via `SDKLogger.error`, then fall through unconditionally into the SwiftData delete block at 623-628 (`PersistentShieldedNote`, `PersistentShieldedOutgoingNote`, `PersistentShieldedSyncState`, `PersistentShieldedActivity`). The comment at 606-607 ("Best-effort — failure logs but doesn't abort the wipe") codifies the wrong invariant for a fail-closed FFI. The Swift surface already distinguishes this code path: `PlatformWalletResultCode.errorShutdownIncomplete` decodes into `PlatformWalletError.shutdownIncomplete(detail)`, so the host can branch on it.
Minimum fix: on any thrown error from `clearShielded()`, short-circuit the SwiftData wipe, set `lastError` to a retry-actionable message, and return so the user can retry Clear once Rust quiesces.
In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:145-224: [prior-1+prior-3, carried forward] Check-then-act TOCTOU between `is_clearing` and the shared `quiescing` atomic at both `spawn_periodic_loop` and `quiesce`
Unchanged at e46249fe. Same structural pattern at two call sites — neither performs the `is_clearing` check atomically with the subsequent write to the shared `quiescing` atomic.
1. `spawn_periodic_loop` (145-148): `is_clearing(self.worker)` at 145 takes `lock_clearing`, returns the boolean, releases the lock; `reopen_quiescing_gate()` at 148 then writes `false` to the shared atomic. If `clear_shielded_inner` acquires the clearing latch AND raises the gate between those two operations, this `reopen_quiescing_gate()` lowers a gate Clear believes it holds continuously across drain → liveness → wipe — opening a window for a direct `sync_now`/`sync_wallet` to pass `begin_pass` and persist into the store about to be wiped (even though `start_thread` itself will subsequently be refused by the latch).
2. `quiesce` (217-224): the new `is_clearing` early-return closes the dominant case (clear already holds the latch), but introduces the same race shape — `is_clearing` at 217 → unconditional `self.quiescing.store(true, ...)` + `AtomicFlagGuard::new(&self.quiescing)` at 223-224, whose Drop (`atomic.rs:25-29`) unconditionally writes `false`. If `clear_shielded_inner` acquires the latch + `hold_quiescing_gate` in the sub-microsecond window between 217 and 223, the racing quiesce installs an `AtomicFlagGuard` over the same atomic the in-flight clear is holding raised; on the quiesce's return the guard drops and lowers the gate, lapsing Clear's continuously-held barrier. The new regression test `shielded_quiesce_during_clear_preserves_quiescing_gate` only exercises the ordering where the clearing latch is acquired BEFORE the quiesce arrives.
Structural fix (resolves both call sites): expose an atomic observe-and-act on the registry under one `lock_clearing()` scope (e.g. a `with_clearing_check(F)` callback that runs F only while the clearing-set lock is held), or move the gate write inside the registry slot's critical section alongside the latch check. Alternatively, refcount the `quiescing` atomic so concurrent owners compose like `ClearingGuard` now does post-prior-4.
… clearLocalState (thepastaclaw 4580054386 prior-2)
Surface PlatformWalletError.shutdownIncomplete (and any other clearShielded
throw) to lastError and return from clearLocalState before reaching the
SwiftData delete block. The Rust FFI is fail-closed: a thrown error means
the Rust shielded store is still populated (drain timeout, orphan-alive
backstop, or similar), so wiping the SwiftData mirror unconditionally
produced a split-brain — UI showed empty while Rust state lived on. The
docstring at 606-607 also codified the wrong invariant ("Best-effort")
and is reworded to describe the fail-closed semantics.
…is_clearing↔gate TOCTOU (thepastaclaw 4580054386 prior-1+prior-3) Replace the single-holder AtomicBool `quiescing` flag with an AtomicUsize refcount, wrapped by a new RefcountedFlagGuard that composes safely across the two paths that raise it: `quiesce()`'s local guard and the Clear flow's `hold_quiescing_gate`. The prior design had a TOCTOU between `is_clearing()` (lock-release) and the subsequent `store(true)` + AtomicFlagGuard install — a clear that acquired the latch and raised the gate in that ns-window had its barrier silently lowered on the racing quiesce's return, because AtomicFlagGuard::Drop unconditionally writes false. The refcount eliminates that lower-past-other-holder lapse: each party owns its own ref, Drop decrements, the gate is "raised" iff count > 0. Pair the drain_hook's fetch_add with a matching fetch_sub in cancel_join_and_drain so the hook's one-shot raise doesn't leak across coordinator restarts (this removes the need for reopen_quiescing_gate, which has been deleted). Add a regression test that fails against the bool implementation.
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Both prior findings are FIXED at f424caf: the Swift fail-closed path now early-returns before the SwiftData wipe when clearShielded() throws (ShieldedService.swift:618), and the is_clearing↔quiescing TOCTOU is closed by switching to a refcounted Arc with composable RefcountedFlagGuard. One new blocking issue remains: the drain_hook's unconditional fetch_add is paired with a fetch_sub that only fires when registry.quiesce() returns a non-NotRunning status AND when cancel_join_and_drain runs to completion. Both agents independently flag this asymmetry; codex via the production-reachable timeout cancellation path that clear_shielded_inner wraps around quiesce_under_held_gate, claude via the registry's Superseded / inner-NotRunning return sites. Either path leaks a +1 into quiescing, leaving the shielded coordinator permanently gated so every future begin_pass bails.
🔴 1 blocking
Findings not posted inline (1)
These findings could not be anchored to the current diff, but they are still part of this review.
- [BLOCKING]
packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:275-292: drain_hook's fetch_add is not cancellation-safe and not symmetric with all of registry.quiesce's return paths — quiescing can leak +1 and permanently gate the shielded coordinator —drain_hookunconditionally doesquiescing.fetch_add(+1)(line 128) and is paired with a single conditionalfetch_sub(-1)here, gated onstatus != WorkerStatus::NotRunning. That pairing assumes a strict invariant — "the registry fires the hook iff it eventually returns a non-NotRunning sta...
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:275-292: drain_hook's fetch_add is not cancellation-safe and not symmetric with all of registry.quiesce's return paths — quiescing can leak +1 and permanently gate the shielded coordinator
`drain_hook` unconditionally does `quiescing.fetch_add(+1)` (line 128) and is paired with a single conditional `fetch_sub(-1)` here, gated on `status != WorkerStatus::NotRunning`. That pairing assumes a strict invariant — "the registry fires the hook iff it eventually returns a non-NotRunning status, and the wallet-side future always runs to completion past line 285." Neither half of that invariant holds:
1. **Cancellation / timeout path (production-reachable).** `clear_shielded_inner` (manager/mod.rs:412-419) wraps `quiesce_under_held_gate()` in `tokio::time::timeout`. A wedged shielded background pass can keep the inner future suspended past line 702 of `registry.quiesce` (drain hook has already been awaited and `fetch_add(+1)` has run) while it polls the join loop. When the outer timeout elapses, the future is dropped, so the matching `fetch_sub` at line 286 never runs. The `_clearing_gate` in `clear_shielded_inner` then drops, releasing only its own contribution. Net result: `quiescing` stays at +1 for the lifetime of this `CoordinatorLifecycle`, so every subsequent `begin_pass` bails on the `quiescing.load() > 0` check at line 378. The shielded coordinator is silently and permanently quiesced — manual `sync_now`, `sync_wallet`, and the background loop all stop doing work — even though the host application has handled the timeout and moved on.
2. **Superseded / inner-NotRunning return paths inside `registry.quiesce`.** The early exit at registry.rs:693 (cancel and handle both absent → drain hook never fires) is safe under the current condition. The two NotRunning returns *inside* the join loop are not: `Step::Superseded` (registry.rs:760/777) and the inner `Step::NotRunning` (registry.rs:763/777) both happen *after* the hook has already fired at line 702, but return `WorkerStatus::NotRunning`, so the conditional `fetch_sub` is skipped — same leak. Today these branches are unreachable in the wallet because the `closing` / `hold_clearing` latches prevent same-key restart racing `quiesce`, but the comment at lines 277-284 documents an invariant stricter than what `registry.quiesce` actually guarantees. The PR description explicitly anticipates rs-dapi adopting `ThreadRegistry`; any future consumer that calls `quiesce()` without an external restart-prevention mutex inherits a silent gate leak.
The fundamental issue is the asymmetric ownership of the refcount: the registry's `drain_hook` raises the gate, but the wallet has to release it from outside the registry, with no signal for "did the hook actually fire on this call." The cleanest fixes are (a) remove the `drain_hook` contribution to `quiescing` entirely — the public `quiesce()` already raises the gate manually before calling `cancel_join_and_drain` (line 236), and `quiesce_under_held_gate` requires the caller to hold one, so the hook's raise is redundant; or (b) restructure so the hook's contribution is held by a cancellation-safe guard the registry hands back to the caller via `quiesce`'s return value, so Drop on any path (including future cancellation) releases the hook's ref. Either way, the conditional decrement here is unsafe and the comment overstates the registry's contract.
…meout cancellation (thepastaclaw 4580546708) The prior commit (f424caf) introduced an asymmetric refcount on `quiescing`: drain_hook did fetch_add inside an async block, while the matching fetch_sub lived in cancel_join_and_drain past the registry await. Under clear_shielded_inner's tokio::time::timeout wrap, a timeout firing mid-drain dropped the cancel_join_and_drain future after the hook synchronously ran fetch_add but before fetch_sub could execute — leaking the ref permanently and silently disabling the shielded coordinator for the rest of the process (every begin_pass bails on the `> 0` check). The hook was redundant: CoordinatorLifecycle::quiesce already raises its own ref via RefcountedFlagGuard::raise before cancel_join_and_drain, and quiesce_under_held_gate requires the caller to hold one. Both RefcountedFlagGuards are cancellation-safe (Drop runs on every exit path including a future dropped at an outer await). The lifecycle's own docstring already documented that it raises its own gate independently because registry.quiesce can early-return without firing the hook. The system-wide shutdown path (registry.shutdown iterating registry.quiesce) no longer gets a gate-raise; acceptable because that path is the destroy contract — direct sync_now during a destroy drain is not a real concern. Add regression test that fails against the leak and passes with the fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01BcZFae6ABdoMEvCBQZnaJF
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Latest delta correctly resolves prior-1 by removing the registry-side drain_hook (worker_config now returns drain: None) and relying on the lifecycle's cancellation-safe RefcountedFlagGuard. The new regression test (quiesce_cancellation_during_drain_does_not_leak_quiescing_refcount) is non-vacuous. New issue introduced by the same delta: PlatformWalletManager::shutdown() goes through registry.shutdown() -> registry.quiesce(key), never through CoordinatorLifecycle::quiesce(), so the per-coordinator quiescing gate is no longer raised at all during manager shutdown. The pre-fix drain hook used to provide that barrier; both reviewers independently flag its loss. The mod.rs:455 docstring is now stale relative to the implementation, and the prior race window where a concurrent host-issued sync_now/sync_wallet can slip past begin_pass during destroy is reopened.
🔴 1 blocking
Carried-forward prior findings
- Prior f424caf blocker is FIXED: FIXED at commit 36aa3a3 via the recommended fix (a) from the prior review:
CoordinatorLifecycle::worker_config()now returnsdrain: None(coordinator_lifecycle.rs:111-117), so no registry-sidefetch_addis ever paired with a wallet-side conditionalfetch_sub. The gate is now raised exclusively by the lifecycle's ownRefcountedFlagGuard—quiesce()'s local guard at line 221 and the Clear flow'shold_quiescing_gate— both of whoseDropis cancellation-safe (runs on future-drop unwind). New non-vacuous regression testquiesce_cancellation_during_drain_does_not_leak_quiescing_refcount(lines 613-711) aborts aquiesce()future mid-drain while a background pass is parked instd::thread::sleepand asserts the refcount returns to 0 — it would time out against the prior drain-hook design. Both the cancellation/timeout leak path and the Superseded/inner-NotRunning paths from the prior finding are eliminated by removing the asymmetric ownership entirely. (Note: the removal also eliminates the gate during PlatformWalletManager::shutdown() — see the new blocking finding above.)
New findings in the latest delta
In packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:
- [BLOCKING] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:99-117: Manager shutdown path no longer raises the quiescing gate — direct sync_now/sync_wallet can slip past begin_pass during FFI destroy
Settingworker_config().drain = Nonecorrectly closes the cancellation-safety hole flagged in prior-1, but it also removes the only mechanism by whichPlatformWalletManager::shutdown()raised the per-coordinatorquiescinggate.shutdown()(manager/mod.rs:487-489) callsself.registry.shutdown().await, which iterates workers and callsself.quiesce(key)on the registry directly (registry.rs:834-866). It never goes throughCoordinatorLifecycle::quiesce()(the only path that now raises the gate via the localRefcountedFlagGuardat line 221) orquiesce_under_held_gate()(the Clear flow's path). Withdrain: None,registry.quiesceruns no hook and the gate stays at 0 throughout the entire teardown.
Consequences:
-
Production race during FFI
destroy. The FFI bridge invokesPlatformWalletManager::shutdown(), which is meant to be the use-after-free barrier: cancel the periodic loops, drain any in-flight pass, then return so the host can free the persister and event-handler context. With the gate never raised, a concurrent host-issuedsync_now/sync_walleton another thread can passbegin_pass'squiescing > 0check (line 357), claimis_syncing, run its body after the registry has cancelled/joined the background loop, and callpersister.store(...)or fire a host completion callback afterdestroyreturned and the host freed the callback context. The pre-fix drain hook'sfetch_addran insideregistry.quiesceand held the gate up across each per-worker teardown, blocking exactly this race. That barrier is now gone for the shutdown path, even though it is still in place forlifecycle.quiesce()direct callers and the shielded Clear flow. -
Stale documented invariant. The mod.rs:455 docstring on
shutdown()still claims "Each worker's drain raises itsquiescinggate, cancels the loop, and joins its OS thread / task" and uses that to argue the use-after-free contract holds. The implementation no longer matches that contract: the docstring will mislead anyone (including future Claude) reasoning about destroy-time safety.
Fix options: (a) have PlatformWalletManager::shutdown() route per-coordinator teardown through each coordinator's lifecycle.quiesce() instead of (or in addition to) registry.shutdown(), so the lifecycle-owned RefcountedFlagGuard is in scope across the drain — its Drop is already cancellation-safe so this does not reintroduce prior-1's leak; (b) reintroduce a cancellation-safe drain hook (e.g. a hook that returns the RefcountedFlagGuard to be held by the registry across the same await frame as the join poll, dropped on every exit path including future cancellation), so registry-level teardown still raises the gate without the asymmetric fetch_add/fetch_sub of the original design. Either way, update the mod.rs:455 docstring so the documented destroy-time invariant matches what the code actually does.
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:
- [BLOCKING] packages/rs-platform-wallet/src/manager/coordinator_lifecycle.rs:99-117: Manager shutdown path no longer raises the quiescing gate — direct sync_now/sync_wallet can slip past begin_pass during FFI destroy
Setting `worker_config().drain = None` correctly closes the cancellation-safety hole flagged in prior-1, but it also removes the only mechanism by which `PlatformWalletManager::shutdown()` raised the per-coordinator `quiescing` gate. `shutdown()` (manager/mod.rs:487-489) calls `self.registry.shutdown().await`, which iterates workers and calls `self.quiesce(key)` on the registry directly (registry.rs:834-866). It never goes through `CoordinatorLifecycle::quiesce()` (the only path that now raises the gate via the local `RefcountedFlagGuard` at line 221) or `quiesce_under_held_gate()` (the Clear flow's path). With `drain: None`, `registry.quiesce` runs no hook and the gate stays at 0 throughout the entire teardown.
Consequences:
1. Production race during FFI `destroy`. The FFI bridge invokes `PlatformWalletManager::shutdown()`, which is meant to be the use-after-free barrier: cancel the periodic loops, drain any in-flight pass, then return so the host can free the persister and event-handler context. With the gate never raised, a concurrent host-issued `sync_now` / `sync_wallet` on another thread can pass `begin_pass`'s `quiescing > 0` check (line 357), claim `is_syncing`, run its body after the registry has cancelled/joined the background loop, and call `persister.store(...)` or fire a host completion callback after `destroy` returned and the host freed the callback context. The pre-fix drain hook's `fetch_add` ran inside `registry.quiesce` and held the gate up across each per-worker teardown, blocking exactly this race. That barrier is now gone for the shutdown path, even though it is still in place for `lifecycle.quiesce()` direct callers and the shielded Clear flow.
2. Stale documented invariant. The mod.rs:455 docstring on `shutdown()` still claims "Each worker's drain raises its `quiescing` gate, cancels the loop, and joins its OS thread / task" and uses that to argue the use-after-free contract holds. The implementation no longer matches that contract: the docstring will mislead anyone (including future Claude) reasoning about destroy-time safety.
Fix options: (a) have `PlatformWalletManager::shutdown()` route per-coordinator teardown through each coordinator's `lifecycle.quiesce()` instead of (or in addition to) `registry.shutdown()`, so the lifecycle-owned `RefcountedFlagGuard` is in scope across the drain — its Drop is already cancellation-safe so this does not reintroduce prior-1's leak; (b) reintroduce a cancellation-safe drain hook (e.g. a hook that returns the `RefcountedFlagGuard` to be held by the registry across the same await frame as the join poll, dropped on every exit path including future cancellation), so registry-level teardown still raises the gate without the asymmetric fetch_add/fetch_sub of the original design. Either way, update the mod.rs:455 docstring so the documented destroy-time invariant matches what the code actually does.
Note: Posted as a body-only review because GitHub refuses gh pr diff for this PR with PullRequest.diff too_large (>20,000 lines), so the normal inline mapper could not run.
… in shutdown() Dropping the registry drain_hook (worker_config().drain = None) closed the cancellation-safety leak but removed the only mechanism that raised the per-coordinator `quiescing` gate during PlatformWalletManager:: shutdown(). The race itself was unreachable via FFI (the handle-storage RwLock serializes destroy's `remove` against every sync_now/sync_wallet `with_item`), but the manager lost a self-contained invariant and the mod.rs shutdown docstring became false. shutdown() now quiesces the weight-0 coordinators first, concurrently, each via CoordinatorLifecycle::quiesce() whose RefcountedFlagGuard holds the gate raised across cancel -> join -> in-flight-pass drain. A concurrent direct sync_now/sync_wallet reaching begin_pass observes `quiescing > 0` and bails, so no new pass can claim is_syncing mid-teardown. The guard's Drop returns the refcount to 0 on every exit path (incl. await-cancellation), so no fetch_add/fetch_sub-across-await leak is reintroduced. The weight-10 event adapter that sinks the stores is torn down afterward via ThreadRegistry::shutdown(), preserving the ordering that keeps the store sink alive past the coordinator loops. Corrected the shutdown() docstring to describe the lifecycle-owned gate (not a registry drain hook) and made the multi-thread-runtime precondition explicit (ThreadRegistry::shutdown already asserts it; shutdown() now fails fast). Added regression test shutdown_raises_quiescing_gate_during_teardown asserting the gate is raised mid-teardown, a concurrent sync_now does no work, and the refcount returns to 0 (fails against the prior registry-only shutdown body). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01P47BQ3KtJ8YaS4o5gTBJdB
|
Blocking finding addressed in On the asserted UAF (refuted at HEAD, but the point stands): the concurrent direct But the mechanical observations were correct and worth fixing: with
|
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
The prior shutdown-gate blocker is fixed in the current head because manager shutdown now routes coordinator teardown through each lifecycle quiesce(). However, the new implementation releases those lifecycle gates before the registry shutdown latch is set, leaving a narrow but real post-quiesce race where direct syncs or restarts can enter after the coordinator barrier but before the rest of shutdown is closed.
🔴 1 blocking
Findings not posted inline (1)
These findings could not be anchored to the current diff, but they are still part of this review.
- [BLOCKING]
packages/rs-platform-wallet/src/manager/mod.rs:516-525: Coordinator gates are released before shutdown is latched closed —quiesce_coordinators().awaitraises each lifecycle gate only for the duration of that coordinator'squiesce(). As soon as line 516 returns, those guards have dropped, butThreadRegistry::shutdown()has not yet set the registryclosinglatch until line 523 begins polling the registry futur...
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-platform-wallet/src/manager/mod.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:516-525: Coordinator gates are released before shutdown is latched closed
`quiesce_coordinators().await` raises each lifecycle gate only for the duration of that coordinator's `quiesce()`. As soon as line 516 returns, those guards have dropped, but `ThreadRegistry::shutdown()` has not yet set the registry `closing` latch until line 523 begins polling the registry future. A concurrent Rust caller holding the manager or coordinator `Arc` can enter `sync_now()` / shielded `sync_wallet()` in that gap, see `quiescing == 0`, claim `is_syncing`, and continue persisting or firing callbacks while shutdown tears down the event adapter and returns. A concurrent `start()` can also install a fresh coordinator before `closing` is set; the registry may then correctly report that new worker's shutdown status, but the merge at lines 524-525 overwrites it with the stale status captured before the restart. Hold the coordinator teardown gates, or set an equivalent manager/registry closing barrier, across the entire registry shutdown phase before reporting that shutdown is clean.
…le shutdown teardown Round-1's shutdown hardening (6c9ed0a) held each coordinator's `quiescing` gate only for the duration of its own `quiesce()`. The local `RefcountedFlagGuard`s dropped the moment `quiesce_coordinators()` returned, so the gates were back at 0 while `registry.shutdown()` tore down the weight-10 event adapter (the store sink) and reaped orphans. `begin_pass` checks only the per-coordinator `quiescing` atomic, not the registry `closing` latch, so a direct `sync_now`/`sync_wallet` in that window passed the gate, claimed `is_syncing`, and could persist / fire a host callback while the sink was being torn down — defeating the self-contained barrier the round-1 change was meant to provide. (Still unreachable via FFI — the HandleStorage RwLock serializes destroy's write/remove against sync's read/with_item — but reachable by any downstream Rust consumer of the public API.) shutdown() now acquires a held `RefcountedFlagGuard` per coordinator via `hold_coordinator_gates()` BEFORE quiescing and keeps them in scope across `quiesce_coordinators()`, `registry.shutdown()`, and the report build — dropped only at function end. The refcount (composing with each quiesce's own transient raise) stays > 0 for the entire teardown and returns to 0 only once shutdown completes or unwinds (Drop is cancellation-safe). A direct sync now bails at `begin_pass` throughout, including during the event-adapter join. Also hardened the per-worker report merge: overwrite a coordinator's slot with its quiesce status ONLY when the registry classified it `NotRunning`. A concurrent `start()` that slips a fresh worker in before the registry latches closed is joined+reported live by `registry.shutdown()`; that live status is now preserved instead of being clobbered by the stale pre-restart status (which could mask a wedged worker as clean). Added `hold_quiescing_gate()` wrappers on the address + identity coordinators (shielded already had one). New regression test `shutdown_holds_gate_through_registry_teardown` parks a wedged orphan so the registry reap blocks ~1s, then asserts the gate is still raised in that phase, a direct sync bails on `quiescing > 0`, and the refcount returns to 0 only after shutdown completes (fails against the 6c9ed0a body where the gate drops at quiesce_coordinators return). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01P47BQ3KtJ8YaS4o5gTBJdB
|
Round-2 blocking finding fixed in Confirmed: round-1 held each coordinator's Fix: Restart/merge: also fixed. The per-worker merge now overwrites a coordinator's slot only when the registry classified it Test: new
|
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Carried-forward prior findings: prior-1 is FIXED. The current head raises every coordinator gate before coordinator quiesce and holds those guards through registry shutdown and status merging, and I found no new in-scope latest-delta findings. Targeted test execution could not complete because the build needed to download Tenderdash proto sources and the network escalation retry was rejected.
Note: Posted body-only because GitHub refused the PR diff (PullRequest.diff too_large, over 20,000 lines), so inline mapping could not run. The verified output contained no findings.
… join [#3954, independent on v3.1-dev] Introduce a shared `ThreadRegistry<WalletWorker>` lifecycle engine in `dash-async` (weight-ordered graceful shutdown: lower weights drain first, equal weights concurrently; reap-or-park orphan handling; cancellation-safe `RefcountedFlagGuard`). Migrate the three periodic sync coordinators (identity / platform-address / shielded) onto a shared `CoordinatorLifecycle` that drives them through the registry and gates passes via an `is_syncing` / `quiescing` handshake. `PlatformWalletManager::shutdown()` now holds every coordinator's quiescing gate across the whole teardown, quiesces the weight-0 coordinators concurrently, then drains the weight-10 wallet-event adapter and any parked orphans, returning a per-worker `ShutdownReport`. `clear_shielded` refuses (leaving the commitment-tree store intact) when the in-flight pass does not drain cleanly, surfaced as `ShieldedShutdownIncomplete` / `ErrorShutdownIncomplete = 19`. FFI `destroy` reports the incomplete-shutdown code so the Swift host defers freeing its callback context. Reconstructed independently on a clean v3.1-dev base: only the graceful- shutdown contribution is included; the drifted #3692 rehydration/storage work merged into the source branch is excluded. <sub>🤖 Co-authored by [Claudius the Magnificent](https://github.com/lklimek/claudius) AI Agent</sub> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
17e1c07 to
a89e0f9
Compare
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
No carried-forward prior findings from the 17e1c07 review; the earlier shutdown-gate issue remains fixed. I verified one new in-scope latest-delta regression: the SQLite same-path open guard was removed while buffers remain per-persister. The destroy finding was dropped because the PR explicitly documents destroy as terminal and the Swift wrapper implements that contract by retaining callback contexts on incomplete shutdown.
🔴 2 blocking
Findings not posted inline (1)
These findings could not be anchored to the current diff, but they are still part of this review.
- [BLOCKING]
packages/rs-platform-wallet-storage/src/sqlite/persister.rs:205-208: Same-path SQLite opens now create divergent buffers — The latest delta removed the process-wide canonical-path registry and theAlreadyOpenerror, soopen()now returns a freshSqlitePersisterwith its ownBuffer::new()for a database path that may already be live in this process. SQLite can serialize committed transactions, but it cannot co...
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-platform-wallet-storage/src/sqlite/persister.rs`:
- [BLOCKING] packages/rs-platform-wallet-storage/src/sqlite/persister.rs:205-208: Same-path SQLite opens now create divergent buffers
The latest delta removed the process-wide canonical-path registry and the `AlreadyOpen` error, so `open()` now returns a fresh `SqlitePersister` with its own `Buffer::new()` for a database path that may already be live in this process. SQLite can serialize committed transactions, but it cannot coordinate these Rust-side buffers: `store()` stages data in the handle-local buffer, `flush()` drains only that buffer, and `load()` reads through only that persister's connection. Two in-process managers using the same wallet DB can therefore miss each other's pending state or commit competing buffered changes in last-flush-wins order. Restore the same-path guard, or move the buffer into a shared path-scoped owner before allowing multiple live handles for one path.
- [BLOCKING] packages/rs-platform-wallet-storage/src/sqlite/persister.rs:205-208: Same-path SQLite opens now create divergent buffers
The latest delta removed the process-wide canonical-path registry and the `AlreadyOpen` error, so `open()` now returns a fresh `SqlitePersister` with its own `Buffer::new()` for a database path that may already be live in this process. SQLite can serialize committed transactions, but it cannot coordinate these Rust-side buffers: `store()` stages data in the handle-local buffer, `flush()` drains only that buffer, and `load()` reads through only that persister's connection. Two in-process managers using the same wallet DB can therefore miss each other's pending state or commit competing buffered changes in last-flush-wins order. Restore the same-path guard, or move the buffer into a shared path-scoped owner before allowing multiple live handles for one path.
|
Re: thepastaclaw's 2026-06-29 re-review ( This is a storage-crate concern that belongs to #3968, not this PR. #3954 ( The concern itself is real and already addressed in #3968: the process-wide canonical-path registry + Dispositioning as out of scope for #3954; tracked and fixed in #3968. 🤖 Co-authored by Claudius the Magnificent AI Agent |
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Thanks for the correction. I rechecked current head a89e0f97daefc9363005644a77c0b1b4fa000509 and agree the same-path SQLite open finding is out of scope for #3954: this PR's diff has no rs-platform-wallet-storage files, and the storage persister guard belongs to the sibling storage PR #3968 / its branch context rather than this ThreadRegistry + wallet lifecycle PR.
I am superseding my 2026-06-29 CHANGES_REQUESTED review for #3954. I do not see any remaining in-scope blockers on the current #3954 head from the prior shutdown-gate review trail.
Bring the PR branch up to date with the new base v4.1-dev (retarget from previous base). Clean merge, no conflicts. <sub>🤖 Co-authored by [Claudius the Magnificent](https://github.com/lklimek/claudius) AI Agent</sub>
…d destroy is the join point No FFI test exercised the `*_stop` symbols. Add a shielded_sync test module that parks a shielded pass in flight (slow completion callback holding `is_syncing`) and asserts: - `platform_wallet_manager_shielded_sync_stop` returns while the pass is still parked and well under the park time — it signals cancellation and does not `block_on` the quiesce. - `platform_wallet_manager_destroy` reports a clean `ShutdownReport` and only returns once the parked pass has drained — it is the real join/drain point, mirroring `identity_sync_stop`/`platform_address_sync_stop`. Plus a fast no-op case: `stop` on a never-started coordinator is a prompt `Success` and `destroy` on the idle manager reports clean. The test needs a mock SDK, so re-enable `dash-sdk`'s `mocks` feature as a dev-dependency (dev-only; the production cdylib/staticlib never pulls it in). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ator shutdown internals Apply the present-state / no-ephemeral-ID comment rules to the shutdown lifecycle code: - registry.rs: move the `TODO(rs-dapi-adoption)` block out of `start_task`'s doc span and shrink it; strip `TC-NNN`/`GAP-NNN` (+ `F2`/`R4`) tags from test doc comments AND rename the test functions to plain descriptive snake_case; rewrite the "pre-refactor"/"no longer" backward narration to state the current invariant directly. - coordinator_lifecycle.rs: rewrite the `worker_config` and `quiesce` docs and the drain-cancellation regression test to present-tense design rationale (drop "prior TOCTOU"/"leaked forever"/"pre-fix"/removed-drain-hook narration); trim the verbose test-setup rationale to why `pass_entered` is needed. - mod.rs: strip `SEC-001`/`SEC-002`/`TC-002 (F2)` prefixes and the bare `F1:` label (keeping the claim); reframe the "against the prior body / pre-fix quiesce / prior behaviour" Non-vacuous docs and inline comments as hypothetical contrasts. Preserved deliberately: the module safety-invariants header, the slots->orphans lock-nesting note, and the LOAD-BEARING Dekker-handshake SeqCst rationale. Left the forward-hypothetical `Non-vacuous:` comments in mod.rs (they describe a broken alternative impl, not code history) — valuable non-vacuousness rationale, not backward narration. Comment- and test-name-only; no production logic, control flow, or public API signatures changed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…se onto local shutdown work Reconciliation merge (real merge commit, no rebase/force-push). origin was rebased onto v4.1-dev, which advances the base to the Dash 4.0.0 release — rust-dashcore 0.44.0 (rev 991c6ebe) and grovedb 5.0.0. This merge folds that base bump into local's verified ThreadRegistry / coordinator-shutdown work. The 4 conflicts all resolved to local's side — local is strictly ahead on every one, origin carried zero unique content worth keeping: - rs-dash-async/src/registry.rs (add/add): byte-identical to local's pre-sweep commit; local HEAD additionally has the comment-hygiene sweep. - rs-platform-wallet/src/manager/coordinator_lifecycle.rs (add/add): same — byte-identical to local's pre-sweep commit. - rs-platform-wallet/src/changeset/core_bridge.rs: local carries the drain-buffered-events-on-cancel work plus its two follow-up hardening commits; origin lacks all three. - rs-platform-wallet/src/manager/mod.rs: all 13 conflict hunks are comment-only (local's sweep drops ephemeral review IDs and history phrasing). The event_manager field stays ungated: local's load_from_persistor reads it unconditionally, so origin's shielded-gating (correct only for origin's tree, which lacks that rework) would break the non-shielded build. The ThreadRegistry/CoordinatorThreadStatus collapse is already present on both sides; nothing to reapply. All shutdown-teardown functionality carried forward with no functional loss (verified by empty diffs against pre-merge HEAD).
There was a problem hiding this comment.
Code Review
The prior blocking finding (same-path SQLite opens creating divergent buffers) is FIXED at 6594bf3 — the process-wide open_path_registry, register_open_path/release_open_path, AlreadyOpen guard, and Drop-based release are all restored in packages/rs-platform-wallet-storage/src/sqlite/persister.rs (lines 87-111, 234). The latest delta (a89e0f9..6594bf3) is a documentation/test-name comment-hygiene sweep on registry.rs/coordinator_lifecycle.rs/manager/mod.rs plus a new FFI regression test proving platform_wallet_manager_shielded_sync_stop is cancel-only while platform_wallet_manager_destroy is the actual join point. No new in-scope blocking or suggestion-level defects identified across six parsed Claude-family reviewer outputs.
_Note: Normal review_poster dry-run failed (Traceback (most recent call last):
File "/Users/claw/.openclaw/workspace/scripts/review_poster.py", line 1163, in
result = post_review(
File "/Users/claw/.openclaw/workspace/scripts/review_poster.py", line 874, in post_review
diff_text = _gh("pr", "diff", str(pr_number), "--repo", repo)
File "/Users/claw/.openclaw/workspace/scripts/review_poster.py", line 96, in gh
raise RuntimeError(f"gh {' '.join(args)} failed: {result.stderr.strip()}")
RuntimeError: gh pr diff 3954 --repo dashpay/platform failed: could not find pull request diff: HTTP 406: Sorry, the diff exceeded the maximum number of lines (20000) (https://api.github.com/repos/dashpay/platform/pulls/3954)
PullRequest.diff too_large), so I posted the verified findings as a top-level exact-SHA review body.
Reviewed commit: 6594bf3
Why this PR exists
ThreadRegistry(rs-dash-async) and aCoordinatorLifecyclethat own worker registration, weighted graceful-shutdown join, and quiescing gates; the identity / platform-address / shielded sync managers are re-plumbed onto it.v3.1-dev— references zero feat(platform-wallet): watch-only rehydration from persistor (seedless load) #3692 symbols; reworks the v3.1-dev sync managers directly. Combines with feat(platform-wallet): watch-only rehydration from persistor (seedless load) #3692/feat(platform-wallet): persistence readers + seedless load() wiring (split from #3692) #3968 on thedash-evo-toolintegration branch (where it also picks up a shutdown cancel-drain enhancement that needs feat(platform-wallet): watch-only rehydration from persistor (seedless load) #3692's changeset reshape to exist).What was done
rs-dash-async:ThreadRegistry, atomic flag guards,ShutdownReport/WorkerStatus.rs-platform-wallet:CoordinatorLifecycle, registry-driven event-adapter task,shutdown()with weighted join + detached-orphan reporting;ShieldedShutdownIncompleteerror + FFI code 19.Testing
cargo fmt --all --check;cargo clippy --workspace --all-targets -- -D warnings;cargo test -p platform-wallet -p dash-async(dash-async 38; platform-wallet 322 with--all-features; 47 shutdown/registry tests incl.shutdown_joins_all_workers,shutdown_waits_for_in_flight_pass_to_drain,restart_in_place). The 4 Swift files need a separate iOS build check.Breaking changes
!— adds an FFI error code (19) and a shutdown error variant.Checklist
🤖 Co-authored by Claudius the Magnificent AI Agent