Skip to content

fix(trace-analyst): count read/inspect tools as self-verification#271

Merged
drewstone merged 1 commit into
mainfrom
fix/self-verification-tool-names
Jun 21, 2026
Merged

fix(trace-analyst): count read/inspect tools as self-verification#271
drewstone merged 1 commit into
mainfrom
fix/self-verification-tool-names

Conversation

@drewstone

Copy link
Copy Markdown
Contributor

The no-self-verification behavioral signal fires a false positive on real coding-agent traces. VERIFY_RE only matched verb-named tools (verif|eval|inspect|check|assert|validat|review|confirm), but the tools agents actually verify with are named Read / Grep / Glob (Claude Code), read_file / ls / cat (codex), and git status / diff — none of which match. Result: any session that reads or greps state (≈ all of them) was reported as "never verifies."

Fix: add the read / search / list / diff / status / test / lint families to the matcher. Bare shell (Bash / exec_command) stays unmatched on purpose — its name can't tell a pytest from an rm, so matching it would just create the opposite false signal.

Measured against a 14,541-session corpus: 99% of edit-sessions also Read/Bash, yet the metric reported 0 verification — the bug this fixes.

Tests: existing 10 stay green (synthetic world.execute/world.eval names unaffected); added a regression test that Read/Grep/read_file/git.status count as verification. Full suite 2501 pass, typecheck clean.

VERIFY_RE only matched verb-named tools (verif/eval/inspect/check/...), but real harnesses verify with Read/Grep/Glob (Claude Code), read_file/ls/cat (codex), and git status/diff — none of which match. So no-self-verification fired on ~any real session that reads or greps state, a false positive. Add the read/search/list/diff/status/test/lint families; leave bare shell (Bash/exec_command) unmatched since the name can't distinguish a test run from a mutation.

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Auto-approved PR — 2f42430a

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-21T16:36:46Z

@tangletools

Copy link
Copy Markdown
Contributor

✅ No Blockers — 2f42430a

Readiness 82/100 · Confidence 65/100 · 6 findings (2 medium, 4 low)

opencode-kimi glm deepseek aggregate
Readiness 85 86 82 82
Confidence 65 65 65 65
Correctness 85 86 82 82
Security 85 86 82 82
Testing 85 86 82 82
Architecture 85 86 82 82

Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM Unbounded read substring in VERIFY_RE creates false-positive self-verification — src/trace-analyst/behavioral-metrics.ts

Line 58-59 adds read without word boundaries to VERIFY_RE. Because the regex is case-insensitive, a tool named thread, ready, readme, pread, readline, etc. will match and be counted as self-verification, incorrectly suppressing the no-self-verification signal. Concrete: VERIFY_RE.test('thread') is true. This undermines eval validity for any harness whose tool names contain that substring. Fix: scope read to real read-tool shapes, e.g. \bRead\b|\bread_file\b|\bread_dir\b|\breadFile\b|\breadDirectory\b or \b(read|Read)(?:_file|_dir|File|Directory|Text)?\b.

🟠 MEDIUM VERIFY_RE: read/view/glob/diff lack word boundaries → false-positive self-verification matches — src/trace-analyst/behavioral-metrics.ts

Line 59. Tokens list, ls, cat, find use \b word boundaries, but read, view, glob, diff, status do not. Confirmed edge cases: 'thread', 'spreadsheet', 'spread_operator', 'already_setup', 'treadmill' match via read substring; 'global' matches via glob; 'preview', 'overview' match via view; 'differential' matches via diff. A tool named 'thread_pool' or 'spreadsheet_export' would be counted as self-verification, silently suppressing the no-self-verification signal — the agent looks better than it is. Impact: eval validity (false negative on behavioral signal). Fix: add \b to read, view, glob, diff, status (e.g. \bread\b|\bview\b|\bglob\b|\bdiff\b|\bstatus\b)

🟡 LOW New test is positive-only; no negative cases anchor the design intent — src/trace-analyst/behavioral-metrics.test.ts

Lines 120-128: the new test only verifies that Read/Grep/read_file/git.status DO match. It does not lock in the documented exclusion of shell wrappers (Bash/exec_command) nor the boundary behavior (\blist\b matching list but not lsremote). The existing fixture test at line 60-67 indirectly covers world.execute not matching, but the new regex adds boundary-anchored patterns whose correctness is currently untested. Suggested addition: `for (const tool of ['Bash','exec_command','lsremote','catapu

🟡 LOW No negative-test coverage for VERIFY_RE boundary — src/trace-analyst/behavioral-metrics.test.ts

Lines 120-128. The new test only verifies that expected tools (Read, Grep, read_file, git.status) ARE counted as self-verification. No test verifies that non-verification tools (Bash, exec_command, Write, Edit, Task, thread, spreadsheet) are NOT counted. A regex that matched everything would pass the current tests. Add a negative case: toolSpan on 'Bash' or 'exec_command' should not set hasSelfVerification.

🟡 LOW Substring matches on unbounded alternatives create false-positive verification classification — src/trace-analyst/behavioral-metrics.ts

Line 59: alternatives read|view|find|diff|status|search are unbounded. Probe confirms matches on spreadsheet (read), thread (read), research_agent (search), diff_apply (diff), status_report_writer (status), viewer_metrics_push (view). A tool literally named any of these would be counted as self-verification, suppressing the no-self-verification signal. Realistic impact is low for current major harnesses (Claude Code, codex, Cursor) but a custom harness with e.g. a thread tool would silently flip the signal. Mitigation already applied to the shortest common words (\blist\b, \bls\b, \bcat\b) — consider extending word-boundary anchoring to `

🟡 LOW pytest/vitest don't match \btest despite comment claiming 'test' coverage — src/trace-analyst/behavioral-metrics.ts

Line 59: \btest only has a LEFT word boundary. In pytest/vitest, test is preceded by a word char (p/i), so there is no boundary and the regex does NOT match. Verified by probe: pytest→false, vitest→false. The comment at line 53-57 advertises 'test/lint' as covered. Practical impact is low because test runners are usually invoked through Bash/exec_command (which is intentionally excluded — its name can't discriminate), but the comment overstates coverage. Fix: add |pytest|vitest|jest|mocha explicitly OR dr


tangletools · 2026-06-21T16:42:11Z · trace

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Value Audit — sound-with-nits

Verdict sound-with-nits
Concerns 3 (3 weak-concern)
Heuristic 0.0s
Duplication 0.0s
Interrogation 246.2s (2 bridge agents)
Total 246.2s

💰 Value — sound-with-nits

Extends the existing VERIFY_RE so real coding-agent tool names (Read/Grep/read_file/git status) count as self-verification, fixing a real false-positive; ship.

  • What it does: Expands the single VERIFY_RE regex in src/trace-analyst/behavioral-metrics.ts:58-59 to also match the read/search/list/diff/status/test/lint tool-name families. Span tool names like Read, Grep, read_file, git.status now set hasSelfVerification=true (line 120) and suppress the 'no-self-verification' signal (line 177). Bare shell (Bash/exec_command) stays intentionally unmatched. Adds a 4-case regre
  • Goals it achieves: Fixes a false positive in the deterministic 'no-self-verification' behavioral signal. Real coding-agent traces (Claude Code, codex) verify by reading/grepping state, but their tools are named Read/Grep/read_file/git status — none matched the old verb-only matcher (verif|eval|inspect|check|assert|validat|review|confirm), so virtually every real session was reported as 'never verifies'. After merge,
  • Assessment: Good change on its merits. It is small (one regex expansion + comment), targeted (one false positive), and squarely in the grain of the file's existing design — VERIFY_RE was already a single-regex tool categorizer (line 120), and this extends it. The comment at lines 53-57 honestly documents the boundary (shell excluded because the name can't disambiguate pytest from rm). The 14,541-session claim
  • Better / existing approach: Searched src/ for tool-categorization regexes. Found scattered equivalents with DIFFERENT intent: EDIT_TOOLS at src/storyboard/code-edit.ts:20, duplicated read/edit matchers at src/storyboard/index.ts:159,166,175,284,286, and edit/test regexes at src/run-critic.ts:88,101. None of them is a true duplicate of 'is this verification vs. mutation' — they classify for rendering or critic scoring, with d
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 2
  • Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content

🎯 Usefulness — sound-with-nits

Expands the VERIFY_RE regex with read/search/diff/test/lint tool families to fix a false positive where coding agents that read/grep state were incorrectly flagged as 'never verifies' — the change is minimal, well-targeted, and flows through the existing deterministic-signal pipeline automatically.

  • Integration: Fully reachable. The VERIFY_RE regex at behavioral-metrics.ts:58-59 feeds hasSelfVerification at :120, which controls the no-self-verification signal at :177. computeTraceMetrics is called by behavioralAnalyst().analyze() at behavioral-analyst.ts:86, which is always registered in buildDefaultAnalystRegistry() at default-registry.ts:36, and also exported for direct use from src/index.ts:376. Every
  • Fit with existing patterns: Natural extension of the existing pattern. The codebase has exactly one self-verification classifier — this regex — and no competing mechanism. The comment at :53-57 explains why each tool family is included and why shell tools are deliberately excluded. Same deterministic, model-agnostic approach as the module's other three behavioral signals.
  • Real-world viability: Covers the real tool names used by agent harnesses: Claude Code (Read/Grep/Glob), Codex (read_file/ls/cat), git (status/diff), test/lint runners. Tests at behavioral-metrics.test.ts:120-128 verify Read, Grep, read_file, and git.status all suppress the false signal. Existing 10 tests stay green. The intentional exclusion of bare shell (Bash/exec_command) avoids the opposite false signal. One weak n
  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 3
  • Bridge warning: opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content

💰 Value Audit

🟡 Substring alternation read lacks word boundaries the rest of the regex uses [better-architecture] ``

VERIFY_RE mixes anchored alternates (\blist\b, \bls\b, \bcat\b, \bfind\b at line 59) with bare substrings like read, view, search, eval. read matches Thread/Already/Treadmill as well as Read/read_file. Real tool names won't trip this today, but the inconsistency is the kind of thing that bites when a new harness lands. Either anchor the substring alternates consistently or add a quick test for known non-matches (Bash, exec_command, thread). Note for the reviewer; does not gate shipping

🟡 N+1 scattered tool-name regex; consolidation overdue [duplication] ``

The codebase already hand-rolls tool-categorization regexes in at least three places (src/storyboard/code-edit.ts:20 EDIT_TOOLS; src/storyboard/index.ts:159,166,175,284,286 with two duplicated read matchers; src/run-critic.ts:88,101). This PR adds a fourth at behavioral-metrics.ts:58. Each has different intent so none is a true duplicate, but the pattern is clearly straining. A future PR could lift a shared tool-categories.ts (EDIT/READ/VERIFY/TEST families) and have each site compose from it. N

🎯 Usefulness Audit

🟡 read without word boundaries could match unintended tool names in edge cases [robustness] ``

The regex uses read (not \bread\b) at behavioral-metrics.ts:59. This deliberately allows matching read_file and similar, but in the highly unlikely scenario of a tool named thread, bread, or spread, it would count as verification. In practice, OTLP tool names from agent harnesses are sparse identifiers (Read, Bash, Write, read_file, grep, ls) — not prose. This is a theoretical edge, not a practical one. Adding word boundaries would break read_file matching. No action needed unless


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass What it asks
Heuristic Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication Do added function/class names already exist elsewhere in the repo?
Value Audit What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260621T164309Z

@drewstone drewstone merged commit 5c902e3 into main Jun 21, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants