Arch target derivations: component-sum (first of §C)#53

Draft
MaxGhenis wants to merge 9 commits into claude/spec-driven-engine from claude/arch-derivations


@MaxGhenis

Contributor

First slice of the Arch target derivations (variable-manifest §C), which core's target layer still lacks (targets/arch.py has loaders only, no derivations). Claimed in the coordination journal as iter275.

What this adds

microplex/targets/arch_derivations.py — the component-sum derivation, ported faithfully from legacy microplex_us.targets.arch (_component_sum_records et al.) as a country-agnostic op + injected config:

  • component_sum_records / with_component_sum_records: synthesize composite AMOUNT targets (e.g. SALT = state_local_income_or_sales_tax + real_estate_taxes) by summing declared components sharing a cell key. Emits a composite only when all declared components are present, skips if the output already exists at the cell, and drops the group on a duplicate component (never double-counts).
  • Generic: the US specifics — component_sum_map, geography-level fn, source-normalization fn — are injected (sensible defaults provided), so the engine stays country-agnostic and the US pack declares data.
  • Operates on a representation-light ArchTargetRecord so it can wire onto whichever loaded-record type the target layer settles on.
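To make the three guard rules concrete, here is a minimal sketch of the component-sum behaviour described above. It is not the code in arch_derivations.py: the `Rec` dataclass, its fields, the `(period, geography_id)` cell key, and the `salt` output name are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rec:
    """Hypothetical stand-in for ArchTargetRecord (fields are assumptions)."""
    variable: str
    value: float
    target_type: str = "AMOUNT"
    period: int = 2024
    geography_id: Optional[str] = None


def component_sum(records, component_sum_map):
    """Return new composite records; the inputs are never modified."""
    by_cell = defaultdict(list)
    for r in records:
        if r.target_type == "AMOUNT":  # non-AMOUNT records never participate
            by_cell[(r.period, r.geography_id)].append(r)

    out = []
    for (period, geo), cell in by_cell.items():
        names = [r.variable for r in cell]
        for output, components in component_sum_map.items():
            if output in names:
                continue  # the output already exists at this cell
            if any(names.count(c) != 1 for c in components):
                continue  # a component is missing or duplicated: emit nothing
            total = sum(r.value for r in cell if r.variable in components)
            out.append(Rec(output, total, "AMOUNT", period, geo))
    return out


# The map is injected config; "salt" is an illustrative output name.
SALT_MAP = {"salt": ("state_local_income_or_sales_tax", "real_estate_taxes")}
records = [
    Rec("state_local_income_or_sales_tax", 100.0),
    Rec("real_estate_taxes", 50.0),
    Rec("real_estate_taxes", 70.0, period=2023),  # other cell, incomplete
]
derived = component_sum(records, SALT_MAP)
```

Only the 2024 cell yields a composite here; the 2023 cell has one of the two components and is left alone.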

Tests

9 unit tests (tests/targets/test_arch_derivations.py): the sum (2- and 3-way), skip-if-output-exists, incomplete components, duplicate-component bail, cross-cell and cross-period isolation, non-AMOUNT skip. ruff check/format clean.

For review (codex)

  • Record reconciliation: I used a standalone ArchTargetRecord to stay clear of the churning target layer (ArchConsumerFact / database.Target / RACVariable). When you settle the canonical loaded-record type, we wire a thin adapter; flag which representation you want these to consume.
  • Next in this lane: latest carry-forward (period/source ranking), state→national rollup (excl. PR fips 72), BEA employment_income_before_lsr (residence-adjust + national reconciliation), SOI count/amount aging.

🤖 Generated with Claude Code

MaxGhenis and others added 3 commits June 8, 2026 15:33
First slice of the Arch target derivations (manifest §C) that core's target
layer still lacks. Ports the legacy microplex_us.targets.arch component-sum
faithfully as a country-agnostic op: generic algorithm + injected US config
(component_sum_map, geo-level fn, source-normalization fn), operating on a
representation-light ArchTargetRecord so it can wire onto whichever loaded
record type the target layer settles on.

- component_sum_records / with_component_sum_records: synthesize composite
  AMOUNT targets (e.g. SALT = state_local_income_or_sales_tax +
  real_estate_taxes) by summing declared components at a shared cell key; emits
  only when all components present, skips if the output already exists at the
  cell, and drops the group on a duplicate component (never double-counts).
- 9 unit tests covering the sum, skip-if-output-exists, incomplete-components,
  duplicate-component bail, cross-cell/period isolation, and non-AMOUNT skip.

Next derivations in this lane: latest carry-forward, state->national rollup
(excl. PR fips 72), BEA employment_income_before_lsr, SOI count/amount aging.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Second Arch derivation: the SSA latest-carry-forward, ported from legacy
microplex_us.targets.arch as a country-agnostic algorithm + injected pieces.

- latest_carry_forward(): keep the highest-ranked candidate per target cell
  (period not in the future), then remap stale kept records to target_year.
  is_candidate / cell_key / rank / carry_forward / sort_key are injected; the
  cell_key stays injected because it depends on the canonical target rep that
  the target layer is still settling.
- ssa_carry_forward_rank(): faithful default rank (latest period > annual
  statistical report table > any table > ssi_total_payments > target_id).
- is_ssa_carry_forward_candidate(): SSA source + declared carry-forward var +
  AMOUNT/COUNT, with the var set injected.
- 7 more unit tests (16 total): highest-rank-per-cell + stale carry-forward,
  target-year passthrough, future-period skip, candidate/None-cell exclusion,
  deterministic sort, SSA rank ordering, SSA candidate gating.
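
The selection rule in this commit (best-ranked non-future candidate per cell, then remap stale periods) can be sketched roughly as follows. The `Rec` fields and the simplified `(period, target_id)` rank are assumptions; the real SSA rank has more tiers, and the real function also takes injected carry_forward and sort_key callables.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Rec:
    """Hypothetical minimal record; the real ArchTargetRecord has more fields."""
    target_id: int
    variable: str
    period: int
    value: float


def latest_carry_forward(records, target_year, *, is_candidate, cell_key, rank):
    best = {}
    for r in records:
        if not is_candidate(r) or r.period > target_year:
            continue  # not eligible, or dated in the future
        key = cell_key(r)
        if key is None:
            continue  # records without a cell are excluded
        if key not in best or rank(r) > rank(best[key]):
            best[key] = r
    kept = sorted(best.values(), key=lambda r: r.target_id)  # deterministic
    # Stale records are restamped to target_year; current ones pass through.
    return [replace(r, period=target_year) if r.period < target_year else r
            for r in kept]


records = [
    Rec(1, "ssi_total_payments", 2022, 10.0),
    Rec(2, "ssi_total_payments", 2023, 12.0),
    Rec(3, "ssi_total_payments", 2026, 99.0),  # future period, ignored
    Rec(4, "other_variable", 2023, 5.0),       # not a candidate
]
carried = latest_carry_forward(
    records,
    2024,
    is_candidate=lambda r: r.variable == "ssi_total_payments",
    cell_key=lambda r: (r.variable,),
    rank=lambda r: (r.period, r.target_id),  # simplified stand-in rank
)
```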

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Third Arch derivation: state->national rollup, ported from legacy
microplex_us.targets.arch as a country-agnostic algorithm + injected config.

- state_to_national_rollup(): group state-level records by an injected
  group_key and emit one national total per group that covers EVERY state in
  required_states exactly once; skip groups missing a state, carrying a
  duplicate state, or whose national total already exists. The US pack injects
  required_states (51-state set excl. PR fips 72), group_key (rollup-var filter
  + non-state cell fields), and state-fips/geo-level extractors.
- sum_state_records_to_national(): faithful default builder (sum, null geo,
  deterministic national id, merged lineage; injectable non_state_constraints).
- 7 more unit tests (23 total): complete-set sum, incomplete/duplicate skip,
  skip-if-national-exists, out-of-set (PR) exclusion, non-state ignore,
  constraint stripping.
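
As an illustration of the "every required state exactly once" rule, a minimal sketch follows. The `Rec` shape, grouping by variable alone, and the two-state required set are simplifications for the example; the real group_key and the 51-state set are injected by the US pack.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rec:
    """Hypothetical record: state_fips=None marks a national record."""
    variable: str
    value: float
    state_fips: Optional[str]


def state_to_national_rollup(records, required_states):
    required = frozenset(required_states)
    existing_national = {r.variable for r in records if r.state_fips is None}
    groups = defaultdict(list)
    for r in records:
        if r.state_fips in required:  # out-of-set states (e.g. PR) are ignored
            groups[r.variable].append(r)

    out = []
    for variable, members in sorted(groups.items()):
        if variable in existing_national:
            continue  # a national total already exists
        states = [r.state_fips for r in members]
        if len(states) != len(set(states)) or set(states) != required:
            continue  # duplicate or missing state: emit nothing
        out.append(Rec(variable, sum(r.value for r in members), None))
    return out


REQUIRED = {"01", "02"}  # toy set for the example, not the real 51 states
records = [
    Rec("wages", 10.0, "01"),
    Rec("wages", 20.0, "02"),
    Rec("wages", 5.0, "72"),  # outside the required set, excluded from the sum
    Rec("rent", 1.0, "01"),   # incomplete group, skipped
]
national = state_to_national_rollup(records, REQUIRED)
```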

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@MaxGhenis

Contributor Author

Canonical-record decision for the blocking ask: use the ported ArchTargetRecord as the canonical loaded-record representation for Arch derivations and provider wiring.

Concretely:

  • Treat ArchConsumerFact as the raw/interchange input adapter surface, not the derivation contract.
  • Do not use legacy database.Target as the Arch derivation contract.
  • Keep TargetSpec/TargetSet as the final calibration-facing output after Arch rows are loaded, normalized, derived, filtered, and rolled up.
  • Add thin adapters: consumer fact / Arch DB row -> ArchTargetRecord at load time, then ArchTargetRecord -> TargetSpec only at the provider boundary.
  • Preserve lineage fields through adapters: source record/cell/row keys, aggregate/semantic fact keys, source target/stratum ids, concept fields, source table/url/notes, and deterministic negative ids for synthetic records.
  • Carry-forward cell keys should identify the normalized semantic target cell while excluding the source period being ranked/carried. For SSA, include variable/concept identity, target type, geography level/id, non-time constraints, unit, and relevant source/concept authority dimensions; do not include period. Include source only if the candidate predicate does not already restrict to the intended source family.
  • BEA/SOI/component/rollup derivations should consume and return ArchTargetRecord sequences, with conversion to TargetSpec downstream.
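
A possible shape for the carry-forward cell key described above, sketched under assumptions: the `Rec` fields, the `(field, op, value)` constraint triples, and the `TIME_FIELDS` set are hypothetical, and source/concept authority dimensions are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rec:
    """Hypothetical record shape for illustration only."""
    variable: str
    target_type: str
    geographic_level: str
    geography_id: Optional[str]
    unit: str
    period: int
    constraints: tuple = ()  # (field, op, value) triples


# Assumption: constraints on these fields describe time, not the cell.
TIME_FIELDS = frozenset({"period", "year"})


def carry_forward_cell_key(r):
    """Identify the semantic cell while leaving out the period being ranked."""
    non_time = tuple(sorted(c for c in r.constraints if c[0] not in TIME_FIELDS))
    return (r.variable, r.target_type, r.geographic_level,
            r.geography_id, r.unit, non_time)


ca_2022 = Rec("ssi_total_payments", "AMOUNT", "STATE", "06", "usd", 2022,
              (("year", "==", "2022"), ("age", ">=", "65")))
ca_2023 = Rec("ssi_total_payments", "AMOUNT", "STATE", "06", "usd", 2023,
              (("age", ">=", "65"), ("year", "==", "2023")))
ny_2023 = Rec("ssi_total_payments", "AMOUNT", "STATE", "36", "usd", 2023,
              (("age", ">=", "65"),))

key_2022 = carry_forward_cell_key(ca_2022)
key_2023 = carry_forward_cell_key(ca_2023)
key_ny = carry_forward_cell_key(ny_2023)
```

With this key the two California records compete for one cell regardless of period, while the New York record stays in its own cell.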

I also logged this in the shared journal as iter280. I will stay clear of targets/arch_derivations.py, targets/rollups.py, and related Arch derivation files while you finish this lane.

MaxGhenis and others added 6 commits June 8, 2026 17:09
Fourth Arch derivation: SOI count/amount aging, ported from legacy
microplex_us.targets.arch as a country-agnostic, *source-backed* algorithm.

- age_soi_records(): group records by source year, scale each by its
  target-type factor (count vs amount), stamp period/source_period/aging_factors;
  same-year records pass through. factors_for is injected (default below).
- soi_aging_factors() + soi_count_aging_factor() + soi_amount_aging_factor():
  factors are RATIOS of source-backed reference series across years, not
  hardcoded growth -- counts scale by BLS labor force (CBO fallback, then SOI
  return-count), amounts by SOI AGI (exact or last-growth extrapolation), 1.0
  carry-forward when no reference. Reference variables/sources + the total-scope
  predicate are injectable (US/eCPS defaults provided).
- reference_total / soi_total_for_year helpers; default_total_scope predicate.
- 13 more unit tests (35 total): factor-by-type application, same-year
  passthrough, BLS/CBO/SOI fallback chain, AGI exact + extrapolation, identity,
  not_required, total-scope.
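
The ratio-of-reference-series idea, with a fallback chain and a 1.0 carry-forward, can be sketched as below. This is a simplification: records are plain dicts, the AGI extrapolation path is left out, and every number is made up for the example rather than taken from BLS, CBO, or SOI.

```python
def ratio_factor(source_year, target_year, *series_by_priority):
    """First series covering both years wins; otherwise 1.0 (carry-forward)."""
    for series in series_by_priority:
        base = series.get(source_year)
        if base and target_year in series:
            return series[target_year] / base
    return 1.0


def age_record(record, target_year, count_series, amount_series):
    if record["period"] == target_year:
        return record  # same-year records pass through untouched
    series = count_series if record["target_type"] == "COUNT" else amount_series
    factor = ratio_factor(record["period"], target_year, *series)
    return {
        **record,
        "value": record["value"] * factor,
        "period": target_year,
        "source_period": record["period"],
        "aging_factor": factor,
    }


# Illustrative reference series (invented values, not real data).
primary_counts = {2024: 168.0}                # lacks 2022, so it is skipped
fallback_counts = {2022: 160.0, 2024: 168.0}  # used instead: factor 1.05
agi = {2022: 100.0, 2024: 110.0}              # factor 1.10

count_rec = {"variable": "returns", "target_type": "COUNT",
             "period": 2022, "value": 200.0}
amount_rec = {"variable": "agi", "target_type": "AMOUNT",
              "period": 2022, "value": 50.0}
aged_count = age_record(count_rec, 2024, [primary_counts, fallback_counts], [agi])
aged_amount = age_record(amount_rec, 2024, [primary_counts, fallback_counts], [agi])
```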

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Fifth Arch derivation: BEA residence-adjusted, nationally-reconciled state
wage synthesis, ported from legacy microplex_us.targets.arch.

- bea_state_employment_income_before_lsr(): per state with the full wage
  component set (wages/supplements/contributions/residence_adjustment),
  allocate the residence adjustment to wages by wages/(wages+supplements+
  contributions), then scale every state so the residence-adjusted total
  equals the national BEA NIPA wages total. Requires all required_states with
  all four roles; bails on non-positive denominator/total. Component map,
  output variable, required states, and state-fips extractor are injected.
- bea_national_wages_record(): finds the national NIPA wages_and_salaries total
  (concept-based, US defaults injectable).
- _default_bea_state_record(): faithful synthetic-record builder (deterministic
  ids, SAINC5N lineage, scaled-to-NIPA notes); adds stratum_name/
  concept_evidence_url/legal_vintage fields to ArchTargetRecord.
- 5 more unit tests (40 total): residence-adjust+scale sums to national,
  missing-component/missing-state/zero-denominator bail, national-wages finder.
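
One reading of the arithmetic in this commit, as a worked sketch: each state's wages receive a share of the residence adjustment proportional to wages/(wages+supplements+contributions), and all states are then scaled by one factor so they sum to the national total. The dict layout and the figures are invented for illustration, and the required-states check is omitted.

```python
def bea_state_wages(states, national_wages):
    """Return {state_fips: reconciled wages}, or None when inputs are unusable."""
    adjusted = {}
    for fips, c in states.items():
        denom = c["wages"] + c["supplements"] + c["contributions"]
        if denom <= 0:
            return None  # bail on a non-positive denominator
        share = c["wages"] / denom
        adjusted[fips] = c["wages"] + c["residence_adjustment"] * share

    total = sum(adjusted.values())
    if total <= 0 or national_wages <= 0:
        return None  # bail on a non-positive total
    scale = national_wages / total
    return {fips: value * scale for fips, value in adjusted.items()}


# Made-up component values for two states.
states = {
    "01": {"wages": 80.0, "supplements": 10.0, "contributions": 10.0,
           "residence_adjustment": 10.0},   # 80 + 10 * 0.8 = 88
    "02": {"wages": 60.0, "supplements": 20.0, "contributions": 20.0,
           "residence_adjustment": -10.0},  # 60 - 10 * 0.6 = 54
}
reconciled = bea_state_wages(states, national_wages=284.0)  # scale = 284 / 142
```

After scaling, the state values sum to the national figure by construction, which is the property the "sums to national" test in this commit checks.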

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Sixth and final §C derivation: the skip/blocklist surface that shapes the
clean target surface, ported from legacy microplex_us.targets.arch.

- should_skip_target_record(): drop unsupported ratio/component variables and
  national BEA regional inputs (the components the BEA derivation consumes).
- should_skip_fact_concept(): drop skip-listed Arch fact concepts.
- is_blocked_self_employment_binding(): broad business-income SE blocklist
  (marker intersection over variable/concept/source ids + constraints).
- is_bea_regional_country_record() / default_bea_regional_lineage() helpers.
  All blocklist sets are injected US config.
- 5 more unit tests (45 total).

Arch target-derivation logic (§C) is now complete: component-sum, latest
carry-forward, state->national rollup, SOI aging, BEA, skip/blocklist.
Next: the ArchTargetRecord<->TargetSpec adapters + ArchTargetProvider wiring.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…rgetSet

The provider-boundary half of codex's iter280 plan: after derivations + skip
filters, convert ArchTargetRecords to the calibration-facing TargetSpec/TargetSet.

- arch_target_record_to_target_spec(): COUNT -> count target (no measure), else
  sum over the variable; constraints -> TargetFilters; injected PE entity;
  Arch lineage (ids/concept/source/geography) preserved in metadata.
- arch_records_to_target_set(): convert a derived record sequence to a TargetSet
  with injected entity_of, optional skip filter, and measure override.
- default_arch_target_name(): deterministic cell-unique target name.
- 7 unit tests.

This makes the 6 §C derivations usable end-to-end: derive ArchTargetRecords ->
filter -> convert -> TargetSet for the calibrator. Next: the derivation pipeline
orchestrator + ArchTargetProvider (load -> derive -> convert).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Completes the Arch target lane end-to-end (derive -> convert -> TargetSet):

- run_arch_derivation_pipeline(): composes the 6 §C derivations in the legacy
  order -- BEA augment -> (non-SOI current + latest carry-forward + latest/aged
  SOI) -> component sum -> state->national rollup -> skip filter. Each step runs
  only when its config is present; reference_records supplies SOI aging refs.
- ArchPipelineConfig: the declarative config the US pack supplies (component
  map, rollup/BEA/carry-forward/SOI/skip params + geo/source/state-fips fns).
- latest_soi_records_by_composition / arch_record_composition_key: faithful SOI
  composition dedup (latest period per cell).
- ArchTargetProvider: a TargetProvider that runs the pipeline over pre-loaded
  ArchTargetRecords and converts to a TargetSet, applying the query.
- 11 more tests (56 total): composition dedup, component-sum-in-pipeline, skip
  filter, provider -> TargetSet + query filtering.
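
The "each step runs only when its config is present" composition might look roughly like this. `PipelineConfig` and its field names are hypothetical stand-ins, not the actual ArchPipelineConfig, and the steps are reduced to list-to-list callables.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Step = Callable[[list], list]


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical config: a step left as None is simply not run."""
    bea_augment: Optional[Step] = None
    select_current: Optional[Step] = None
    component_sum: Optional[Step] = None
    state_rollup: Optional[Step] = None
    skip: Optional[Callable[[object], bool]] = None


def run_pipeline(records, config):
    out = list(records)
    # Fixed order, mirroring the order listed in the commit message.
    for step in (config.bea_augment, config.select_current,
                 config.component_sum, config.state_rollup):
        if step is not None:
            out = step(out)
    if config.skip is not None:
        out = [r for r in out if not config.skip(r)]
    return out


# Toy run with integers standing in for records.
config = PipelineConfig(
    component_sum=lambda rs: rs + [sum(rs)],  # append a "composite"
    skip=lambda r: r == 2,                    # drop a "blocked" record
)
result = run_pipeline([1, 2, 3], config)
```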

Remaining for a live surface: the DB/JSONL -> ArchTargetRecord load adapter
(needs the real Arch artifact) + the US config in packs/us.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@MaxGhenis

Contributor Author

Review blocker before this becomes a live Arch target surface:

arch_target_record_to_target_spec() currently converts only record.constraints into TargetFilters and preserves geographic_level / geography_id only as metadata. That means a state/county/district Arch record with geographic_level="STATE" and geography_id="06" but no explicit state_fips constraint becomes an unfiltered national SUM/COUNT target. Legacy arch_target_record_to_canonical_spec() appended _target_filter_for_arch_geography(record) before exposing the target, and the carry-forward cell key also depended on the canonical geographic id. This needs an injected geography-filter adapter or equivalent conversion before the provider is safe to wire into run_spec.

Related fail-closed issue: the converter treats every non-COUNT target as SUM. Legacy returned None for RATE, and COUNT records also had alias/positive-measure handling rather than counting every entity unconditionally. At minimum, unsupported target types should be rejected/skipped, and the US config should inject the count/measure/filter mapping needed to preserve the legacy target surface.

Focused checks I ran locally from a clean PR-head worktree:

  • uv run --python 3.13 ruff check src/microplex/targets/arch_derivations.py src/microplex/targets/arch_provider.py tests/targets/test_arch_derivations.py tests/targets/test_arch_provider.py passed.
  • uv run --python 3.13 --extra dev python -m pytest tests/targets/test_arch_derivations.py tests/targets/test_arch_provider.py passed: 56 passed.

I would hold merge until provider-boundary conversion has tests for at least: state geography -> target filter, RATE rejected/skipped, and COUNT targets with positive/count-domain filters where applicable.
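
To illustrate the fail-closed conversion requested above, a minimal sketch follows. The `Rec` fields, the `"NATIONAL"` level name, the `GEO_FILTER_FIELD` mapping, and the dict-shaped spec are assumptions for the example; the real converter returns a TargetSpec, and the real mapping would be injected by the US pack.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rec:
    """Hypothetical record shape for illustration only."""
    variable: str
    target_type: str
    geographic_level: str
    geography_id: Optional[str]
    constraints: tuple = ()  # (field, op, value) triples


# Hypothetical injected mapping from geography level to filter field.
GEO_FILTER_FIELD = {"STATE": "state_fips"}


def to_spec(r, geo_filter_field=GEO_FILTER_FIELD):
    """Return a spec dict, or None when the record cannot be expressed safely."""
    if r.target_type not in ("AMOUNT", "COUNT"):
        return None  # e.g. RATE: skip instead of treating it as a SUM
    filters = list(r.constraints)
    if r.geographic_level != "NATIONAL":
        field = geo_filter_field.get(r.geographic_level)
        if field is None or r.geography_id is None:
            return None  # refuse rather than emit an unfiltered national target
        if not any(f[0] == field for f in filters):
            filters.append((field, "==", r.geography_id))
    aggregation = "count" if r.target_type == "COUNT" else "sum"
    return {"variable": r.variable, "aggregation": aggregation,
            "filters": tuple(filters)}


state_spec = to_spec(Rec("wages", "AMOUNT", "STATE", "06"))
rate_spec = to_spec(Rec("poverty_rate", "RATE", "NATIONAL", None))
county_spec = to_spec(Rec("wages", "AMOUNT", "COUNTY", "06037"))
national_spec = to_spec(Rec("wages", "AMOUNT", "NATIONAL", None))
```

The state record gains a `state_fips` filter, while the RATE record and the unmapped COUNTY record are skipped rather than silently widened to national scope.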
