docs: operator pages (checklist, troubleshooting, alembic migrations)#53

Merged
lesnik512 merged 7 commits into main from docs/operator-pages
Jun 11, 2026

@lesnik512

Member

The B follow-on from #50 (docs-landing-and-comparison). Adds three new pages under a new docs/operations/ directory + a fifth top-level Operations nav section, plus five one-line cross-link callouts from existing reference pages.

Summary

  • docs/operations/checklist.md — sixteen items across six sections (Sizing / Subscribers / DLQ / Drain & lifecycle / Schema / Observability). Each item is one to two lines plus a link into the existing reference page that owns the underlying detail. Checklist is link-scaffold, not new prose — operators scan, then drill in.
  • docs/operations/troubleshooting.md — eleven symptoms in TOC-table form, each with Symptom / Likely cause / Diagnose / Fix / Reference. Covers operator-actionable signals (event=lease_lost, outbox growth patterns, idle latency, duplicate invocations, rolling-deploy leakage) plus by-design surprises (ACK_FIRST raises, OutboxResponse + foreign publisher nacked, validate_schema ImportError). Explicit attr_list anchors so the TOC jumps stay stable across mkdocs slugifier behavior.
  • docs/operations/alembic.md — what alembic revision --autogenerate actually produces against make_outbox_table() (verbatim, captured during the spec phase via a produce_migrations + render_python_code spike against an empty Postgres 17 / SQLAlchemy 2.0.50 / Alembic 1.18.4), plus the additive DLQ second migration, a CI drift-detection recipe using validate_schema(), and a step-by-step DLQ retention recipe converting to range-partitioned-by-failed_at with a monthly cron for create-next / drop-oldest.
  • mkdocs.yml — new fifth Operations section in nav.
  • docs/index.md — new decision-tree row: "Deploy to production safely → Production checklist".
  • Five callouts added to docs/usage/subscriber.md, docs/usage/dlq.md, and docs/usage/schema-validation.md pointing into the relevant operator-page sections. Same italic style the Comparison callouts (docs: rewrite landing page, reshape nav, add Comparison page #50) use.
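
The monthly create-next / drop-oldest rotation described for docs/operations/alembic.md can be sketched as a small DDL generator. This is a minimal illustration, not the recipe from the page itself: the parent table name `outbox_dlq`, the `<table>_YYYY_MM` partition naming, and the `keep_months` parameter are all assumptions made for the example.

```python
from datetime import date


def month_start(d: date, offset: int) -> date:
    """First day of the month `offset` months away from d's month."""
    index = d.year * 12 + (d.month - 1) + offset
    return date(index // 12, index % 12 + 1, 1)


def rotation_sql(today: date, keep_months: int, table: str = "outbox_dlq") -> list[str]:
    """Build the two statements a monthly cron job would run.

    `table` and the partition naming scheme are hypothetical.
    """
    next_start = month_start(today, 1)
    next_end = month_start(today, 2)
    oldest = month_start(today, -keep_months)
    return [
        # create-next: add the partition covering next month's failed_at range
        f"CREATE TABLE IF NOT EXISTS {table}_{next_start:%Y_%m} PARTITION OF {table} "
        f"FOR VALUES FROM ('{next_start}') TO ('{next_end}');",
        # drop-oldest: remove the partition that fell out of the retention window
        f"DROP TABLE IF EXISTS {table}_{oldest:%Y_%m};",
    ]


for statement in rotation_sql(date(2026, 6, 11), keep_months=3):
    print(statement)
```

Both statements act on whole partitions, so the job does not scan or delete individual DLQ rows.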

Spec + plan: planning/active/2026-06-11-operator-pages-{design,plan}.md.

Test plan

  • just docs-build (mkdocs build --strict) passes clean — no anchor warnings, no orphaned pages
  • just lint passes (eof-fixer, ruff format, ruff check, ty check)
  • Alembic autogenerate spike verified against spec prediction: three partial indexes with predicates acquired_token IS NULL / acquired_token IS NOT NULL / timer_id IS NOT NULL, all match
  • Reviewer: just docs-serve and confirm the sidebar lists the sections in this order: Overview → Getting started → Concepts → Guides → Reference → Operations
  • Reviewer: open Troubleshooting and confirm the TOC table jumps to the matching ## heading on click for every row (eleven symptoms)
  • Reviewer: open Alembic migrations and confirm the autogenerate code blocks render verbatim with the right partial-index predicates
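
The manual TOC check above could also be approximated by a script. The sketch below is one possible approach, not part of this PR: it assumes the TOC rows use `[text](#anchor)` links and that headings carry explicit attr_list ids of the form `{ #anchor }`. The sample page content and anchor names are invented for illustration.

```python
import re

# In-page links such as [details](#lease-lost)
LINK = re.compile(r"\]\(#([\w-]+)\)")
# Headings that end in an explicit attr_list id such as { #lease-lost }
ANCHOR = re.compile(r"^#{1,6}\s.*\{\s*#([\w-]+)\s*\}[ \t]*$", re.MULTILINE)


def unresolved_toc_links(markdown: str) -> list[str]:
    """Return link targets that have no matching explicit heading id."""
    anchors = set(ANCHOR.findall(markdown))
    return [target for target in LINK.findall(markdown) if target not in anchors]


# Hypothetical page: the second TOC row points at an id that does not exist.
SAMPLE = """\
| Symptom | Jump |
| --- | --- |
| Lease lost | [details](#lease-lost) |
| Outbox grows | [details](#outbox-growth) |

## `event=lease_lost` in logs { #lease-lost }

## Outbox keeps growing { #outbox-grows }
"""

print(unresolved_toc_links(SAMPLE))  # ['outbox-growth']
```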

Out of scope (per spec § Non-goals)

  • Migration-recipe regression tests in tests/integration.py — would pin the autogenerate output against drift; strong follow-up but adds test-suite scope.
  • Performance / benchmarking page (no public data)
  • Incident postmortem template (no incident history)
  • Promoting planning/architecture/ into user-facing docs (separate spec)

🤖 Generated with Claude Code

lesnik512 and others added 7 commits June 11, 2026 14:10
Three new pages under a new docs/operations/ section: Production
checklist (sizing/subscribers/DLQ/drain/schema/observability scaffold
linking into existing references), Troubleshooting playbook
(symptom → likely cause → diagnose → fix → reference, eleven symptoms),
Alembic migrations (literal autogenerate output + DLQ-addition-as-
second-migration + drift detection + partition-retention recipe).

Spec captures the design and structural calls; plan walks an executor
through eight tasks starting with an Alembic autogenerate spike to
ground the literal-sample claims.

The B follow-on from #50 (docs-landing-and-comparison).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Ran autogenerate against an empty Postgres 17 schema with
make_outbox_table() (and then with make_dlq_table() added) using
SQLAlchemy 2.0.50 + Alembic 1.18.4. The captured ops are pasted
verbatim into spec §4a and §4b under "Captured autogenerate output".

Verified against §4a's prediction: three partial indexes with
predicates acquired_token IS NULL / acquired_token IS NOT NULL /
timer_id IS NOT NULL, all matching. Three book-keeping columns that the
user-facing reference pages don't name explicitly (attempts_count,
first_attempt_at, last_attempt_at) are flagged in a note so the page
author transcribes them faithfully without editorializing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three stub pages under docs/operations/ (checklist.md,
troubleshooting.md, alembic.md) plus a fifth Operations section in
mkdocs nav. Content lands in follow-up commits per the plan.

Each stub is just the H1 — mkdocs build --strict is clean because the
nav refs resolve, and the rendered sidebar shows the new section in
place even before content arrives.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Sixteen items across six sections (Sizing, Subscribers, DLQ, Drain &
lifecycle, Schema, Observability). Each item is one to two lines
plus a relative link into the existing reference page that owns the
underlying detail — checklist is link-scaffold, not new prose.

Two link anchors into operations/troubleshooting.md and
operations/alembic.md don't resolve yet; T5 and T6 land those pages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Eleven symptom → cause → diagnose → fix → reference subsections,
plus a TOC table at the top with anchor jumps. Symptoms range from
operator-actionable (event=lease_lost, outbox growth patterns,
duplicate invocations) through "by design" surprises that need a
"don't fight this; here's what to do instead" explanation
(ACK_FIRST raises, OutboxResponse + foreign publisher nacked).

Each section uses explicit attr_list anchors so the TOC jumps are
stable across mkdocs slugifier behavior — backtick + equals +
slash in heading content otherwise collapse unpredictably.
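
To illustrate why explicit ids help, here is a rough sketch of what a punctuation-stripping slugifier does to a heading that contains a backtick and an equals sign. `approx_slug` is only an approximation written for this example, not the slugifier mkdocs actually runs, and the heading text and `lease-lost` id are hypothetical.

```python
import re

# Explicit attr_list id at the end of a heading, e.g. { #lease-lost }
EXPLICIT_ID = re.compile(r"\s*\{\s*#([\w-]+)\s*\}\s*$")


def approx_slug(text: str) -> str:
    """Approximate a default slugifier: drop punctuation, hyphenate whitespace."""
    cleaned = re.sub(r"[^\w\s-]", "", text).strip().lower()
    return re.sub(r"[-\s]+", "-", cleaned)


def heading_anchor(line: str) -> str:
    """Return the explicit id if the heading has one, else the approximated slug."""
    text = line.lstrip("#").strip()
    match = EXPLICIT_ID.search(text)
    if match:
        return match.group(1)
    return approx_slug(text)


# Without an explicit id, "=" and the backticks vanish and the words fuse.
print(heading_anchor("## `event=lease_lost` in logs"))
# With an explicit id, the anchor no longer depends on the heading text.
print(heading_anchor("## `event=lease_lost` in logs { #lease-lost }"))
```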

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Four sections (initial migration with the captured autogenerate
output annotated for load-bearing partial indexes, adding the DLQ as
an additive second migration, drift detection in CI via
validate_schema, and DLQ retention via range-partition rotation).

The DLQ retention walkthrough includes the one-time conversion to
PARTITION BY RANGE (failed_at) and a monthly cron snippet that
create-next + drop-oldest. Both are O(1) regardless of DLQ row count
and replace the row-by-row DELETE pattern at higher volume.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

One new decision-tree row in docs/index.md (Deploy to production
safely → Production checklist) and five one-line italic callouts
into the operator pages from existing reference pages:

- subscriber.md § Connection budget → Production checklist § Sizing
- subscriber.md § Slow handlers — dedicated queue → Troubleshooting §
  event=lease_lost
- dlq.md § Metric: dlq_written → Production checklist § DLQ
- dlq.md § Retention → Alembic migrations § DLQ retention via
  partition drop
- schema-validation.md § Where to call it → Alembic migrations §
  Drift detection in CI

Same italic + en-dash style the docs-landing-and-comparison PR (#50)
established for the Comparison callouts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lesnik512 lesnik512 merged commit 16337b7 into main Jun 11, 2026
3 checks passed
@lesnik512 lesnik512 deleted the docs/operator-pages branch June 11, 2026 13:19
lesnik512 added a commit that referenced this pull request Jun 11, 2026