diff --git a/docs/index.md b/docs/index.md index c7abb2a..15f148b 100644 --- a/docs/index.md +++ b/docs/index.md @@ -40,6 +40,7 @@ process, no Kafka. | Relay outbox rows to Kafka / RabbitMQ / NATS / Redis | [Relay to Kafka / RabbitMQ / NATS](usage/relay.md) | | Understand the architecture before adopting | [How it works](introduction/how-it-works.md) | | Compare against CDC / Kafka transactions / a hand-rolled outbox | [Comparison](concepts/comparison.md) | +| Deploy to production safely | [Production checklist](operations/checklist.md) | | Install and write the first publisher / subscriber | [Installation](introduction/installation.md) → [Basic usage](usage/basic.md) | ## Documentation diff --git a/docs/operations/alembic.md b/docs/operations/alembic.md new file mode 100644 index 0000000..c13076a --- /dev/null +++ b/docs/operations/alembic.md @@ -0,0 +1,238 @@ +# Alembic migrations + +The package never creates or migrates your schema — that's +[Alembic](https://alembic.sqlalchemy.org/)'s job. This page shows what +Alembic produces against `make_outbox_table()` and +`make_dlq_table()`, and gives recipes for drift detection and DLQ +retention via partition rotation. + +## Initial migration + +Add `make_outbox_table(metadata, table_name="outbox")` to whatever +`MetaData` your Alembic `env.py` exposes as `target_metadata`, then +run `alembic revision --autogenerate -m "outbox"`. The captured +`upgrade()` body (SQLAlchemy 2.0.50, Alembic 1.18.4, Postgres 17, +against an empty schema): + +```python +# ### commands auto generated by Alembic - please adjust! ### +op.create_table('outbox', + sa.Column('id', sa.BigInteger(), autoincrement=True, nullable=False), + sa.Column('queue', sa.String(length=255), nullable=False), + sa.Column('payload', sa.LargeBinary(), nullable=False), + sa.Column('headers', postgresql.JSONB(astext_type=Text()), nullable=True), + sa.Column('attempts_count', sa.BigInteger(), server_default='0', nullable=False), + sa.Column('deliveries_count', sa.BigInteger(), server_default='0', nullable=False), + sa.Column('created_at', sa.DateTime(timezone=True), server_default=sa.text('now()'), nullable=False), + sa.Column('next_attempt_at', sa.DateTime(timezone=True), server_default=sa.text('now()'), nullable=False), + sa.Column('first_attempt_at', sa.DateTime(timezone=True), nullable=True), + sa.Column('last_attempt_at', sa.DateTime(timezone=True), nullable=True), + sa.Column('acquired_at', sa.DateTime(timezone=True), nullable=True), + sa.Column('acquired_token', sa.Uuid(), nullable=True), + sa.Column('timer_id', sa.String(length=255), nullable=True), + sa.PrimaryKeyConstraint('id') +) +op.create_index('outbox_lease_idx', 'outbox', ['queue', 'acquired_at'], unique=False, + postgresql_where=sa.text('acquired_token IS NOT NULL')) +op.create_index('outbox_pending_idx', 'outbox', ['queue', 'next_attempt_at'], unique=False, + postgresql_where=sa.text('acquired_token IS NULL')) +op.create_index('outbox_timer_id_uq', 'outbox', ['queue', 'timer_id'], unique=True, + postgresql_where=sa.text('timer_id IS NOT NULL')) +# ### end Alembic commands ### +``` + +The three indexes carry **load-bearing partial predicates**: + +- `outbox_pending_idx` — `(queue, next_attempt_at) WHERE acquired_token IS NULL`. + Branch A of the fetch CTE (unleased rows) is served by this index. The + WHERE clause is written so Postgres' planner recognizes the implied + predicate and uses the partial index; drop the predicate and the planner + falls back to a seq-scan as the table grows. +- `outbox_lease_idx` — `(queue, acquired_at) WHERE acquired_token IS NOT NULL`. + Branch B of the fetch CTE (expired-lease reclaim) is served by this + index. Same story: predicate is load-bearing for fetch performance. +- `outbox_timer_id_uq` — unique `(queue, timer_id) WHERE timer_id IS NOT NULL`. + Backs the `timer_id` dedup contract via + `pg_insert(...).on_conflict_do_nothing(...)`. Without the partial + predicate, the unique constraint applies to all rows and breaks + non-timer publishes. + +Columns the operator can mostly ignore: `attempts_count` and +`deliveries_count` (broker book-keeping for "handler invocations" vs +"claims including expired-lease re-claims"), `first_attempt_at` and +`last_attempt_at` (debugging aids on retry-heavy rows). All four are +maintained by the broker; no application code touches them. + +The `# please adjust!` comment from Alembic is misleading here — +**don't adjust**. The column types, predicates, and indexes are +exactly what the broker depends on. The +[`validate_schema()`](../usage/schema-validation.md) check will refuse +to start a service whose live DB drifts from this declaration. + +## Adding the DLQ after the fact + +To opt into [Dead-letter queue](../usage/dlq.md) audit, add +`make_dlq_table(metadata, table_name="outbox_dlq")` to your +`MetaData` and run `alembic revision --autogenerate -m "outbox-dlq"`. +The captured `upgrade()` body: + +```python +# ### commands auto generated by Alembic - please adjust! ### +op.create_table('outbox_dlq', + sa.Column('id', sa.BigInteger(), autoincrement=True, nullable=False), + sa.Column('original_id', sa.BigInteger(), nullable=False), + sa.Column('queue', sa.String(length=255), nullable=False), + sa.Column('payload', sa.LargeBinary(), nullable=False), + sa.Column('headers', postgresql.JSONB(astext_type=Text()), nullable=True), + sa.Column('deliveries_count', sa.BigInteger(), nullable=False), + sa.Column('created_at', sa.DateTime(timezone=True), nullable=False), + sa.Column('failed_at', sa.DateTime(timezone=True), server_default=sa.text('now()'), nullable=False), + sa.Column('failure_reason', sa.String(length=64), nullable=False), + sa.Column('last_exception', sa.String(), nullable=True), + sa.PrimaryKeyConstraint('id') +) +op.create_index('outbox_dlq_queue_failed_idx', 'outbox_dlq', ['queue', 'failed_at'], unique=False) +# ### end Alembic commands ### +``` + +This is **purely additive**: no `op.alter_table` against the outbox +table itself, no column add, no constraint flip. The runtime change +that activates the DLQ is the broker's +[atomicity CTE](../usage/dlq.md#atomicity) (DELETE … RETURNING → INSERT +INTO ``), driven by `OutboxBroker(..., dlq_table=…)`, not by +schema state. + +The non-unique `outbox_dlq_queue_failed_idx` on `(queue, failed_at)` +supports "show me recent failures for queue X" queries and +double-duties as the pruning index when [§ DLQ retention via partition +drop](#dlq-retention-via-partition-drop) converts the table to +partitioned. + +## Drift detection in CI + +Run [`validate_schema()`](../usage/schema-validation.md) after +`alembic upgrade head` in your CI pipeline. A small standalone +script: + +```python +import asyncio + +from sqlalchemy import MetaData +from sqlalchemy.ext.asyncio import create_async_engine + +from faststream_outbox import OutboxBroker, make_outbox_table + + +async def main() -> None: + metadata = MetaData() + outbox_table = make_outbox_table(metadata, table_name="outbox") + engine = create_async_engine("postgresql+asyncpg://...") + broker = OutboxBroker(engine, outbox_table=outbox_table) + await broker.validate_schema() + + +asyncio.run(main()) +``` + +Non-zero exit on drift; CI fails before "deploy". + +This check is **opt-in for `/health`** and not always-on at +`broker.start()`. The reason: a running migration plus an always-on +validator would race. Operators must be able to roll forward a new +schema version without spinning every pod into a crash loop. The drift +check belongs *between* `alembic upgrade head` and the deploy step, +not inside the running service. + +If you also pass `dlq_table=make_dlq_table(metadata)` when +constructing the broker, `validate_schema()` checks both tables in +one call and surfaces drift on either one. + +## DLQ retention via partition drop { #dlq-retention-via-partition-drop } + +Plain `DELETE FROM outbox_dlq WHERE failed_at < now() - interval '90 +days'` works fine for low-volume DLQs (< ~1 GB / month) and needs no +schema change. For higher volume, converting the DLQ to a +range-partitioned table by `failed_at` lets you **drop entire +partitions** instead of deleting row by row — orders-of-magnitude +faster, no vacuum debt. + +### One-time migration to partitioned DLQ + +```python +# ### Convert outbox_dlq to range-partitioned by failed_at ### + +# Rename the existing table out of the way. +op.rename_table('outbox_dlq', 'outbox_dlq_old') + +# Create the partitioned parent. The PRIMARY KEY must include the +# partition key, so we extend it to (id, failed_at). +op.execute(""" + CREATE TABLE outbox_dlq ( + id BIGSERIAL NOT NULL, + original_id BIGINT NOT NULL, + queue VARCHAR(255) NOT NULL, + payload BYTEA NOT NULL, + headers JSONB, + deliveries_count BIGINT NOT NULL, + created_at TIMESTAMPTZ NOT NULL, + failed_at TIMESTAMPTZ NOT NULL DEFAULT now(), + failure_reason VARCHAR(64) NOT NULL, + last_exception TEXT, + PRIMARY KEY (id, failed_at) + ) PARTITION BY RANGE (failed_at); +""") + +op.execute(""" + CREATE INDEX outbox_dlq_queue_failed_idx + ON outbox_dlq (queue, failed_at); +""") + +# Create initial partitions covering recent + current month. +op.execute(""" + CREATE TABLE outbox_dlq_2026_05 PARTITION OF outbox_dlq + FOR VALUES FROM ('2026-05-01') TO ('2026-06-01'); + CREATE TABLE outbox_dlq_2026_06 PARTITION OF outbox_dlq + FOR VALUES FROM ('2026-06-01') TO ('2026-07-01'); +""") + +# Move surviving rows into the partitioned table; rows older than the +# retention window stay in outbox_dlq_old (or get dropped explicitly). +op.execute(""" + INSERT INTO outbox_dlq + SELECT * FROM outbox_dlq_old + WHERE failed_at >= '2026-05-01'; +""") + +op.drop_table('outbox_dlq_old') +``` + +`validate_schema()` continues to work against the partitioned table — +Alembic's autogenerate ignores partition boundaries when comparing +column shape. + +### Monthly cron: create next, drop oldest + +Run from a Postgres cron (or pgAgent, or a sidecar job) once a month. +Adjust the retention window to taste: + +```sql +DO $$ +DECLARE + next_month DATE := date_trunc('month', now() + interval '1 month'); + next_month_end DATE := next_month + interval '1 month'; + drop_month DATE := date_trunc('month', now() - interval '12 months'); + next_name TEXT := 'outbox_dlq_' || to_char(next_month, 'YYYY_MM'); + drop_name TEXT := 'outbox_dlq_' || to_char(drop_month, 'YYYY_MM'); +BEGIN + EXECUTE format( + 'CREATE TABLE IF NOT EXISTS %I PARTITION OF outbox_dlq + FOR VALUES FROM (%L) TO (%L);', + next_name, next_month, next_month_end + ); + EXECUTE format('DROP TABLE IF EXISTS %I;', drop_name); +END $$; +``` + +The pattern is "always have next month's partition before any row in +it could land" + "drop everything older than the retention window." +Both operations are O(1) regardless of DLQ row count. diff --git a/docs/operations/checklist.md b/docs/operations/checklist.md new file mode 100644 index 0000000..20fd55e --- /dev/null +++ b/docs/operations/checklist.md @@ -0,0 +1,83 @@ +# Production checklist + +Scannable scaffold of pre-launch checks. Each item is one to two lines; +the link points at the existing reference page that owns the full +story. + +## Sizing + +- [ ] **Engine pool ≥ `Σ subs × (max_workers + 1)`** — every + subscriber holds `max_workers + 1` SQLAlchemy pool connections (one + writer per worker + one fetch) plus one raw asyncpg connection for + `LISTEN`. Sub-budget formula in [Subscriber § Connection + budget](../usage/subscriber.md#connection-budget). +- [ ] **Postgres `max_connections` ≥ `replicas × Σ subs × (max_workers + 1)`** + — the formula is per-process; rolling deploys multiply it. + Failure mode: pods refuse with `FATAL: too many connections`. + +## Subscribers + +- [ ] **`lease_ttl_seconds` > handler P99 with margin** — otherwise + healthy in-flight handlers race their own lease expiry. The lease + cutoff is server-side `make_interval(...)`, immune to clock skew. + Tuning: [Subscriber § Slow handlers — dedicated + queue](../usage/subscriber.md#slow-handlers-dedicated-queue). +- [ ] **Slow handlers segregated** onto their own subscriber with a + taller `lease_ttl_seconds`. Don't raise it globally — that delays + reclaim of *actually* stuck rows everywhere. +- [ ] **`max_deliveries` set** (or knowingly unbounded). Default is + unbounded; pair with a non-`NoRetry()` retry strategy or + wedge-prone handlers can replay forever. +- [ ] **Retry strategy chosen.** Default + `ExponentialRetry(initial=1, multiplier=2, max=300, attempts=10, + jitter=0.2)` is fine for most. Opt into `NoRetry()` explicitly for + an audit feed. + +## DLQ + +- [ ] **`dlq_table=` configured** — opt-in but recommended for any + service where terminal failures need forensic recovery. See + [Dead-letter queue](../usage/dlq.md). +- [ ] **Alert on `nacked_terminal` rate vs `dlq_written` divergence** + — persistent divergence means either DLQ schema drift (CTE rolls + back) or `lease_ttl_seconds` too low. See [DLQ § Metric: + dlq_written](../usage/dlq.md#metric-dlq_written). +- [ ] **DLQ retention plan.** Partition by `failed_at` + cron-drop old + partitions, or a simple `DELETE … WHERE failed_at < interval` cron + for low volume. Walk-through: [Alembic migrations § DLQ retention via + partition drop](./alembic.md#dlq-retention-via-partition-drop). + +## Drain & lifecycle + +- [ ] **`graceful_timeout` ≥ handler P99 + margin** — otherwise + `OutboxSubscriber.stop()` cancels in-flight work and rows are + reclaimed mid-handler. +- [ ] **Kubernetes `terminationGracePeriodSeconds` ≥ broker + `graceful_timeout`** with margin for the parallel-subscriber drain. + The broker gathers subscriber drains in parallel, but k8s + `SIGKILL`s after the grace period regardless. + +## Schema + +- [ ] **`/health` calls `validate_schema()`** — opt-in; requires the + `[validate]` extra. Do **not** call at `broker.start()` — that + would crash-loop on a pending migration. See [Schema validation § + Where to call it](../usage/schema-validation.md#where-to-call-it). +- [ ] **Outbox `table_name` ≤ ~56 chars** — NOTIFY channel name is + `outbox_`. Postgres' 63-char identifier limit silently + truncates longer names and `LISTEN/NOTIFY` short-circuit degrades + to plain polling. + +## Observability + +- [ ] **`metrics_recorder` set, native middleware registered, or + both** — the recommended setup is both. See [Observability § + Layering](../usage/observability.md#layering-middleware-seam-vs-recorder-seam). +- [ ] **Alert on `lease_lost` rate** — non-zero means + `lease_ttl_seconds < handler P99` for at least one subscriber. See + [Troubleshooting § `event=lease_lost`](./troubleshooting.md#event-lease_lost-recurring-in-logs). +- [ ] **`LISTEN/NOTIFY` fallback warning checked at startup** — if + the asyncpg connection fails (driver missing, permission error), + the subscriber logs once and falls back to polling. Operator + silently lives with up-to-`max_fetch_interval` idle latency + otherwise. diff --git a/docs/operations/troubleshooting.md b/docs/operations/troubleshooting.md new file mode 100644 index 0000000..f7aa908 --- /dev/null +++ b/docs/operations/troubleshooting.md @@ -0,0 +1,265 @@ +# Troubleshooting + +Symptom → likely cause → fix. Each section below is the same shape: +what you see, what's probably wrong, how to confirm, what to change, +and a link into the reference page that owns the underlying design. + +| Symptom | Likely cause | +|---|---| +| [`event=lease_lost` recurring in logs](#event-lease_lost-recurring-in-logs) | Handler P99 > `lease_ttl_seconds` | +| [Outbox row count grows + `lease_lost` spike](#outbox-row-count-grows-lease_lost-spike) | DLQ CTE failing (DLQ schema drift) | +| [Outbox row count grows, no `lease_lost`](#outbox-row-count-grows-no-lease_lost) | Fetch loop not running, or rows future-dated | +| [Idle dispatch latency > `max_fetch_interval`](#idle-dispatch-latency-max_fetch_interval) | LISTEN setup failed → polling fallback | +| [Subscriber blocks at `broker.start()`](#subscriber-blocks-at-brokerstart) | Engine pool exhausted on writer-connection checkout | +| [Duplicate handler invocations](#duplicate-handler-invocations) | Lease expired before handler returned, or handler not idempotent | +| [Rolling deploy leaks rows](#rolling-deploy-leaks-rows) | `graceful_timeout` < handler P99, or k8s grace too short | +| [`activate_in` / `activate_at` fires immediately in tests](#activate_in-activate_at-fires-immediately-in-tests) | `TestOutboxBroker(run_loops=False)` ignores scheduling | +| [`AckPolicy.ACK_FIRST` raises `ValueError` at registration](#ackpolicyack_first-raises-valueerror-at-registration) | By design (would defeat outbox reliability) | +| [`OutboxResponse(...)` + foreign-publisher decorator gets nacked](#outboxresponse-foreign-publisher-decorator-gets-nacked) | By design (dual-fire footgun) | +| [`validate_schema()` raises `ImportError`](#validate_schema-raises-importerror) | `[validate]` extra not installed | + +## `event=lease_lost` recurring in logs { #event-lease_lost-recurring-in-logs } + +**Symptom.** WARNING-level logs with structured field +`event=lease_lost`, typically with `phase=terminal` or `phase=retry`, +one per affected row. + +**Likely cause.** The subscriber's `lease_ttl_seconds` is shorter than +the handler's P99 duration. A handler took longer than the lease, +another fetch reclaimed the row mid-flight, and the original handler's +terminal `DELETE` / `UPDATE` matched zero rows. + +**Diagnose.** Grep for `event=lease_lost` over the last hour and +compare the rate against `dispatched`. A non-zero baseline rate +(rather than occasional spikes) confirms TTL is the issue. + +**Fix.** Raise `lease_ttl_seconds` for the affected subscriber, OR +segregate slow work onto its own subscriber with a taller TTL +(recommended — keeps the fast queue's reclaim tight). TTL must exceed +handler P99 with margin. + +**Reference.** [Subscriber § Slow handlers — dedicated +queue](../usage/subscriber.md#slow-handlers-dedicated-queue). + +## Outbox row count grows + `lease_lost` spike { #outbox-row-count-grows-lease_lost-spike } + +**Symptom.** Two things at once: row count in the outbox table grows +without bound, *and* `event=lease_lost` log rate spikes. + +**Likely cause.** The DLQ CTE is failing on every terminal flush — +DLQ schema drift means the `INSERT INTO ` clause inside the +`WITH deleted AS (DELETE … RETURNING …)` statement rolls back the +DELETE too. Rows stay in the outbox, leases keep expiring, the +pattern compounds. + +**Diagnose.** Run `await broker.validate_schema()` against the live +DB (the `[validate]` extra is required). It will surface missing +columns / indexes on the DLQ table. + +**Fix.** Bring the DLQ schema up to spec (apply the missing migration, +or rename / drop the drifted column / index). After the schema is +correct, the next claim of each stuck row flushes through the CTE +and the outbox drains naturally. + +**Reference.** [DLQ § Atomicity](../usage/dlq.md#atomicity), [Schema +validation](../usage/schema-validation.md). + +## Outbox row count grows, no `lease_lost` { #outbox-row-count-grows-no-lease_lost } + +**Symptom.** Outbox rows accumulate, but logs are clean — no +`lease_lost`, no exceptions. + +**Likely cause.** Either no subscriber is registered for that queue, +or the rows are future-dated (`activate_in` / `activate_at` set) and +genuinely waiting to fire. + +**Diagnose.** Inspect a stuck row's `next_attempt_at` — if it's in the +future, the row is correctly waiting. Otherwise check whether a +subscriber is registered: walk `broker.subscribers` (the *property*, +which covers router-attached subscribers — `broker._subscribers` will +miss them). + +**Fix.** Register the subscriber, or adjust the producer's `activate_*` +arg if the future date was unintentional. + +**Reference.** [Subscriber](../usage/subscriber.md), [Router § Gotcha: +walking every subscriber](../usage/router.md#gotcha-walking-every-subscriber), +[Timers](../usage/timers.md). + +## Idle dispatch latency > `max_fetch_interval` { #idle-dispatch-latency-max_fetch_interval } + +**Symptom.** Rows arrive but take up to `max_fetch_interval` (default +10 s) to dispatch, even though no other rows are in flight. NOTIFY +should short-circuit the idle wait to ~10 ms. + +**Likely cause.** `LISTEN` setup failed at subscriber start. The raw +asyncpg connection that owns `LISTEN outbox_` is separate from +the SQLAlchemy fetch connection; common failure modes are: the +asyncpg driver isn't installed (no `[asyncpg]` extra), the engine URL +is not asyncpg, or Postgres user lacks `LISTEN` permission. + +**Diagnose.** Check startup logs for a WARNING noting NOTIFY fallback +to polling. The subscriber logs it once and continues without crashing. + +**Fix.** Install the `[asyncpg]` extra and use an asyncpg-driven +engine URL (`postgresql+asyncpg://...`). Restart the subscriber. + +**Reference.** [Installation § Optional extras +](../introduction/installation.md#optional-extras), [How it works § +Fetch loop](../introduction/how-it-works.md#subscriber-two-async-loops). + +## Subscriber blocks at `broker.start()` { #subscriber-blocks-at-brokerstart } + +**Symptom.** Process hangs on `broker.start()` (or the FastAPI +`include_router` lifespan) and never completes startup. + +**Likely cause.** SQLAlchemy pool exhausted on the per-worker writer +connection checkout. Each subscriber needs `max_workers + 1` pool +connections; the default pool is `pool_size=5, max_overflow=10`. A +handful of single-worker subscribers fits, but a fleet of high- +`max_workers` subscribers does not. + +**Diagnose.** Inspect the engine pool. Compute `Σ subs × (max_workers ++ 1)` from your subscriber registrations and compare to +`pool_size + max_overflow`. + +**Fix.** Raise `pool_size` / `max_overflow` on the engine, OR lower +`max_workers` per subscriber. Also confirm Postgres +`max_connections ≥ replicas × Σ subs × (max_workers + 1)` — rolling +deploys multiply the demand. + +**Reference.** [Subscriber § Connection +budget](../usage/subscriber.md#connection-budget), [Production +checklist § Sizing](./checklist.md#sizing). + +## Duplicate handler invocations + +**Symptom.** The same outbox row's handler runs more than once. Side +effects double up if the handler isn't idempotent. + +**Likely cause.** Either the handler's wall-clock duration exceeded +`lease_ttl_seconds` and another fetch reclaimed the row mid-flight, +or the worker crashed between the handler's external side effect and +the terminal `DELETE`. Both are at-least-once-delivery edge cases. + +**Diagnose.** Cross-reference handler-side logs (the side effect) +with `event=lease_lost` logs. Matching row IDs confirm TTL is too +short. Crash-induced duplicates correlate with worker-process +restarts. + +**Fix.** Two layers: (a) make handlers idempotent — this is a +contract of the outbox pattern, not a knob, and (b) tune +`lease_ttl_seconds` above handler P99 so healthy handlers don't +race their lease. + +**Reference.** [How it works § At-least-once +delivery](../introduction/how-it-works.md#at-least-once-delivery), +[Subscriber § Slow handlers — dedicated +queue](../usage/subscriber.md#slow-handlers-dedicated-queue). + +## Rolling deploy leaks rows + +**Symptom.** During a rolling restart, outbox rows are left in the +"acquired" state until lease expiry, even though handlers were +nominally healthy. Drain duration appears longer than expected. + +**Likely cause.** Either the broker's `graceful_timeout` is shorter +than the in-flight handler's remaining work, or Kubernetes +`terminationGracePeriodSeconds` is shorter than the broker's +`graceful_timeout` × parallel-drain factor — `SIGKILL` arrives mid- +drain. + +**Diagnose.** Time a clean shutdown locally (`docker compose kill -s +SIGTERM application`) and compare to your k8s grace period. Look for +log lines indicating drain abandonment. + +**Fix.** Raise `graceful_timeout` past handler P99 + margin. Raise +`terminationGracePeriodSeconds` past `graceful_timeout` + buffer for +parallel-subscriber drain. The `dispatch_one` shutdown-race guard is +always on; you don't need to opt into it. + +**Reference.** [Production checklist § Drain & +lifecycle](./checklist.md#drain-lifecycle). + +## `activate_in` / `activate_at` fires immediately in tests { #activate_in-activate_at-fires-immediately-in-tests } + +**Symptom.** A unit test publishes a row with `activate_in=30s` and +the handler runs synchronously inside `await broker.publish(...)`. + +**Likely cause.** By design. `TestOutboxBroker(run_loops=False)` +(the default) drives handlers synchronously through `dispatch_one`, +which ignores `next_attempt_at`. This is the documented test-broker +contract — trades production parity for test ergonomics. + +**Diagnose.** Check the call site: `TestOutboxBroker(broker)` → +sync mode, expected immediate firing. + +**Fix.** Opt into `TestOutboxBroker(broker, run_loops=True)` for +tests that need scheduled delivery to actually wait. Loop mode runs +the real `_fetch_loop` / `_worker_loop` against the fake client. + +**Reference.** [Testing § Loop-driven +mode](../usage/testing.md#loop-driven-mode), [Timers § Test broker +note](../usage/timers.md#test-broker-note). + +## `AckPolicy.ACK_FIRST` raises `ValueError` at registration { #ackpolicyack_first-raises-valueerror-at-registration } + +**Symptom.** `@broker.subscriber("q", ack_policy=AckPolicy.ACK_FIRST)` +fails with `ValueError` at decoration time. + +**Likely cause.** By design. `ACK_FIRST` would delete the outbox row +*before* the handler runs, so a handler crash would silently drop +the message — exactly the failure mode the outbox pattern exists to +prevent. + +**Diagnose.** None needed; the message identifies the policy. + +**Fix.** Use the default `AckPolicy.NACK_ON_ERROR` (retry on handler +exception via the configured retry strategy), or +`AckPolicy.REJECT_ON_ERROR` (delete on first failure), or +`AckPolicy.MANUAL` (handler calls `ack` / `nack` / `reject`). + +**Reference.** [Subscriber § Ack +policy](../usage/subscriber.md#ack-policy). + +## `OutboxResponse(...)` + foreign-publisher decorator gets nacked { #outboxresponse-foreign-publisher-decorator-gets-nacked } + +**Symptom.** A handler with both `@kafka_pub` and an +`OutboxResponse(...)` return value gets nacked on every dispatch, with +a `_OutboxConfigError` logged. + +**Likely cause.** By design. The combination would both insert a row +into the outbox *and* publish to Kafka — a dual-fire that doubles +delivery. The subscriber refuses the chain composition by raising +`_OutboxConfigError` via the `process_message` override; it rides the +normal nack path so the row is retried (and logged) until the +configuration is fixed. + +**Diagnose.** Inspect the handler decorator stack and return type. + +**Fix.** Pick one path. Either `return body` plain (the foreign +publisher picks it up) or `return OutboxResponse(body, queue="...", +session=...)` (an outbox-internal chain) but not both. + +**Reference.** [Relay § What not to do](../usage/relay.md#what-not-to-do), +[Publisher § Chained +publishing](../usage/publisher.md#chained-publishing). + +## `validate_schema()` raises `ImportError` { #validate_schema-raises-importerror } + +**Symptom.** Calling `await broker.validate_schema()` raises +`ImportError("requires alembic")`. + +**Likely cause.** The `[validate]` extra isn't installed. Alembic is +an optional dependency by design — every other code path works +without it, but the schema validator delegates to Alembic's +`autogenerate.compare_metadata` and so requires it. + +**Diagnose.** `pip show alembic` returns nothing, or +`pip list | grep alembic` is empty. + +**Fix.** `pip install 'faststream-outbox[validate]'`. The validator +runs unchanged after that; nothing else in the package needs to +change. + +**Reference.** [Schema validation](../usage/schema-validation.md). diff --git a/docs/usage/dlq.md b/docs/usage/dlq.md index d3d65fe..665e92c 100644 --- a/docs/usage/dlq.md +++ b/docs/usage/dlq.md @@ -159,6 +159,8 @@ both are operator-actionable signals. See [Observability](./observability.md) for the broader recorder + middleware story. +*Operator playbook: [Production checklist § DLQ](../operations/checklist.md#dlq).* + ## Retention There is no built-in pruning. Operators are responsible for archival or @@ -173,6 +175,8 @@ inflow. For low-volume DLQs a plain `DELETE FROM WHERE failed_at < now() - interval '90 days'` from a daily cron is enough. +*Step-by-step: [Alembic migrations § DLQ retention via partition drop](../operations/alembic.md#dlq-retention-via-partition-drop).* + ## Test broker `TestOutboxBroker` accumulates audit rows in diff --git a/docs/usage/schema-validation.md b/docs/usage/schema-validation.md index ed1db3b..de04e2d 100644 --- a/docs/usage/schema-validation.md +++ b/docs/usage/schema-validation.md @@ -75,6 +75,8 @@ async def main() -> None: asyncio.run(main()) ``` +*CI recipe: [Alembic migrations § Drift detection in CI](../operations/alembic.md#drift-detection-in-ci).* + ## In tests `FakeOutboxClient.validate_schema()` raises `NotImplementedError` — there diff --git a/docs/usage/subscriber.md b/docs/usage/subscriber.md index 755f55f..c564e7a 100644 --- a/docs/usage/subscriber.md +++ b/docs/usage/subscriber.md @@ -110,6 +110,8 @@ stuck-row reclaim fast; the tall TTL on the slow queue tolerates outliers without slowing reclaim of genuinely stuck rows elsewhere. Producers route to the appropriate queue at `publish` time. +*See also [Troubleshooting § `event=lease_lost`](../operations/troubleshooting.md#event-lease_lost-recurring-in-logs).* + ## Ack policy The default is `AckPolicy.NACK_ON_ERROR`: on a handler exception, the retry @@ -206,6 +208,8 @@ Postgres `max_connections` needs to cover `replicas × Σ subscribers × (max_workers + 1)` — otherwise additional replicas (or rolling deployments) will be refused at startup with `FATAL: too many connections`. +*Operator-side: [Production checklist § Sizing](../operations/checklist.md#sizing).* + ## Read-only inspection `subscriber.get_one()` and `async for msg in subscriber:` are **not diff --git a/mkdocs.yml b/mkdocs.yml index 665bff9..5244266 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -23,6 +23,10 @@ nav: - Router: usage/router.md - Dead-letter queue: usage/dlq.md - Observability: usage/observability.md + - Operations: + - Production checklist: operations/checklist.md + - Troubleshooting: operations/troubleshooting.md + - Alembic migrations: operations/alembic.md theme: name: material features: diff --git a/planning/README.md b/planning/README.md index 137629b..27f00a3 100644 --- a/planning/README.md +++ b/planning/README.md @@ -11,7 +11,10 @@ points. ## Active -_None._ +- **[operator-pages](active/2026-06-11-operator-pages-design.md)** + — Three new pages under a new `docs/operations/` section: Production + checklist, Troubleshooting playbook, Alembic migrations. The B + follow-on from #50. ## Archived (shipped) diff --git a/planning/active/2026-06-11-operator-pages-design.md b/planning/active/2026-06-11-operator-pages-design.md new file mode 100644 index 0000000..5b037bb --- /dev/null +++ b/planning/active/2026-06-11-operator-pages-design.md @@ -0,0 +1,520 @@ +--- +status: draft +date: 2026-06-11 +slug: operator-pages +supersedes: null +superseded_by: null +pr: null +outcome: null +--- + +# Design: Operator pages — Production checklist, Troubleshooting, Alembic migrations + +## Summary + +Add three new pages under a new `docs/operations/` directory, surfaced +as a fifth top-level **Operations** section in the mkdocs nav: + +1. **Production checklist** — scannable scaffold of one-line items + covering sizing, subscribers, DLQ, drain, schema, and observability. + Each item links into the existing reference page that already owns + the full story. +2. **Troubleshooting** — symptom → likely cause → fix playbook for the + load-bearing signals operators see: `event=lease_lost`, outbox + growth, idle latency above `max_fetch_interval`, duplicate + invocations, rolling-deploy row leaks, test-broker scheduling + surprises, and a handful of "by design" raises. +3. **Alembic migrations** — what `alembic revision --autogenerate` + actually produces for `make_outbox_table()` and + `make_dlq_table()`, including the partial indexes and the DLQ- + addition-as-second-migration shape; plus a drift-detection-in-CI + recipe and DLQ partition-retention recipe. + +Cross-links from five existing pages point at the relevant operator-page +section. No file moves; no existing prose rewritten. + +The B follow-on from +[`docs-landing-and-comparison`](../archived/2026-06-10-docs-landing-and-comparison-design.md) +(non-goal §B). The conventions and IA those PRs landed (#49, #50) +anticipated this — the docs spec noted "Operations could become a fifth +section." + +## Motivation + +The architecture already exposes the right signals (`event=lease_lost`, +`fetched`/`dispatched`/`acked`/`nacked_*`/`lease_lost`/`dlq_written` +recorder events, three terminal-failure reasons, partial-index +predicates that are load-bearing for fetch performance, a +load-bearing dispatch-shutdown race guard, etc.). The reference pages +document each *in place*, but **an operator deploying for the first time +must assemble the checklist from five pages**: + +- `subscriber.md § Connection budget` — engine pool sizing. +- `subscriber.md § Slow handlers — dedicated queue` — `lease_ttl_seconds` + vs handler P99 trade-off. +- `dlq.md § Metric: dlq_written` — alerting on `nacked_terminal` vs + `dlq_written` divergence. +- `dlq.md § Retention` — partition + cron-prune pattern. +- `observability.md § Consume vs publish label set` — PromQL playbook + queries, including the "lease_ttl_seconds is too low" operator + signal. +- `schema-validation.md § Where to call it` — `/health` vs startup + hook reasoning. + +That assembly is exactly the cost the Production Checklist removes. + +The **Troubleshooting** page is the matching artifact for incident +response. The library emits structured `extra={"event": ..., "phase": +..., "row_id": ..., "queue": ..., "deliveries_count": ...}` fields on +WARNING-level logs (the lease-lost path is the canonical example). The +operator's log aggregator surfaces these. Today there is no page to +link from those WARNINGs that says "this means your `lease_ttl_seconds` +is too low for the handler's P99 — see §X for the tuning guide." + +The **Alembic migrations** page is the only piece without an existing +home in the docs. The Basic-usage page says "the package never creates +or migrates — that's Alembic's job" but never shows what Alembic +should generate. A literal autogenerate-output sample is ~30 lines and +saves the new adopter a half-hour of figuring out which indexes to +include, what the partial-index predicates look like, and how the +`String(64)` columns are declared. + +## Non-goals + +Deliberately *not* covered here; each is a candidate follow-on: + +- **Migration-recipe regression tests** in `tests/integration.py`. The + spike below produces the literal autogenerate output once; an + integration test that runs migrations against the postgres fixture + and asserts `validate_schema()` is clean would pin it against drift. + Strong follow-up, but adds test-suite scope; deferred. + +- **Performance / benchmarking page.** No representative public data to + share. A "what throughput can I expect?" page that hedges everything + is worse than no page. + +- **Incident postmortem template.** The library has no customer-incident + track record yet. A template without examples is just bureaucracy. + +- **Promoting `planning/architecture/` content into user-facing docs.** + Several `architecture/*.md` deep-dives (relay, timers, DLQ, drain, + metrics, test broker) would inform operators too, but they currently + also encode design rationale not relevant to operators. Promotion + would require splitting each. Separate spec. + +- **Runbook generator from frontmatter.** Three hand-maintained pages + are fine at the operator-page-count we have; automate later if it + grows. + +- **Rewriting any existing prose.** This spec adds pages and links into + them. No content currently in `docs/usage/` or + `docs/introduction/` is rewritten; the existing reference pages stay + the authoritative source on their topic. + +## Design + +### 1. New `docs/operations/` directory + nav section + +Create `docs/operations/` and add a fifth nav section to `mkdocs.yml`: + +```yaml +nav: + - Overview: index.md + - Getting started: ... + - Concepts: ... + - Guides: ... + - Reference: ... + - Operations: + - Production checklist: operations/checklist.md + - Troubleshooting: operations/troubleshooting.md + - Alembic migrations: operations/alembic.md +``` + +Material theme's expanded sidebar handles five sections cleanly — the +section count is comparable to SQLAlchemy 2.0's docs site. The four +existing sections from the `docs-landing-and-comparison` PR are +unchanged. + +The `docs/index.md` decision-tree table is updated with one new row +pointing at the checklist: + +| If you want to… | Start at | +|---|---| +| Deploy to production safely | [Production checklist](operations/checklist.md) | + +(Inserted before the existing "Install and write the first publisher / +subscriber" row.) + +### 2. Production checklist (`docs/operations/checklist.md`) + +Scannable scaffold. Each item is one to two lines plus a link into the +existing reference page that owns the full story. **No new prose for +the underlying decisions** — the page exists to make sure operators +don't *miss* the references, not to replace them. + +Sections, in order: + +**Sizing** + +- [ ] **Engine pool ≥ `Σ subs × (max_workers + 1)`** — every + subscriber holds `max_workers + 1` SQLAlchemy connections + (one writer per worker + one fetch) plus one raw asyncpg + connection for LISTEN. Sub-budget formula in [Subscriber § + Connection budget](../usage/subscriber.md#connection-budget). +- [ ] **Postgres `max_connections` ≥ `replicas × Σ subs × (max_workers + 1)`** + — the formula is per-process; rolling deploys multiply it. + Failure mode: pods refuse with `FATAL: too many connections`. + +**Subscribers** + +- [ ] **`lease_ttl_seconds` > handler P99 with margin** — otherwise + healthy in-flight handlers race their own lease expiry. The + lease cutoff is server-side `make_interval(...)`, immune to + clock skew. Tuning: [Subscriber § Slow handlers — dedicated + queue](../usage/subscriber.md#slow-handlers-dedicated-queue). +- [ ] **Slow handlers segregated** onto their own subscriber with a + taller `lease_ttl_seconds`. Don't raise it globally — that + delays reclaim of *actually* stuck rows everywhere. +- [ ] **`max_deliveries` set** (or knowingly unbounded). Defaults to + unbounded; pair with a non-`NoRetry()` retry strategy or wedge- + prone handlers can replay forever. +- [ ] **Retry strategy chosen.** Default + `ExponentialRetry(initial=1, multiplier=2, max=300, attempts=10, + jitter=0.2)` is fine for most. Opt into `NoRetry()` explicitly + for an audit feed. + +**DLQ** + +- [ ] **`dlq_table=` configured** — opt-in but recommended for any + service where terminal failures need forensic recovery. +- [ ] **Alert on `nacked_terminal` rate vs `dlq_written` divergence** + — persistent divergence means either DLQ schema drift (CTE + rolls back) or `lease_ttl_seconds` too low. See [DLQ § Metric: + dlq_written](../usage/dlq.md#metric-dlq_written). +- [ ] **DLQ retention plan.** Partition by `failed_at` + cron-drop + old partitions, or a simple `DELETE … WHERE failed_at < + interval` cron for low volume. + +**Drain & lifecycle** + +- [ ] **`graceful_timeout` ≥ handler P99 + margin** — otherwise + `OutboxSubscriber.stop()` cancels in-flight work and rows are + reclaimed mid-handler. +- [ ] **K8s `terminationGracePeriodSeconds` ≥ broker + `graceful_timeout` × parallel-subscriber-count factor** — the + broker gathers subscriber drains in parallel, but k8s SIGKILLs + after the grace period regardless. + +**Schema** + +- [ ] **`/health` calls `validate_schema()`** — opt-in, requires + `[validate]` extra. Do **not** call at `broker.start()` — that + would crash-loop on a pending migration. See [Schema validation + § Where to call it](../usage/schema-validation.md#where-to-call-it). +- [ ] **Outbox `table_name` ≤ ~56 chars** — NOTIFY channel name is + `outbox_`. Postgres' 63-char identifier limit + silently truncates longer names and `LISTEN/NOTIFY` + short-circuit degrades to plain polling. + +**Observability** + +- [ ] **`metrics_recorder` set, native middleware registered, or + both** — the recommended setup is both. See + [Observability § Layering](../usage/observability.md#layering-middleware-seam-vs-recorder-seam). +- [ ] **Alert on `lease_lost` rate** — non-zero means + `lease_ttl_seconds < handler P99` for at least one subscriber. +- [ ] **`LISTEN/NOTIFY` fallback warning checked at startup** — if + the asyncpg connection fails (driver missing, permission error), + the subscriber logs once and falls back to polling. Operator + silently lives with up-to-`max_fetch_interval` idle latency + otherwise. + +### 3. Troubleshooting (`docs/operations/troubleshooting.md`) + +Symptom → likely cause → fix, with a table-of-contents table at the +top for fast jump: + +| Symptom | Likely cause | +|---|---| +| `event=lease_lost` recurring in logs | Handler P99 > `lease_ttl_seconds` | +| Outbox row count grows + `lease_lost` spike | DLQ CTE failing (DLQ schema drift) | +| Outbox row count grows, no `lease_lost` | Fetch loop not running, or rows future-dated | +| Idle dispatch latency > `max_fetch_interval` | LISTEN setup failed → polling fallback | +| Subscriber blocks at `broker.start()` | Engine pool exhausted on writer-connection checkout | +| Duplicate handler invocations | Lease expired before handler returned, or handler not idempotent | +| Rolling deploy leaks rows | `graceful_timeout` < handler P99, or k8s grace too short | +| `activate_in` / `activate_at` fires immediately in tests | `TestOutboxBroker(run_loops=False)` ignores scheduling | +| `AckPolicy.ACK_FIRST` raises `ValueError` at registration | By design (would defeat outbox reliability) | +| `OutboxResponse(...)` + foreign-publisher decorator gets nacked | By design (dual-fire footgun, raised via dispatch overrides) | +| `validate_schema()` raises `ImportError` | `[validate]` extra not installed | + +Each row is a `##` subsection below, in the same order. Per subsection: + +- **Symptom** — what the operator sees, exactly. Log line, metric shape, + user complaint. +- **Likely cause** — the load-bearing invariant or knob that produced it. +- **Diagnose** — the command, metric, or log query that confirms. +- **Fix** — the knob to turn or code change to make. +- **Reference** — link into the relevant section of the existing + reference pages. + +Example (lease_lost): + +> ### `event=lease_lost` recurring in logs +> +> **Symptom.** WARNING-level logs with structured field `event=lease_lost`, +> typically with `phase=terminal` or `phase=retry`. One per affected row. +> +> **Likely cause.** The subscriber's `lease_ttl_seconds` is shorter than +> the handler's P99 duration. A handler took longer than the lease, +> another fetch reclaimed the row mid-flight, and the original handler's +> terminal `DELETE`/`UPDATE` matched zero rows. +> +> **Diagnose.** Grep for `event=lease_lost` over the last hour and +> compare the rate against `dispatched`. A non-zero baseline rate +> (rather than occasional spikes) confirms TTL is the issue. +> +> **Fix.** Raise `lease_ttl_seconds` for the affected subscriber, OR +> segregate slow work onto its own subscriber with a taller TTL +> (recommended — see [Subscriber § Slow +> handlers](../usage/subscriber.md#slow-handlers-dedicated-queue)). +> Pick TTL > handler P99 with margin for clock-skew tolerance. + +All ten symptoms get the same five-field shape. Total page length ~400 +lines. + +### 4. Alembic migrations (`docs/operations/alembic.md`) + +Sections: + +**4a. Initial migration.** What `alembic revision --autogenerate` +produces against a `MetaData` containing only `make_outbox_table()`. + +The spec phase includes a **spike**: run autogenerate against a clean +postgres + `make_outbox_table(metadata, table_name="outbox")`, capture +the literal output, paste it into this section verbatim under +"What you get" and annotate inline why each piece exists. The spike is +done by the spec author *before* writing this section; the resulting +sample lives in the spec body as the canonical reference (so the plan +author doesn't have to re-run it). + +**Captured autogenerate output** (SQLAlchemy 2.0.50, Alembic 1.18.4, +Postgres 17, against an empty schema with a single +`make_outbox_table(metadata, table_name="outbox")` in `target_metadata`): + +```python +# ### commands auto generated by Alembic - please adjust! ### +op.create_table('outbox', + sa.Column('id', sa.BigInteger(), autoincrement=True, nullable=False), + sa.Column('queue', sa.String(length=255), nullable=False), + sa.Column('payload', sa.LargeBinary(), nullable=False), + sa.Column('headers', postgresql.JSONB(astext_type=Text()), nullable=True), + sa.Column('attempts_count', sa.BigInteger(), server_default='0', nullable=False), + sa.Column('deliveries_count', sa.BigInteger(), server_default='0', nullable=False), + sa.Column('created_at', sa.DateTime(timezone=True), server_default=sa.text('now()'), nullable=False), + sa.Column('next_attempt_at', sa.DateTime(timezone=True), server_default=sa.text('now()'), nullable=False), + sa.Column('first_attempt_at', sa.DateTime(timezone=True), nullable=True), + sa.Column('last_attempt_at', sa.DateTime(timezone=True), nullable=True), + sa.Column('acquired_at', sa.DateTime(timezone=True), nullable=True), + sa.Column('acquired_token', sa.Uuid(), nullable=True), + sa.Column('timer_id', sa.String(length=255), nullable=True), + sa.PrimaryKeyConstraint('id') +) +op.create_index('outbox_lease_idx', 'outbox', ['queue', 'acquired_at'], unique=False, + postgresql_where=sa.text('acquired_token IS NOT NULL')) +op.create_index('outbox_pending_idx', 'outbox', ['queue', 'next_attempt_at'], unique=False, + postgresql_where=sa.text('acquired_token IS NULL')) +op.create_index('outbox_timer_id_uq', 'outbox', ['queue', 'timer_id'], unique=True, + postgresql_where=sa.text('timer_id IS NOT NULL')) +# ### end Alembic commands ### +``` + +Verified against spec prediction: three partial indexes with predicates +`acquired_token IS NULL`, `acquired_token IS NOT NULL`, and +`timer_id IS NOT NULL` — all match. Index names (`outbox_pending_idx`, +`outbox_lease_idx`, `outbox_timer_id_uq`) come from +`make_outbox_table` itself. + +The annotation in the operator page explains that **each partial +index's predicate is load-bearing** — Postgres only uses the index +when the query implies the predicate, and the fetch CTE's WHERE clause +is written to do exactly that. An operator who drops a +"redundant-looking" index makes the CTE fall back to seq-scan as the +table grows. + +Two columns are present that the user-facing reference pages don't +name explicitly (`attempts_count`, `first_attempt_at`, +`last_attempt_at`): book-keeping fields the broker uses internally to +distinguish "attempts" (handler invocations) from "deliveries" (claims +including expired-lease re-claims). The operator page mentions them in +passing but does not editorialize — they're bit-for-bit faithful copies +of `make_outbox_table`'s declaration. + +**4b. Adding the DLQ after the fact.** When you introduce +`dlq_table=make_dlq_table(metadata)` to your `MetaData` later, +autogenerate produces a second migration. **Captured output** (same +versions as §4a, against a schema that already has the outbox table, +with `make_dlq_table(metadata, table_name="outbox_dlq")` added to +`target_metadata`): + +```python +# ### commands auto generated by Alembic - please adjust! ### +op.create_table('outbox_dlq', + sa.Column('id', sa.BigInteger(), autoincrement=True, nullable=False), + sa.Column('original_id', sa.BigInteger(), nullable=False), + sa.Column('queue', sa.String(length=255), nullable=False), + sa.Column('payload', sa.LargeBinary(), nullable=False), + sa.Column('headers', postgresql.JSONB(astext_type=Text()), nullable=True), + sa.Column('deliveries_count', sa.BigInteger(), nullable=False), + sa.Column('created_at', sa.DateTime(timezone=True), nullable=False), + sa.Column('failed_at', sa.DateTime(timezone=True), server_default=sa.text('now()'), nullable=False), + sa.Column('failure_reason', sa.String(length=64), nullable=False), + sa.Column('last_exception', sa.String(), nullable=True), + sa.PrimaryKeyConstraint('id') +) +op.create_index('outbox_dlq_queue_failed_idx', 'outbox_dlq', ['queue', 'failed_at'], unique=False) +# ### end Alembic commands ### +``` + +This is purely additive — no `op.alter_table` against the outbox +table itself. The existing outbox path stays bit-for-bit identical; +only the terminal-flush statement changes to the CTE shape (a runtime +decision driven by `OutboxBroker.dlq_table`, not a schema change). + +Index `outbox_dlq_queue_failed_idx` is non-unique on `(queue, +failed_at)` — supports "recent failures for queue X" queries and +double-duties as the partition-pruning index when the page §4d +walkthrough converts the DLQ to partitioned. + +**4c. Drift detection in CI.** A small standalone script using +`validate_schema()`: + +```python +import asyncio +from faststream_outbox import OutboxBroker, make_outbox_table + +async def main() -> None: + broker = OutboxBroker(engine, outbox_table=outbox_table) + await broker.validate_schema() + +asyncio.run(main()) +``` + +Run after `alembic upgrade head` in CI; non-zero exit on drift. + +The page explains why this is **opt-in for `/health`** and not always- +on at startup: a running migration plus an always-on validator races — +operators must be able to roll forward a new schema version without +spinning every pod into a crash loop. The drift check belongs in CI +between `alembic upgrade head` and "deploy". + +**4d. DLQ retention via partition drop.** A walkthrough of converting +the DLQ from a plain table to a partitioned-by-`failed_at` shape, with +the literal Alembic ops for: + +- Renaming the existing table out of the way. +- Creating the partitioned parent with `PARTITION BY RANGE (failed_at)`. +- Creating initial partitions. +- Copying rows from the old table into the partitions. +- A monthly cron script (raw SQL, not Alembic) that creates next + month's partition and drops the partition older than the retention + window. + +This is the operator-facing version of the +[DLQ § Retention](../usage/dlq.md#retention) paragraph, which today +gestures at the pattern without showing the SQL. + +### 5. Cross-links from existing pages + +Each existing page that the operator pages link *from* gets a one-line +"see also" callout pointing at the relevant operator-page section. +Symmetry with the cross-links the `docs-landing-and-comparison` PR added +into the Comparison page. + +| Existing page section | New callout | +|---|---| +| `subscriber.md § Connection budget` | "Operator-side: [Production checklist § Sizing](../operations/checklist.md#sizing)." | +| `subscriber.md § Slow handlers — dedicated queue` | "See also [Troubleshooting § event=lease_lost](../operations/troubleshooting.md#event-lease_lost-recurring-in-logs)." | +| `dlq.md § Metric: dlq_written` | "Operator playbook: [Production checklist § DLQ](../operations/checklist.md#dlq)." | +| `dlq.md § Retention` | "Step-by-step: [Alembic migrations § DLQ retention via partition drop](../operations/alembic.md#dlq-retention-via-partition-drop)." | +| `schema-validation.md § Where to call it` | "CI recipe: [Alembic migrations § Drift detection in CI](../operations/alembic.md#drift-detection-in-ci)." | + +Five callouts. No prose rewritten. + +## Operations + +None — fully in-repo. The mkdocs deploy workflow re-runs on push to +`main` whenever `docs/**` or `mkdocs.yml` changes, both of which this +spec triggers. The new URLs become available immediately: + +- `https://faststream-outbox.modern-python.org/operations/checklist/` +- `https://faststream-outbox.modern-python.org/operations/troubleshooting/` +- `https://faststream-outbox.modern-python.org/operations/alembic/` + +## Out of scope (repeat list) + +Already named under Non-goals; repeated here for grep: + +- Migration-recipe regression tests in `tests/integration.py` +- Performance / benchmarking page +- Incident postmortem template +- Promoting `planning/architecture/` content into user-facing docs +- Runbook generator from frontmatter +- Rewriting any existing prose + +## Testing + +Content + nav config; correctness is checked by: + +- `just docs-build` (the new `mkdocs build --strict` target) passes + locally and in the deploy workflow. Catches broken cross-links from + the five existing-page callouts and from inside the new pages. +- `just lint` passes (markdown EOF + YAML formatting on + `mkdocs.yml`). +- Spot-check on PR preview that all eight sub-sections of the + Production checklist render with link targets, that the + Troubleshooting TOC table jumps to the right sub-section anchors, + and that the Alembic autogenerate sample renders inside its code + block. +- The Alembic spike output (captured during the spec phase below) is + consistent with the model declarations in + `src/faststream_outbox/store/schema.py` — checked by re-running + autogenerate once at the start of the implementation plan, before + pasting into `alembic.md`, against a fresh postgres. + +## Risk + +- **Alembic autogenerate output drifts** between when the spec spike + captured it and when an operator runs it. SQLAlchemy and Alembic + evolve their autogenerate diffs over time. Mitigated by the + follow-up regression test (out of scope here but called out as the + natural next step). For now: the spike output is captured as a + *representative* sample; the page calls that out so an operator + knows to compare to their own autogenerate. + +- **Checklist becomes stale** as subscriber options, default values, + or new metric events ship. Mitigated by item-level cross-links: when + someone changes `lease_ttl_seconds` default, they touch + `subscriber.md § Slow handlers`, the docs review surfaces the + cross-link, and they update the checklist line at the same time. + This is the same hygiene that `planning/architecture/` already + depends on. + +- **Troubleshooting page becomes prescriptive of fixes that mask + underlying problems** (e.g. "raise `lease_ttl_seconds`" when the + real issue is that the handler is doing too much work). Mitigated by + always linking each fix back to the design rationale in the + reference / concept page — the operator sees the principle, not just + the knob. + +- **Nav grows to six sections later** (Operations + a hypothetical + Recipes, Migration guides, etc.). Material handles six sections, but + the sidebar starts feeling busy. Acceptable risk; revisit if it + happens. + +- **`docs/operations/` collides with mkdocs-material reserved paths.** + None known; Material's reserved paths are theme partials under + `overrides/`. Confirmed by visual inspection of existing + `mkdocs.yml` and the lack of any `overrides:` block. Spike: trial + build with one empty placeholder file catches a collision at + `mkdocs build --strict` time. diff --git a/planning/active/2026-06-11-operator-pages-plan.md b/planning/active/2026-06-11-operator-pages-plan.md new file mode 100644 index 0000000..14b2cbf --- /dev/null +++ b/planning/active/2026-06-11-operator-pages-plan.md @@ -0,0 +1,471 @@ +--- +status: draft +date: 2026-06-11 +slug: operator-pages +spec: operator-pages +pr: null +--- + +# operator-pages — implementation plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use +> superpowers:subagent-driven-development (recommended) or +> superpowers:executing-plans to implement this plan task-by-task. Steps +> use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Add three operator-facing pages (Production checklist, +Troubleshooting playbook, Alembic migrations) under a new +`docs/operations/` directory and a fifth top-level **Operations** nav +section, with five one-line cross-link callouts from existing reference +pages. + +**Spec:** [`planning/active/2026-06-11-operator-pages-design.md`](./2026-06-11-operator-pages-design.md) + +**Branch:** `docs/operator-pages` + +**Commit strategy:** Per-task commits. Tasks 2–7 each produce a +commit and each leaves the docs site `--strict`-buildable. Task 8 is +verification-only. + +--- + +### Task 1: Branch + commit spec, plan, and README index update + +**Files:** +- Create: `planning/active/2026-06-11-operator-pages-design.md` (already drafted in working tree) +- Create: `planning/active/2026-06-11-operator-pages-plan.md` (this file, already drafted) +- Modify: `planning/README.md` + +Land the planning artifacts and surface them in the index before +touching any docs content. + +- [ ] **Step 1: Create the feature branch from `main`** + + Run: `git switch -c docs/operator-pages` + Expected: `Switched to a new branch 'docs/operator-pages'`. + +- [ ] **Step 2: Update `planning/README.md`** + + Replace the `## Active` block. Current state (post-#52): + + ```markdown + ## Active + + _None._ + ``` + + Becomes: + + ```markdown + ## Active + + - **[operator-pages](active/2026-06-11-operator-pages-design.md)** + — Three new pages under a new `docs/operations/` section: + Production checklist, Troubleshooting playbook, Alembic + migrations. The B follow-on from #50. + ``` + +- [ ] **Step 3: Commit** + + ```bash + git add planning/active/2026-06-11-operator-pages-design.md \ + planning/active/2026-06-11-operator-pages-plan.md \ + planning/README.md + git commit -m "docs: spec + plan for operator pages + + Co-Authored-By: Claude Opus 4.7 (1M context) " + ``` + +--- + +### Task 2: Alembic autogenerate spike → update spec + +**Files:** +- Modify: `planning/active/2026-06-11-operator-pages-design.md` (paste captured output into §4a) + +The spec §4a calls for the **literal `alembic revision --autogenerate` +output** against `make_outbox_table()`. Capture it now so Task 6 has a +verbatim sample to paste, and so the spec stays the canonical reference +for what autogenerate produces. + +The spike can be done two ways; pick whichever you find faster. + +**Option A — programmatic.** Write a one-off Python script under +`/tmp/` that uses Alembic's autogenerate APIs (the same APIs +`broker.validate_schema()` already uses internally — see +`src/faststream_outbox/_schema_validation.py`). Construct a `MetaData` +holding `make_outbox_table(metadata, table_name="outbox")`, attach to a +running Postgres connection, call `produce_migrations` / +`render_python_code`, print the rendered upgrade ops. + +**Option B — actual Alembic env.** Set up a minimal Alembic project +under `/tmp/alembic-spike/`: `alembic.ini`, `env.py` that imports +`make_outbox_table` and sets `target_metadata`, run +`alembic revision --autogenerate -m "initial"` against an empty +Postgres, open the generated migration file. + +Either way, Postgres must be running. The repo's `compose.yml` already +provides one: `docker compose up -d postgres` (port 5432). + +- [ ] **Step 1: Start Postgres if not running** + + Run: `docker compose up -d postgres` + Expected: container ready; `pg_isready` succeeds on `localhost:5432`. + +- [ ] **Step 2: Run the spike (Option A or B)** + + Capture the rendered Python upgrade ops (`op.create_table(...)`, + `op.create_index(...)` calls) plus their column lists and index + predicates. + +- [ ] **Step 3: Paste into spec §4a** + + In `planning/active/2026-06-11-operator-pages-design.md` §4a + "Initial migration", under a new "Captured autogenerate output" + subsection, paste the literal rendered Python verbatim inside a + ` ```python ` block. Add a one-line preamble noting the + SQLAlchemy + Alembic versions used (read from + `uv pip list | grep -E "sqlalchemy|alembic"`). + +- [ ] **Step 4: Verify the partial-index predicates match the spec's + prediction** + + The spec predicts three partial indexes: + - `(queue, next_attempt_at) WHERE acquired_token IS NULL` + - `(queue, acquired_at) WHERE acquired_token IS NOT NULL` + - unique `(queue, timer_id) WHERE timer_id IS NOT NULL` + + If the captured output differs (extra index, different predicate), + STOP and flag — the spec's §4a section needs updating before Task 6 + can write the page truthfully. + +- [ ] **Step 5: Commit** + + ```bash + git add planning/active/2026-06-11-operator-pages-design.md + git commit -m "spec: capture alembic autogenerate output for operator-pages + + Co-Authored-By: Claude Opus 4.7 (1M context) " + ``` + +--- + +### Task 3: New Operations nav section + three stub pages + +**Files:** +- Create: `docs/operations/checklist.md` (stub) +- Create: `docs/operations/troubleshooting.md` (stub) +- Create: `docs/operations/alembic.md` (stub) +- Modify: `mkdocs.yml` + +Land the new directory, three stub files (H1 only), and the new +nav section all at once. `mkdocs build --strict` succeeds because +each file in the nav exists; the stubs render as nearly-empty pages +but that is fine until Tasks 4–6 fill them in. + +- [ ] **Step 1: Create three stub files** + + Each contains nothing but the H1 expected for the page: + + - `docs/operations/checklist.md` → `# Production checklist` + - `docs/operations/troubleshooting.md` → `# Troubleshooting` + - `docs/operations/alembic.md` → `# Alembic migrations` + +- [ ] **Step 2: Add the Operations nav section to `mkdocs.yml`** + + Append after the existing `Reference:` block (per spec §1): + + ```yaml + - Operations: + - Production checklist: operations/checklist.md + - Troubleshooting: operations/troubleshooting.md + - Alembic migrations: operations/alembic.md + ``` + +- [ ] **Step 3: Smoke-build** + + Run: `just docs-build` + Expected: clean. Five sections in sidebar, all three new pages + present (as near-empty stubs). + +- [ ] **Step 4: Commit** + + ```bash + git add docs/operations/ mkdocs.yml + git commit -m "docs: scaffold Operations nav section and three operator pages + + Co-Authored-By: Claude Opus 4.7 (1M context) " + ``` + +--- + +### Task 4: Production checklist content + +**Files:** +- Modify: `docs/operations/checklist.md` + +Write the full checklist per [spec §2 +](./2026-06-11-operator-pages-design.md#2-production-checklist-docsoperationschecklistmd). +Six sections in this exact order: Sizing → Subscribers → DLQ → Drain & +lifecycle → Schema → Observability. Each item is a checkbox bullet +(`- [ ] **...**`) of one-to-two lines plus a relative link into the +existing reference page that owns the underlying detail. + +The spec § lists all sixteen items verbatim; transcribe them rather +than inventing new ones. The page's purpose is to surface existing +references, not to re-document them. + +- [ ] **Step 1: Write `docs/operations/checklist.md`** + + Header structure: + + ```markdown + # Production checklist + + Scannable scaffold of pre-launch checks. Each item is one to two + lines; the link points at the existing reference page that owns the + full story. + + ## Sizing + ... + ## Subscribers + ... + ## DLQ + ... + ## Drain & lifecycle + ... + ## Schema + ... + ## Observability + ... + ``` + +- [ ] **Step 2: Smoke-build** + + Run: `just docs-build` + Expected: clean. All `../usage/...` relative links resolve. + +- [ ] **Step 3: Commit** + + ```bash + git add docs/operations/checklist.md + git commit -m "docs: production checklist content + + Co-Authored-By: Claude Opus 4.7 (1M context) " + ``` + +--- + +### Task 5: Troubleshooting playbook content + +**Files:** +- Modify: `docs/operations/troubleshooting.md` + +Write the full playbook per [spec §3 +](./2026-06-11-operator-pages-design.md#3-troubleshooting-docsoperationstroubleshootingmd). +Eleven symptoms in this order: + +1. `event=lease_lost` recurring in logs +2. Outbox row count grows + `lease_lost` spike +3. Outbox row count grows, no `lease_lost` +4. Idle dispatch latency > `max_fetch_interval` +5. Subscriber blocks at `broker.start()` +6. Duplicate handler invocations +7. Rolling deploy leaks rows +8. `activate_in` / `activate_at` fires immediately in tests +9. `AckPolicy.ACK_FIRST` raises `ValueError` at registration +10. `OutboxResponse(...)` + foreign-publisher decorator gets nacked +11. `validate_schema()` raises `ImportError` + +- [ ] **Step 1: Write the TOC table at the top** + + Two-column table (Symptom | Likely cause) per spec §3. Each row is + also an anchor link to the `##` heading below. + +- [ ] **Step 2: Write each `##` subsection** + + Use the five-field shape per spec (Symptom / Likely cause / + Diagnose / Fix / Reference). The spec's worked example for + `event=lease_lost` is the canonical template; match its tone and + field order for the other ten. + + Reference fields link into the existing reference pages — keep + links relative (`../usage/...`, `../introduction/...`). + +- [ ] **Step 3: Smoke-build** + + Run: `just docs-build` + Expected: clean. TOC table anchors resolve to the `##` headings + below. + +- [ ] **Step 4: Commit** + + ```bash + git add docs/operations/troubleshooting.md + git commit -m "docs: troubleshooting playbook content + + Co-Authored-By: Claude Opus 4.7 (1M context) " + ``` + +--- + +### Task 6: Alembic migrations page content + +**Files:** +- Modify: `docs/operations/alembic.md` + +Write the full page per [spec §4 +](./2026-06-11-operator-pages-design.md#4-alembic-migrations-docsoperationsalembicmd). +Four sections: + +- §4a Initial migration — paste the **captured autogenerate output + from Task 2** verbatim, then annotate inline why each piece (table, + partial index, partial unique index, column type) exists. The + annotations explain that each partial index's predicate is + load-bearing for fetch performance. +- §4b Adding the DLQ after the fact — describe the additive nature + (only `create_table` + a single non-unique index; no `alter_table` + on the outbox), and show the analogous autogenerate output for + `make_dlq_table(metadata, table_name="outbox_dlq")` (re-run the + spike for this case during this task, or include a "run + autogenerate against your `MetaData` to see the exact output for + your DLQ table name" placeholder if a re-spike is impractical). +- §4c Drift detection in CI — show the small standalone + `validate_schema()` script from spec §4c verbatim. Explain why this + is opt-in for `/health` and belongs between `alembic upgrade head` + and the deploy step in CI. +- §4d DLQ retention via partition drop — walk through the Alembic + ops for converting the DLQ from a plain table to one partitioned + by `failed_at`, plus the monthly cron SQL for creating next + month's partition and dropping the oldest. + +- [ ] **Step 1: Write the page** + + Use level-2 headings for the four sub-sections so they appear in + the page TOC. Code blocks for Alembic op samples and the CI + script. + +- [ ] **Step 2: Re-run the autogenerate spike for `make_dlq_table`** + + Same mechanism as Task 2, but with `make_dlq_table(metadata, + table_name="outbox_dlq")` added to the metadata. Capture the + additional `create_table("outbox_dlq", ...)` + `create_index(...)` + ops. Paste into §4b. + + If §4b's autogenerate sample is impractical to capture for any + reason (Postgres unavailable, etc.), STOP and flag — the page + needs the verbatim sample to be useful. + +- [ ] **Step 3: Smoke-build** + + Run: `just docs-build` + Expected: clean. + +- [ ] **Step 4: Commit** + + ```bash + git add docs/operations/alembic.md + git commit -m "docs: alembic migrations page content + + Co-Authored-By: Claude Opus 4.7 (1M context) " + ``` + +--- + +### Task 7: Index decision-tree row + five cross-link callouts + +**Files:** +- Modify: `docs/index.md` +- Modify: `docs/usage/subscriber.md` +- Modify: `docs/usage/dlq.md` +- Modify: `docs/usage/schema-validation.md` + +Add the one new row to the landing page's decision-tree table per +spec §1, and the five one-line "see also" callouts per spec §5. + +- [ ] **Step 1: Add the decision-tree row in `docs/index.md`** + + Insert before the existing "Install and write the first publisher / + subscriber" row: + + ```markdown + | Deploy to production safely | [Production checklist](operations/checklist.md) | + ``` + +- [ ] **Step 2: Add the five callouts** per the spec §5 table: + + | Existing page | Section | Callout text | + |---|---|---| + | `docs/usage/subscriber.md` | Connection budget (end) | `_Operator-side: [Production checklist § Sizing](../operations/checklist.md#sizing)._` | + | `docs/usage/subscriber.md` | Slow handlers — dedicated queue (end) | `_See also [Troubleshooting § event=lease_lost](../operations/troubleshooting.md#event-lease_lost-recurring-in-logs)._` | + | `docs/usage/dlq.md` | Metric: dlq_written (end) | `_Operator playbook: [Production checklist § DLQ](../operations/checklist.md#dlq)._` | + | `docs/usage/dlq.md` | Retention (end) | `_Step-by-step: [Alembic migrations § DLQ retention via partition drop](../operations/alembic.md#dlq-retention-via-partition-drop)._` | + | `docs/usage/schema-validation.md` | Where to call it (end) | `_CI recipe: [Alembic migrations § Drift detection in CI](../operations/alembic.md#drift-detection-in-ci)._` | + + All callouts use italics + en-dash style for consistency with the + Comparison callouts the `docs-landing-and-comparison` PR (#50) + landed. + +- [ ] **Step 3: Smoke-build** + + Run: `just docs-build` + Expected: clean. All five callout cross-links resolve to anchors + inside the new operator pages (Tasks 4–6). + +- [ ] **Step 4: Commit** + + ```bash + git add docs/index.md docs/usage/subscriber.md docs/usage/dlq.md docs/usage/schema-validation.md + git commit -m "docs: cross-link callouts from existing pages into operator pages + + Co-Authored-By: Claude Opus 4.7 (1M context) " + ``` + +--- + +### Task 8: Verify + +**Files:** none modified; no commit produced. + +- [ ] **Step 1: Full strict build** + + Run: `just docs-build` + Expected: clean. + +- [ ] **Step 2: Lint pass** + + Run: `just lint` + Expected: `eof-fixer`, `ruff format`, `ruff check`, `ty check` + all pass. Markdown EOF + YAML formatting on `mkdocs.yml` are the + only things touched in this PR. + +- [ ] **Step 3: Manual sidebar scan** + + Run: `just docs-serve` + Open the served site. Confirm: + + - Sidebar shows five sections in order: Overview → Getting + started → Concepts → Guides → Reference → Operations. + - Operations contains three pages in order: Production checklist, + Troubleshooting, Alembic migrations. + - The decision-tree table on the landing page has the new + "Deploy to production safely" row pointing at the checklist. + - The Comparison page (from #50) is still present and reachable + under Concepts. + - All eleven Troubleshooting TOC table entries scroll-to the + matching `##` heading on click. + - All sixteen Production-checklist items render with working + relative links. + - The Alembic page's autogenerate code blocks render verbatim with + the right partial-index predicates (matches what Task 2 captured + in the spec). + +- [ ] **Step 4: Open the PR** + + Stop. Hand off to `superpowers:requesting-code-review` / + `superpowers:finishing-a-development-branch` per the standard + workflow. + + On merge, both halves of the pair move to `planning/archived/` and + get `status: shipped`, `pr:`, and `outcome:` filled — same + archive-PR pattern that #52 dogfooded.