
feat(embedder): SQLite per-utterance embedding cache #340

Draft
voorhs wants to merge 10 commits into dev from worktree-sqlite-embedding-cache
voorhs commented Jun 25, 2026

Collaborator

Summary

Replaces the .npy-file-per-call embedding cache with a single SQLite database keyed per utterance, and lifts the triplicated cache code out of the three backends into one template method on BaseEmbeddingBackend.

Why: the old cache keyed on hash(model + entire_utterance_list + prompt), so any change to a list (reorder/add/drop one item) was a full miss, and every distinct list became its own .npy file (inode growth, no atomicity, no concurrency story). Per-utterance keying stores each (model, utterance, prompt) once — overlapping calls now reuse the overlap.
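To illustrate the difference between the two keying schemes, here is a minimal sketch. It is not the code in this PR: `list_key` is a hypothetical stand-in for the old whole-list hash, and the body of `utterance_key` shown here (SHA-256 over length-prefixed fields) is an assumption about how such a key could be built, not the actual implementation.

```python
from __future__ import annotations

import hashlib


def list_key(model: str, utterances: list[str], prompt: str | None) -> str:
    """Hypothetical old-style key: one hash for the whole call."""
    payload = "\x1f".join([model, *utterances, prompt or ""])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def utterance_key(model: str, utterance: str, prompt: str | None) -> str:
    """Assumed new-style key: one hash per (model, utterance, prompt)."""
    h = hashlib.sha256()
    for part in (model, utterance, prompt or ""):
        data = part.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


first = ["x", "y"]
second = ["y", "z"]

# Whole-list keys: the two calls share nothing, so the second is a full miss.
old_keys_differ = list_key("m", first, None) != list_key("m", second, None)

# Per-utterance keys: the calls overlap on "y", so only "z" is new.
keys_first = {utterance_key("m", u, None) for u in first}
keys_second = {utterance_key("m", u, None) for u in second}
shared = keys_first & keys_second
```

Under this sketch the two calls produce three distinct per-utterance keys in total, one of which is shared between them.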

Honest scoping: SQLite does not make warm cache hits faster (they were already sub-ms). The wins are correctness (atomic writes), operability (one file, concurrency, eviction groundwork), and enabling per-utterance keys without an inode explosion.

Scope (decided with maintainer up front)

  • Embedding cache only — the structured-output / LLM cache (generation/_cache.py) is untouched.
  • Per-utterance keying — one row per (model, utterance, prompt), float32 BLOB.
  • Eviction = groundwork only: created_at / last_accessed / size_bytes / model_hash columns + indexes; cache stays unbounded by default (no behavior change).
  • Fresh start — old .npy caches are left as orphans (true migration is infeasible; old keys are list-hashes). One-time recompute on first run.
  • AUTOINTENT_CACHE_DIR env var to relocate the cache (defaults to the OS cache dir).
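The row layout in the scope list could look roughly like the sketch below. The table name `embeddings`, the `key` and `embedding` column names, the index names, and the `put` / `get` helpers are all hypothetical. Only the four metadata column names come from this description. The standard-library `array` module stands in for NumPy float32 arrays so that the sketch has no dependencies.

```python
import sqlite3
import time
from array import array

# Hypothetical schema: one row per cached vector, plus eviction metadata.
SCHEMA = """
CREATE TABLE IF NOT EXISTS embeddings (
    key           TEXT PRIMARY KEY,
    model_hash    TEXT NOT NULL,
    embedding     BLOB NOT NULL,
    size_bytes    INTEGER NOT NULL,
    created_at    REAL NOT NULL,
    last_accessed REAL NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_last_accessed ON embeddings (last_accessed);
CREATE INDEX IF NOT EXISTS idx_model_hash ON embeddings (model_hash);
"""


def to_blob(vector):
    # Pack floats as 32-bit values (4 bytes each on common platforms).
    return array("f", vector).tobytes()


def from_blob(blob):
    out = array("f")
    out.frombytes(blob)
    return out.tolist()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)


def put(key, model_hash, vector):
    blob = to_blob(vector)
    now = time.time()
    # INSERT OR IGNORE keeps the first row written for a key.
    conn.execute(
        "INSERT OR IGNORE INTO embeddings "
        "(key, model_hash, embedding, size_bytes, created_at, last_accessed) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (key, model_hash, blob, len(blob), now, now),
    )
    conn.commit()


def get(key, model_hash):
    # Filtering on model_hash turns a cross-model key collision into a miss.
    row = conn.execute(
        "SELECT embedding FROM embeddings WHERE key = ? AND model_hash = ?",
        (key, model_hash),
    ).fetchone()
    return None if row is None else from_blob(row[0])


put("k1", "model-a", [0.5, -1.25, 2.0])
put("k1", "model-a", [9.0, 9.0, 9.0])  # ignored: the key already exists
```

In this sketch, a read with a different `model_hash` returns `None`, so the caller recomputes instead of getting another model's vector.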

What changed

  • New autointent/_cache_dir.py: get_cache_dir() (env var + appdirs fallback).
  • New autointent/_wrappers/embedder/_sqlite_cache.py: SQLiteEmbeddingCache (WAL, busy_timeout, BEGIN IMMEDIATE schema init with post-lock version re-read, INSERT OR IGNORE, model_hash read filter, chunked IN (...), graceful degradation to recompute on any cache I/O error), utterance_key, get_embedding_cache.
  • Refactor BaseEmbeddingBackend.embed → concrete template (split hits/misses, dedup, reassemble in order); backends implement _embed_uncached. HashingVectorizer opts out via supports_cache = False (its 262k-dim vectors would be ~1 MB BLOBs, and recompute is cheap).
  • Removed embedder/utils.py (get_embeddings_path).
  • Tests: pure-Python unit tests for the store + a global cache-dir isolation fixture; integration tests proving per-utterance reuse (DB has exactly 3 rows after ["x","y"] then ["y","z"]), dedup, order preservation, and empty-input behavior.
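The template-method flow (split hits/misses, dedup, reassemble in order) can be sketched as follows. This simplifies the refactor and is not the actual `BaseEmbeddingBackend`. `SketchBackend`, `InMemoryCache`, and `CountingBackend` are hypothetical names, and the cache is a plain dict keyed on the utterance alone, with model and prompt assumed fixed. Only `_embed_uncached`, `embed`, and `supports_cache` are names taken from this description.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class InMemoryCache:
    """Hypothetical stand-in for the SQLite store."""

    def __init__(self) -> None:
        self.rows: dict[str, list[float]] = {}

    def get_many(self, keys: list[str]) -> dict[str, list[float]]:
        return {k: self.rows[k] for k in keys if k in self.rows}

    def set_many(self, items: dict[str, list[float]]) -> None:
        for key, vector in items.items():
            self.rows.setdefault(key, vector)  # first write wins


class SketchBackend(ABC):
    supports_cache = True

    def __init__(self, cache: InMemoryCache) -> None:
        self._cache = cache

    @abstractmethod
    def _embed_uncached(self, utterances: list[str]) -> list[list[float]]:
        """Compute embeddings without touching the cache."""

    def embed(self, utterances: list[str]) -> list[list[float]]:
        if not utterances:
            return []
        if not self.supports_cache:
            return self._embed_uncached(utterances)
        unique = list(dict.fromkeys(utterances))  # dedup, keep first-seen order
        found = self._cache.get_many(unique)  # hits
        misses = [u for u in unique if u not in found]
        if misses:
            fresh = dict(zip(misses, self._embed_uncached(misses)))
            self._cache.set_many(fresh)
            found.update(fresh)
        # Reassemble in the caller's order, duplicates included.
        return [found[u] for u in utterances]


class CountingBackend(SketchBackend):
    """Toy backend that records which utterances it had to compute."""

    def __init__(self, cache: InMemoryCache) -> None:
        super().__init__(cache)
        self.computed: list[str] = []

    def _embed_uncached(self, utterances: list[str]) -> list[list[float]]:
        self.computed.extend(utterances)
        return [[float(len(u)), float(sum(map(ord, u)))] for u in utterances]


cache = InMemoryCache()
backend = CountingBackend(cache)
backend.embed(["x", "y"])
backend.embed(["y", "z"])  # only "z" is computed on this call
```

This mirrors the integration test described above: after the two calls the toy cache holds three rows, and the backend computed each utterance once.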

Process / provenance

This PR includes the design spec and implementation plan under docs/superpowers/ for review. Both went through adversarial review-to-convergence (spec: 3 rounds; plan: 1 round with empirical ruff/mypy validation), plus a final whole-diff review that caught and fixed one real bug (HashingVectorizer empty input raised StopIteration under sklearn ≥1.5 — now returns (0, n_features)).

Verification

  • Local: ruff check . and mypy src/autointent tests (strict, py3.10) both green.
  • Tests: per project convention, the full pytest suite is not run locally — it runs on CI for this PR. Please check the CI status here. The new cache unit tests are pure-Python (no model downloads); the integration tests use the pinned tiny ST model.

🤖 Generated with Claude Code

voorhs and others added 10 commits June 25, 2026 23:33
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…unds)

- correct ABC config typing: base union declaration + per-subclass narrowing
- HV stays uncached via supports_cache flag (avoids ~1MB BLOBs)
- cross-process schema-rebuild via BEGIN IMMEDIATE + post-lock re-read
- broaden degradation catch to (sqlite3.Error, OSError); str() model_hash
- cross-model collision defended via model_hash filter
- global test isolation fixture; fix CHANGELOG path; ruff/mypy specifics

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- invert double-checked lock (mypy unreachable)
- correct S608 noqa placement; annotate rows / np.ndarray -> npt.NDArray
- list per-file unused imports to drop (torch/TaskTypeEnum/Literal/overload)
- TYPE_CHECKING imports in new test files; unquoted conftest annotation
- add empty-set_many + index-presence tests; soften coverage claim

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e method

Move the triplicated .npy cache block out of the ST/OpenAI/vLLM backends into a
single BaseEmbeddingBackend.embed template backed by SQLiteEmbeddingCache. Backends
now implement _embed_uncached; HashingVectorizer opts out via supports_cache=False.
Removes the obsolete utils.get_embeddings_path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
sklearn's HashingVectorizer.transform([]) raises StopIteration (>=1.5); guard
empty input to return a (0, n_features) array instead, matching the regression
test and the spec's intent.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>