feat(embedder): SQLite per-utterance embedding cache by voorhs · Pull Request #340 · deeppavlov/AutoIntent

voorhs · 2026-06-25T22:05:14Z

Summary

Replaces the .npy-file-per-call embedding cache with a single SQLite database keyed per utterance, and lifts the triplicated cache code out of the three backends into one template method on BaseEmbeddingBackend.

Why: the old cache keyed on hash(model + entire_utterance_list + prompt), so any change to a list (reorder/add/drop one item) was a full miss, and every distinct list became its own .npy file (inode growth, no atomicity, no concurrency story). Per-utterance keying stores each (model, utterance, prompt) once — overlapping calls now reuse the overlap.
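The difference between the two keying schemes can be sketched in a few lines. This is only an illustration: `list_key` and `utterance_key_sketch` are hypothetical helpers, and the choice of SHA-256 and a NUL separator is an assumption, not necessarily what the PR's `utterance_key` does.

```python
import hashlib


def list_key(model: str, utterances: list[str], prompt: str) -> str:
    """Old-style key (illustrative): one hash over the whole utterance list."""
    payload = "\x00".join([model, prompt, *utterances])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def utterance_key_sketch(model: str, utterance: str, prompt: str) -> str:
    """Per-utterance key (illustrative): one hash per (model, utterance, prompt)."""
    payload = "\x00".join([model, prompt, utterance])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


model, prompt = "tiny-model", ""  # placeholder values
first, second = ["x", "y"], ["y", "z"]

# List-level keys: changing or reordering one item yields an unrelated key.
assert list_key(model, first, prompt) != list_key(model, second, prompt)
assert list_key(model, ["x", "y"], prompt) != list_key(model, ["y", "x"], prompt)

# Per-utterance keys: the shared utterance "y" maps to the same key in both calls.
keys_first = {utterance_key_sketch(model, u, prompt) for u in first}
keys_second = {utterance_key_sketch(model, u, prompt) for u in second}
shared = keys_first & keys_second
print(len(shared), len(keys_first | keys_second))  # 1 3
```

With list-level keys the second call is a full miss; with per-utterance keys only "z" has to be computed.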

Honest scoping: SQLite does not make warm cache hits faster (they were already sub-ms). The wins are correctness (atomic writes), operability (one file, concurrency, eviction groundwork), and enabling per-utterance keys without an inode explosion.

Scope (decided with maintainer up front)

✅ Embedding cache only — the structured-output / LLM cache (generation/_cache.py) is untouched.
✅ Per-utterance keying — one row per (model, utterance, prompt), float32 BLOB.
✅ Eviction = groundwork only — created_at / last_accessed / size_bytes / model_hash columns + indexes; cache stays unbounded by default (no behavior change).
✅ Fresh start — old .npy caches are left as orphans (true migration is infeasible; old keys are list-hashes). One-time recompute on first run.
✅ AUTOINTENT_CACHE_DIR env var to relocate the cache (defaults to the OS cache dir).
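A rough sketch of what a per-utterance row with eviction groundwork might look like. The column names `created_at`, `last_accessed`, `size_bytes`, and `model_hash` come from the scope list above; the table name, the `key` and `vector` columns, the column types, and the index names are assumptions made for illustration. The stdlib `array` module stands in for NumPy so the sketch has no third-party dependencies.

```python
import sqlite3
import time
from array import array

# Assumed layout: only the four eviction-related column names are from the PR text.
SCHEMA = """
CREATE TABLE IF NOT EXISTS embeddings (
    key           TEXT PRIMARY KEY,   -- hash of (model, utterance, prompt)
    model_hash    TEXT NOT NULL,
    vector        BLOB NOT NULL,      -- raw float32 bytes
    size_bytes    INTEGER NOT NULL,
    created_at    REAL NOT NULL,
    last_accessed REAL NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_embeddings_last_accessed ON embeddings (last_accessed);
CREATE INDEX IF NOT EXISTS idx_embeddings_model_hash ON embeddings (model_hash);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# array("f") holds 4-byte floats on common platforms, i.e. float32.
vector = array("f", [0.25, -1.5, 3.0])
blob = vector.tobytes()
now = time.time()
conn.execute(
    "INSERT OR IGNORE INTO embeddings VALUES (?, ?, ?, ?, ?, ?)",
    ("key-1", "model-a", blob, len(blob), now, now),
)

# Read the BLOB back and decode it into floats again.
row = conn.execute(
    "SELECT vector, size_bytes FROM embeddings WHERE key = ?", ("key-1",)
).fetchone()
restored = array("f")
restored.frombytes(row[0])
print(list(restored), row[1])  # [0.25, -1.5, 3.0] 12
```

Nothing in this sketch evicts rows; the extra columns and indexes only make a later size- or age-based policy possible, which matches the "groundwork only" scope.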

What changed

New autointent/_cache_dir.py — get_cache_dir() (env var + appdirs fallback).
New autointent/_wrappers/embedder/_sqlite_cache.py — SQLiteEmbeddingCache (WAL, busy_timeout, BEGIN IMMEDIATE schema init with post-lock version re-read, INSERT OR IGNORE, model_hash read filter, chunked IN (...), graceful degradation to recompute on any cache I/O error), utterance_key, get_embedding_cache.
Refactor BaseEmbeddingBackend.embed → concrete template (split hits/misses, dedup, reassemble in order); backends implement _embed_uncached. HashingVectorizer opts out via supports_cache = False (its 262k-dim vectors would be ~1 MB BLOBs, and recompute is cheap).
Removed embedder/utils.py (get_embeddings_path).
Tests: pure-Python unit tests for the store + a global cache-dir isolation fixture; integration tests proving per-utterance reuse (DB has exactly 3 rows after ["x","y"] then ["y","z"]), dedup, order preservation, and empty-input behavior.
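The template-method flow and the reuse behavior the integration tests check can be sketched as follows. This is a simplified stand-in, not the PR's code: `BackendSketch` and `CountingBackend` are hypothetical names, an in-memory dict replaces the SQLite store, and the cache key is the utterance alone (the real key also covers model and prompt).

```python
from abc import ABC, abstractmethod


class BackendSketch(ABC):
    """Illustrative base class: embed() is concrete, subclasses compute misses."""

    supports_cache = True

    def __init__(self) -> None:
        # Stand-in for the SQLite store; keyed by utterance only in this sketch.
        self._cache: dict[str, list[float]] = {}

    @abstractmethod
    def _embed_uncached(self, utterances: list[str]) -> list[list[float]]:
        """Compute embeddings for utterances that were not found in the cache."""

    def embed(self, utterances: list[str]) -> list[list[float]]:
        if not self.supports_cache:
            return self._embed_uncached(utterances)
        # Dedup while preserving first-seen order.
        unique = list(dict.fromkeys(utterances))
        # Split into hits and misses; only misses reach the backend.
        misses = [u for u in unique if u not in self._cache]
        if misses:
            for utterance, vector in zip(misses, self._embed_uncached(misses)):
                self._cache[utterance] = vector
        # Reassemble in the caller's original order, duplicates included.
        return [self._cache[u] for u in utterances]


class CountingBackend(BackendSketch):
    """Toy backend that records which utterances it actually had to compute."""

    def __init__(self) -> None:
        super().__init__()
        self.computed: list[str] = []

    def _embed_uncached(self, utterances: list[str]) -> list[list[float]]:
        self.computed.extend(utterances)
        return [[float(len(u)), float(sum(map(ord, u)))] for u in utterances]


backend = CountingBackend()
backend.embed(["x", "y"])
out = backend.embed(["y", "z", "y"])
print(backend.computed)     # ['x', 'y', 'z']
print(len(backend._cache))  # 3
```

The second call computes only "z", the duplicate "y" is embedded once, and the output keeps the input order, mirroring the three-row expectation in the tests.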

Process / provenance

This PR includes the design spec and implementation plan under docs/superpowers/ for review. Both went through adversarial review-to-convergence (spec: 3 rounds; plan: 1 round with empirical ruff/mypy validation), plus a final whole-diff review that caught and fixed one real bug (HashingVectorizer empty input raised StopIteration under sklearn ≥1.5 — now returns (0, n_features)).

Verification

Local: ruff check . and mypy src/autointent tests (strict, py3.10) both green.
Tests: per project convention, the full pytest suite is not run locally; it runs on CI for this PR. Please check the CI status on this PR. The new cache unit tests are pure-Python (no model downloads); the integration tests use the pinned tiny ST model.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…unds)

- correct ABC config typing: base union declaration + per-subclass narrowing
- HV stays uncached via supports_cache flag (avoids ~1MB BLOBs)
- cross-process schema-rebuild via BEGIN IMMEDIATE + post-lock re-read
- broaden degradation catch to (sqlite3.Error, OSError); str() model_hash
- cross-model collision defended via model_hash filter
- global test isolation fixture; fix CHANGELOG path; ruff/mypy specifics

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
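The store behaviors named in this commit (BEGIN IMMEDIATE schema init with a post-lock re-read, the model_hash read filter, and degradation on `(sqlite3.Error, OSError)`) might look roughly like the sketch below. `TinyEmbeddingCache`, the use of `PRAGMA user_version` for the schema version, the chunk size, and the three-column table are all assumptions for illustration; the actual `SQLiteEmbeddingCache` may differ.

```python
import sqlite3
from array import array

SCHEMA_VERSION = 1  # hypothetical version number
CHUNK = 500  # keep IN (...) lists small; older SQLite builds cap bound parameters at 999


class TinyEmbeddingCache:
    """Illustrative store: any cache I/O error degrades to a miss."""

    def __init__(self, path: str) -> None:
        # isolation_level=None means autocommit, so BEGIN/COMMIT are issued explicitly.
        self.conn = sqlite3.connect(path, timeout=5.0, isolation_level=None)
        self.conn.execute("PRAGMA journal_mode=WAL")
        self.conn.execute("PRAGMA busy_timeout=5000")
        self._init_schema()

    def _init_schema(self) -> None:
        self.conn.execute("BEGIN IMMEDIATE")  # take the write lock first
        try:
            # Re-read the version after acquiring the lock: another process
            # may have rebuilt the schema while this one was waiting.
            version = self.conn.execute("PRAGMA user_version").fetchone()[0]
            if version != SCHEMA_VERSION:
                self.conn.execute("DROP TABLE IF EXISTS embeddings")
                self.conn.execute(
                    "CREATE TABLE embeddings ("
                    "key TEXT PRIMARY KEY, model_hash TEXT NOT NULL, vector BLOB NOT NULL)"
                )
                self.conn.execute(f"PRAGMA user_version={SCHEMA_VERSION}")
            self.conn.execute("COMMIT")
        except sqlite3.Error:
            self.conn.execute("ROLLBACK")
            raise

    def get_many(self, keys: list[str], model_hash: str) -> dict[str, list[float]]:
        found: dict[str, list[float]] = {}
        try:
            for start in range(0, len(keys), CHUNK):
                chunk = keys[start : start + CHUNK]
                marks = ",".join("?" * len(chunk))
                rows = self.conn.execute(
                    "SELECT key, vector FROM embeddings "
                    f"WHERE model_hash = ? AND key IN ({marks})",
                    [model_hash, *chunk],
                )
                for key, blob in rows:
                    vec = array("f")
                    vec.frombytes(blob)
                    found[key] = vec.tolist()
        except (sqlite3.Error, OSError):
            return {}  # degrade: the caller recomputes everything
        return found

    def set_many(self, items: dict[str, list[float]], model_hash: str) -> None:
        rows = [(k, model_hash, array("f", v).tobytes()) for k, v in items.items()]
        try:
            self.conn.executemany("INSERT OR IGNORE INTO embeddings VALUES (?, ?, ?)", rows)
        except (sqlite3.Error, OSError):
            pass  # a failed write only costs a future recompute


cache = TinyEmbeddingCache(":memory:")
cache.set_many({"k1": [1.0, 2.0], "k2": [0.5, 0.5]}, model_hash="m1")
cache.set_many({"k1": [9.0, 9.0]}, model_hash="m1")  # ignored: key already stored
hits = cache.get_many(["k1", "k2", "k3"], model_hash="m1")
other = cache.get_many(["k1"], model_hash="other")  # filtered out by model_hash
print(sorted(hits), other)  # ['k1', 'k2'] {}
```

Swallowing write errors is a deliberate trade-off in this sketch: the cache is an optimization, so a failure should cost a recompute rather than fail the embedding call.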

- invert double-checked lock (mypy unreachable)
- correct S608 noqa placement; annotate rows / np.ndarray -> npt.NDArray
- list per-file unused imports to drop (torch/TaskTypeEnum/Literal/overload)
- TYPE_CHECKING imports in new test files; unquoted conftest annotation
- add empty-set_many + index-presence tests; soften coverage claim

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…e method

Move the triplicated .npy cache block out of the ST/OpenAI/vLLM backends into a single BaseEmbeddingBackend.embed template backed by SQLiteEmbeddingCache. Backends now implement _embed_uncached; HashingVectorizer opts out via supports_cache=False. Removes the obsolete utils.get_embeddings_path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

sklearn's HashingVectorizer.transform([]) raises StopIteration (>=1.5); guard empty input to return a (0, n_features) array instead, matching the regression test and the spec's intent.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

voorhs and others added 10 commits June 25, 2026 23:33

docs(spec): SQLite per-utterance embedding cache design

9ef81d7

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

docs(plan): SQLite embedding cache implementation plan

b34e86e

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

feat(cache): add get_cache_dir() + global embedding-cache test isolation

0d89cdb

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

feat(cache): add SQLiteEmbeddingCache per-utterance store

c399f9c

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

test(cache): cover per-utterance reuse, dedup, order, and empty input

9169c3d

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

docs(changelog): note SQLite embedding cache and AUTOINTENT_CACHE_DIR

77482e8

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

voorhs wants to merge 10 commits into dev from worktree-sqlite-embedding-cache
