feat(embedder): SQLite per-utterance embedding cache #340
Draft
voorhs wants to merge 10 commits into
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…unds)
- correct ABC config typing: base union declaration + per-subclass narrowing
- HV stays uncached via supports_cache flag (avoids ~1MB BLOBs)
- cross-process schema-rebuild via BEGIN IMMEDIATE + post-lock re-read
- broaden degradation catch to (sqlite3.Error, OSError); str() model_hash
- cross-model collision defended via model_hash filter
- global test isolation fixture; fix CHANGELOG path; ruff/mypy specifics

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
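The "BEGIN IMMEDIATE + post-lock re-read" item above can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the PR's code: the function name init_schema, the table layout, and the use of PRAGMA user_version to hold the schema version are all hypothetical choices made for the example.

```python
import sqlite3

SCHEMA_VERSION = 1  # hypothetical version number for this sketch


def init_schema(path: str) -> sqlite3.Connection:
    """Open the cache DB and (re)build the schema at most once across processes."""
    # isolation_level=None puts the connection in autocommit mode, so the
    # BEGIN / COMMIT / ROLLBACK statements below are issued explicitly.
    conn = sqlite3.connect(path, timeout=5.0, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA busy_timeout=5000")

    # Cheap unlocked read: skip the write lock when the schema is current.
    if conn.execute("PRAGMA user_version").fetchone()[0] == SCHEMA_VERSION:
        return conn

    conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
    try:
        # Re-read after acquiring the lock: another process may have rebuilt
        # the schema while this one was waiting for it.
        version = conn.execute("PRAGMA user_version").fetchone()[0]
        if version != SCHEMA_VERSION:
            conn.execute("DROP TABLE IF EXISTS embeddings")
            conn.execute(
                "CREATE TABLE embeddings ("
                " key TEXT PRIMARY KEY,"
                " model_hash TEXT NOT NULL,"
                " vector BLOB NOT NULL)"
            )
            conn.execute(f"PRAGMA user_version = {SCHEMA_VERSION}")
        conn.execute("COMMIT")
    except BaseException:
        conn.execute("ROLLBACK")
        raise
    return conn
```

The second read of the version is what keeps a late-arriving process from dropping a table that another process has just rebuilt and started filling.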
- invert double-checked lock (mypy unreachable)
- correct S608 noqa placement; annotate rows / np.ndarray -> npt.NDArray
- list per-file unused imports to drop (torch/TaskTypeEnum/Literal/overload)
- TYPE_CHECKING imports in new test files; unquoted conftest annotation
- add empty-set_many + index-presence tests; soften coverage claim

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e method

Move the triplicated .npy cache block out of the ST/OpenAI/vLLM backends into a single BaseEmbeddingBackend.embed template backed by SQLiteEmbeddingCache. Backends now implement _embed_uncached; HashingVectorizer opts out via supports_cache=False. Removes the obsolete utils.get_embeddings_path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
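The template-method shape described in this commit can be sketched roughly as follows. It is a minimal illustration, not the PR's code: BaseBackend and CountingBackend are hypothetical names, a plain dict stands in for SQLiteEmbeddingCache, and the key is the utterance alone rather than (model, utterance, prompt).

```python
from abc import ABC, abstractmethod


class BaseBackend(ABC):
    """Hypothetical stand-in for BaseEmbeddingBackend; a dict replaces SQLite."""

    supports_cache = True  # subclasses set this to False to opt out

    def __init__(self) -> None:
        self._cache: dict[str, list[float]] = {}

    @abstractmethod
    def _embed_uncached(self, utterances: list[str]) -> list[list[float]]:
        """Compute embeddings with no caching, one vector per input."""

    def embed(self, utterances: list[str]) -> list[list[float]]:
        if not self.supports_cache:
            return self._embed_uncached(utterances)
        # Split hits from misses; dict.fromkeys deduplicates the misses
        # while keeping first-seen order.
        misses = list(dict.fromkeys(u for u in utterances if u not in self._cache))
        if misses:
            for text, vec in zip(misses, self._embed_uncached(misses)):
                self._cache[text] = vec
        # Reassemble in the caller's original order, duplicates included.
        return [self._cache[u] for u in utterances]


class CountingBackend(BaseBackend):
    """Toy backend that records which utterances were actually computed."""

    def __init__(self) -> None:
        super().__init__()
        self.computed: list[str] = []

    def _embed_uncached(self, utterances: list[str]) -> list[list[float]]:
        self.computed.extend(utterances)
        return [[float(len(u))] for u in utterances]
```

With this toy backend, calling embed(["x", "yy", "x"]) and then embed(["yy", "zzz"]) computes each distinct utterance once: the second call only sends "zzz" to _embed_uncached.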
sklearn's HashingVectorizer.transform([]) raises StopIteration (>=1.5); guard empty input to return a (0, n_features) array instead, matching the regression test and the spec's intent.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Replaces the .npy-file-per-call embedding cache with a single SQLite database keyed per utterance, and lifts the triplicated cache code out of the three backends into one template method on BaseEmbeddingBackend.

Why: the old cache keyed on hash(model + entire_utterance_list + prompt), so any change to a list (reorder/add/drop one item) was a full miss, and every distinct list became its own .npy file (inode growth, no atomicity, no concurrency story). Per-utterance keying stores each (model, utterance, prompt) once, so overlapping calls now reuse the overlap.

Scope (decided with maintainer up front)
- … generation/_cache.py) is untouched.
- (model, utterance, prompt), float32 BLOB.
- created_at/last_accessed/size_bytes/model_hash columns + indexes; cache stays unbounded by default (no behavior change).
- .npy caches are left as orphans (true migration is infeasible; old keys are list-hashes). One-time recompute on first run.
- AUTOINTENT_CACHE_DIR env var to relocate the cache (defaults to the OS cache dir).

What changed
- autointent/_cache_dir.py: get_cache_dir() (env var + appdirs fallback).
- autointent/_wrappers/embedder/_sqlite_cache.py: SQLiteEmbeddingCache (WAL, busy_timeout, BEGIN IMMEDIATE schema init with post-lock version re-read, INSERT OR IGNORE, model_hash read filter, chunked IN (...), graceful degradation to recompute on any cache I/O error), utterance_key, get_embedding_cache.
- BaseEmbeddingBackend.embed → concrete template (split hits/misses, dedup, reassemble in order); backends implement _embed_uncached. HashingVectorizer opts out via supports_cache = False (its 262k-dim vectors would be ~1 MB BLOBs, and recompute is cheap).
- … embedder/utils.py (get_embeddings_path).
- … ["x","y"] then ["y","z"]), dedup, order preservation, and empty-input behavior.

Process / provenance
This PR includes the design spec and implementation plan under docs/superpowers/ for review. Both went through adversarial review-to-convergence (spec: 3 rounds; plan: 1 round with empirical ruff/mypy validation), plus a final whole-diff review that caught and fixed one real bug (HashingVectorizer empty input raised StopIteration under sklearn ≥1.5; it now returns (0, n_features)).

Verification
ruff check . and mypy src/autointent tests (strict, py3.10) both green.

🤖 Generated with Claude Code
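The per-utterance keying from the Summary, together with the INSERT OR IGNORE, model_hash read filter, and chunked IN (...) items under "What changed", can be sketched as below. This is an illustration under assumptions, not the PR's implementation: TinyCache, the table layout, the SHA-256 key derivation, and the chunk size are hypothetical, and the stdlib array module stands in for NumPy when packing float32 BLOBs.

```python
import hashlib
import sqlite3
from array import array


def utterance_key(model_hash: str, utterance: str, prompt: str = "") -> str:
    """Hypothetical key: one digest per (model, utterance, prompt) triple."""
    h = hashlib.sha256()
    for part in (model_hash, prompt, utterance):
        data = part.encode("utf-8")
        h.update(len(data).to_bytes(8, "big"))  # length prefix avoids ambiguous joins
        h.update(data)
    return h.hexdigest()


class TinyCache:
    """Toy per-utterance cache; vectors are stored as float32 BLOBs."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS embeddings ("
            " key TEXT PRIMARY KEY, model_hash TEXT NOT NULL, vector BLOB NOT NULL)"
        )

    def set_many(self, model_hash: str, prompt: str, items: dict[str, list[float]]) -> None:
        rows = [
            (utterance_key(model_hash, text, prompt), model_hash, array("f", vec).tobytes())
            for text, vec in items.items()
        ]
        with self.conn:  # commits on success, rolls back on error
            # INSERT OR IGNORE: an existing row wins, so concurrent writers are harmless.
            self.conn.executemany("INSERT OR IGNORE INTO embeddings VALUES (?, ?, ?)", rows)

    def get_many(
        self, model_hash: str, prompt: str, utterances: list[str], chunk: int = 500
    ) -> dict[str, list[float]]:
        by_key = {utterance_key(model_hash, u, prompt): u for u in utterances}
        keys = list(by_key)
        found: dict[str, list[float]] = {}
        # Chunk the IN (...) list to stay under SQLite's bound-parameter limit.
        for start in range(0, len(keys), chunk):
            part = keys[start:start + chunk]
            marks = ",".join("?" * len(part))
            sql = (
                "SELECT key, vector FROM embeddings "
                f"WHERE model_hash = ? AND key IN ({marks})"
            )
            for key, blob in self.conn.execute(sql, [model_hash, *part]):
                vec = array("f")
                vec.frombytes(blob)
                found[by_key[key]] = vec.tolist()
        return found
```

Storing ["x", "y"] and then asking for ["y", "z"] returns only the entry for "y", which is the overlap reuse the old list-hash key could not provide; the caller recomputes "z" alone.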