[RNE Rewrite] Add text and image embeddings pipelines#1292
Open
msluszniak wants to merge 7 commits into
Add int64/Long tensor dtype support and text/image embeddings tasks, hooks, and model registry entries, plus an interactive text-embeddings demo screen in apps/nlp. Closes #1247
model.execute now validates dynamically-shaped forward inputs against the model-declared [min, max, step] bounds exposed by an optional get_dynamic_dims method, instead of requiring an exact shape match; models without it keep exact per-dimension validation. Text embeddings feed the exact token length with no padding, which fixes scale-sensitive pooling heads (e.g. DistilUSE's tanh projection). Point DistilUSE at v0.10.0 (re-exported with get_dynamic_dims).
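The bounds check described above can be pictured roughly as follows. This is only a sketch in TypeScript (the real check lives in the C++ model.cpp); the `DimBounds` type and both function names are hypothetical. It also assumes that `step` means a valid size must equal `min` plus a whole number of steps, which this PR does not spell out.

```typescript
// Hypothetical per-dimension bounds; the actual tensor layout returned by
// get_dynamic_dims is not shown in this PR, so this shape is an assumption.
type DimBounds = { min: number; max: number; step: number };

// True when `size` lies in [min, max] and is reachable from `min` in whole
// steps (assumed semantics of `step`). A non-positive step never validates.
function dimInBounds(size: number, b: DimBounds): boolean {
  if (!Number.isInteger(size) || b.step < 1) return false;
  if (size < b.min || size > b.max) return false;
  return (size - b.min) % b.step === 0;
}

// With declared bounds, each dimension is range-checked; without them,
// the shape must match the declared shape exactly.
function validateShape(
  actual: number[],
  declared: number[],
  bounds?: DimBounds[],
): void {
  if (actual.length !== declared.length) {
    throw new Error(`rank mismatch: got ${actual.length}, expected ${declared.length}`);
  }
  if (bounds && bounds.length !== declared.length) {
    throw new Error("bounds row count does not match tensor rank");
  }
  for (let i = 0; i < actual.length; i++) {
    const ok = bounds ? dimInBounds(actual[i], bounds[i]) : actual[i] === declared[i];
    if (!ok) throw new Error(`dimension ${i} has invalid size ${actual[i]}`);
  }
}
```

Under these assumptions, a 7-token input of shape [1, 7] passes against a declared [1, 128] when the second dimension carries bounds such as { min: 1, max: 128, step: 1 }, and is rejected when no bounds are declared.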
msluszniak commented Jul 1, 2026
…mbeddings demo
- Simplify text-embeddings cosine to a dot product (all models L2-normalize) and drop redundant inline comments.
- Move the get_dynamic_dims / input-validation contract into the ModelHostObject class docs; trim the inline narration in model.cpp.
- Add an Image Embeddings example to the computer-vision app: pick two images and compare their CLIP embeddings by cosine similarity.
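Why the cosine can collapse to a dot product: for unit-length vectors the denominator of the cosine formula is 1. The snippet below is a standalone illustration, not code from this PR; `dot`, `l2Normalize`, and `cosine` are hypothetical helper names.

```typescript
// Plain dot product of two equally sized vectors.
function dot(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error("length mismatch");
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Scales a vector to unit L2 norm (the models in this PR do this inside the .pte).
function l2Normalize(v: Float32Array): Float32Array {
  const norm = Math.sqrt(dot(v, v));
  if (norm === 0) throw new Error("cannot normalize a zero vector");
  return v.map((x) => x / norm);
}

// Full cosine similarity, shown only for comparison with the dot product.
function cosine(a: Float32Array, b: Float32Array): number {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

const a = l2Normalize(new Float32Array([3, 4]));
const b = l2Normalize(new Float32Array([4, 3]));
// For normalized inputs the two values agree up to floating-point rounding.
console.log(cosine(a, b), dot(a, b));
```

The shortcut only holds while every model's output really is L2-normalized, so a model that skips the normalization would need the full formula again.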
Rework the computer-vision Image Embeddings screen (based on main's CLIP demo): pick an image and rank editable text labels by CLIP image/text embedding similarity, instead of the uninformative two-image score. Pads the scroll content past the Android nav bar. Point CLIP text + image at v0.10.0 (text re-exported with get_dynamic_dims; image unchanged) and declare the textEmbeddings feature in the app.
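The label ranking in the reworked screen boils down to scoring each text embedding against the image embedding and sorting. A minimal sketch, assuming both embeddings are already L2-normalized and the same length; `LabelEmbedding`, `dotProduct`, and `rankLabels` are illustrative names, not the app's actual code.

```typescript
// Hypothetical pairing of a user-editable label with its text embedding.
type LabelEmbedding = { label: string; embedding: Float32Array };
type ScoredLabel = { label: string; score: number };

// Dot product; equals cosine similarity when both inputs are unit length.
function dotProduct(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error("embedding size mismatch");
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Scores every label against the image embedding, highest similarity first.
function rankLabels(image: Float32Array, labels: LabelEmbedding[]): ScoredLabel[] {
  return labels
    .map(({ label, embedding }) => ({ label, score: dotProduct(image, embedding) }))
    .sort((x, y) => y.score - x.score);
}
```

In a real screen the embeddings would come from the image and text embedding hooks; here they are just plain arrays handed in by the caller.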
- model.{h,cpp}: read get_dynamic_dims once per model and cache it instead
of re-executing the method on every forward() call; reject a present-but-
malformed declaration (wrong dtype/rank/shape, bad min/max/step, or row
count not matching forward's tensor input dims) with an explicit error
instead of silently falling back to exact validation.
- textEmbeddings: throw a clear error when input tokenizes to zero tokens
(was BigInt(undefined)); fix docstring to match no-padding behavior.
- useTextEmbeddings: expose localPath/tokenizerPath like sibling hooks.
- computer-vision: extract shared skImageToBuffer helper, dedup from
classification and imageEmbeddings screens.
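The zero-token guard from the textEmbeddings item can be sketched like this. It is an assumption-laden illustration: `toInt64TokenIds` is a hypothetical helper, and the tokenizer is assumed to return plain numbers that must be widened to int64 for the new Long tensor dtype.

```typescript
// Converts tokenizer output into an int64 payload for a Long tensor.
// Throws a descriptive error up front instead of letting an empty result
// surface later as BigInt(undefined), which raises an opaque TypeError.
function toInt64TokenIds(tokenIds: number[]): BigInt64Array {
  if (tokenIds.length === 0) {
    throw new Error("Input tokenized to zero tokens; nothing to embed.");
  }
  // Each id becomes an 8-byte signed integer; no padding is appended,
  // so the array length is the exact token count.
  return BigInt64Array.from(tokenIds, (id) => BigInt(id));
}

const ids = toInt64TokenIds([101, 7592, 102]);
console.log(ids.length, ids.byteLength);
```

Because the result is a typed array, its underlying bytes can be handed to a byte-oriented tensor path without further conversion.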
- Use unordered_set::contains instead of count()==0 (readability-container-contains).
- Keep the new dynamic-bounds cache members public so the class stays an all-public data carrier; adding private member variables had broken the non-private-member exemption and flagged the existing public members.
Description
Adds text and image embeddings pipelines to the new architecture, achieving parity with the old flow. Embeddings are pure-TypeScript tasks (pooling + L2-norm stay baked into the .pte): text tokenizes and runs forward; image reuses the existing image preprocessor. To run the existing int64-input embedding models unchanged, this adds an int64/Long tensor dtype to the core (the tensor data path is byte-oriented, so it is a small dtype.{h,cpp} + tensor.ts change).
Text inputs are fed at their exact token length (no padding). model.execute validates dynamically-shaped forward inputs against the [min, max, step] bounds exposed by an optional get_dynamic_dims method; models without it keep exact per-dimension validation. This fixes scale-sensitive pooling heads (e.g. DistilUSE's tanh projection), which padding otherwise corrupts.
Includes createTextEmbeddings/createImageEmbeddings tasks, useTextEmbeddings/useImageEmbeddings hooks, models.textEmbeddings/models.imageEmbeddings registry entries, an interactive text-embeddings demo in apps/nlp, and a CLIP zero-shot image-embeddings demo in apps/computer-vision.
Introduces a breaking change?
Type of change
Tested on
Testing instructions
nlp app → Text Embeddings: seeds a sentence library; type a query and Find similar to rank by cosine similarity, switch models via the chips. Verified on a physical Android device (arm64): all-MiniLM-L6-v2 returns 384-dim L2-normalized embeddings (~25 ms/forward on XNNPACK); DistilUSE ranks correctly with a wide similarity spread (previously compressed by padding).
computer-vision app → Image Embeddings: pick an image and rank editable text labels via CLIP zero-shot (image vs. text embeddings). Verified on device.
Screenshots
Related issues
#1247
Checklist
Additional notes
DistilUSE and CLIP (text) are re-exported with the get_dynamic_dims method and pinned to v0.10.0; the remaining text-embedding models (all-MiniLM-L6-v2, all-mpnet-base-v2, multi-qa MiniLM/MPNet, paraphrase-ML) still need re-export to v0.10.0.