[RNE Rewrite] OCR bucketed #1295
Draft
benITo47 wants to merge 29 commits into
Two-stage OCR (EasyOCR CRAFT+CRNN / PaddleOCR DBNet+SVTR) plus a document pipeline, on top of rne-rewrite.
- One fused PTE per model with bucketed detect_<S>/recognize_<W> methods and snap-to-closest sizing; a single baked contract, with only the box decoder (detectorKind: 'craft' | 'dbnet') and the drop score per architecture.
- Document pipeline: layout via createObjectDetector, native dewarp/gridSample, SLANet_plus table-structure recognition, structure-guided table HTML.
- Vertical reading (additive, opt-in): page-level column grouping for stacked signage + char-level second CRAFT pass + joint-hconcat recognition; tall lines are no longer flipped flat, and vertical reads skip the drop-score gate.
- Native ops: extractTextBoxes (CRAFT + DBNet), warpQuad, ctcGreedyDecode, gridSample.
- Models hosted on Hugging Face (EasyOCR, PP-OCRv6, PP-DocLayoutV3, PaddleHelpers), downloaded + cached on device; demo screens consume them directly.
… off by default
- ocr_ops.cpp: quantise DBNet quad y-coordinates into fixed row bands before sorting. The previous `|dy| > 10` comparator was not a strict-weak ordering (intransitive), which aborts under libc++ hardening.
- document demo: default dewarp OFF. UVDoc dewarp only helps photographed, physically-warped pages; on flat images it distorts otherwise-clean text. Updated the screen copy accordingly.
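
The row-band fix in the commit above can be illustrated in TypeScript (the real change lives in ocr_ops.cpp). This is a minimal sketch: `sortByRowBands` and the 10 px band height are hypothetical, and each box is reduced to one anchor point. With a `|dy| > 10` tolerance, boxes at y = 0, 8 and 16 are pairwise "same row" for (0, 8) and (8, 16) but not for (0, 16), which is the intransitivity the commit describes. Quantising y into a band index first gives a total order on (band, x).

```typescript
// Hypothetical anchor-point representation of a detected box.
interface Anchored {
  x: number;
  y: number;
}

// Assumed band height in pixels; the value used in the PR is not shown.
const ROW_BAND_PX = 10;

// Sort by (row band, x). Integer band keys make the comparator a valid
// strict weak ordering, unlike a pairwise "|dy| > tolerance" test.
function sortByRowBands<T extends Anchored>(
  boxes: readonly T[],
  bandPx: number = ROW_BAND_PX
): T[] {
  const band = (b: T): number => Math.floor(b.y / bandPx);
  return [...boxes].sort((a, b) => band(a) - band(b) || a.x - b.x);
}

const sorted = sortByRowBands([
  { x: 30, y: 0 },
  { x: 20, y: 8 },
  { x: 10, y: 16 },
]);
console.log(sorted.map((b) => b.x));
```

The trade-off is that two boxes straddling a band boundary land in different rows even when they are only a pixel apart vertically.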
…run orientation/dewarp
Verified on-device (Android emulator, PaddleOCR/XNNPACK):
- Vertical OCR (ocr.ts): stacked columns were detected/placed correctly but
read as garbage (a vertical "ANTIQUES" → " 1 "). Root cause: the DBNet
detector emits one box per text region, not per glyph, fusing stacked letters
into a few tall boxes; recognizeGlyphStrip warped each multi-letter box into a
single recognizer cell (squashed → garbage), and the char-level re-detect path
doesn't split for DBNet. Fix: add splitTallQuad() and split every glyph box
into ~square single-letter cells (by height/width) before strip assembly.
Now reads "ANTIQUES"/"PARKING" at 91-93% (both the column and tall-single
paths). [Bug 2]
- Bucketed-OCR memory (model.cpp + core/model.ts + ocr.ts + documentOCR.ts):
each detect_<S>/recognize_<W> method's planned-memory arena was cached for the
model's lifetime, so memory grew unbounded as image/box sizes varied (worst on
CoreML, one compiled graph per method). Ported main's unload-after-use via the
ET API: expose Model.unloadMethod() (Module::unload_method) and free the bucket
arenas after each top-level run (RunOCROptions.release, default true; the
document orchestrator frees once per page). Measured: a 640→1280 two-image run
holds ~693 MB native heap without unload vs ~341 MB with it. [Bug 3]
- Document orientation/dewarp (documentOCR.ts + document demo): the flags were baked at
createDocumentOCR time, so useModel never recreated the model on toggle and the
switches did nothing. Made them per-run options on runDocumentOCR(input,
{orientation, dewarp}) (mirroring OCR's vertical), defaulting to the config
flags; toggles now take effect with no reload. [Bug 4 / in-flight]
Dewarp (Bug 1) needed no code change: on-device the gridSample [-1,1] backward-map
convention is correct (near-identity grid on flat pages, correctly flattens a
warped page); the mild flat-page distortion is UVDoc emitting a non-identity field
and is indistinguishable from a real warp by the grid alone, so default-OFF +
the per-run toggle is the right mitigation.
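
The splitTallQuad fix from the first bullet above can be sketched as follows. This is an illustration only: it works on axis-aligned boxes rather than quads, and `splitTallBox` and the `Box` shape are hypothetical names, not the PR's code. The idea is to derive the cell count from the height/width ratio so each cell is roughly square.

```typescript
// Hypothetical axis-aligned box; the PR's splitTallQuad operates on quads.
interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Split a tall box into ~square cells stacked top to bottom.
// Boxes that are wide or degenerate are returned unchanged.
function splitTallBox(box: Box): Box[] {
  if (box.width <= 0 || box.height <= 0) return [box];
  const cells = Math.max(1, Math.round(box.height / box.width));
  const cellHeight = box.height / cells;
  return Array.from({ length: cells }, (_, i) => ({
    x: box.x,
    y: box.y + i * cellHeight,
    width: box.width,
    height: cellHeight,
  }));
}

// A 20x160 column (e.g. eight stacked letters) becomes eight 20x20 cells.
const column = splitTallBox({ x: 0, y: 0, width: 20, height: 160 });
console.log(column.length, column[7]);
```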
Reading order: add readingOrderIndices (column detection via x-coverage sweep, within-column line grouping by vertical overlap, left-to-right within a line, columns left-to-right). Apply to OCR detections and to each document block's lines, replacing the detector's arbitrary / y-only order so two-column pages, split titles, and label/value rows concatenate correctly.

Dewarp guard: dewarpWorklet declines a degenerate warp (one that lacks page boundaries and maps content off-canvas) by comparing sampled pixel activity before/after; if the dewarped page keeps <50% of the source's activity it returns the original, so dewarp can no longer collapse a page to zero detections.
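
A minimal sketch of the dewarp guard, under stated assumptions: the commit does not say how "pixel activity" is measured, so here it is assumed to be the summed absolute difference between neighbouring sampled grayscale pixels. `sampledActivity` and `keepDewarped` are hypothetical names.

```typescript
// Assumed activity measure: sum of absolute differences between sampled
// neighbours in a flat grayscale buffer. The PR's actual metric may differ.
function sampledActivity(gray: ArrayLike<number>, stride: number = 1): number {
  let sum = 0;
  for (let i = stride; i < gray.length; i += stride) {
    sum += Math.abs((gray[i] ?? 0) - (gray[i - stride] ?? 0));
  }
  return sum;
}

// Threshold from the commit message: keep the dewarped page only if it
// retains at least 50% of the source's activity.
const MIN_ACTIVITY_RATIO = 0.5;

function keepDewarped(
  sourceGray: ArrayLike<number>,
  dewarpedGray: ArrayLike<number>
): boolean {
  const before = sampledActivity(sourceGray);
  const after = sampledActivity(dewarpedGray);
  return after >= MIN_ACTIVITY_RATIO * before;
}

const page = [0, 255, 0, 255, 0, 255];
const collapsed = [0, 0, 0, 0, 0, 0]; // content mapped off-canvas
console.log(keepDewarped(page, page), keepDewarped(page, collapsed));
```

When `keepDewarped` returns false the caller would fall back to the original image, which is the behaviour the commit describes.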
Log the raw orientation logits (per class), the argmax, and the decoded rotationCW and confidence from detectOrientationWorklet. Uses console.warn so the output surfaces in native logs from the worklet thread.
Only apply page-rotation when the orientation classifier's softmax confidence for its argmax class is >= 0.7 and the predicted angle is non-zero, mirroring PaddleOCR's pipeline. Out-of-distribution inputs (perspective photos, non-documents) produce low-confidence argmaxes that spuriously flip the page; below threshold the page is treated as upright.
Genuine documents score >0.95; OOD frames can land ~0.74, so 0.85 leaves margin to reject the spurious flips a 0.7 gate let through.
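
The confidence gate from the two commits above can be sketched as below. The class-to-angle order `[0, 90, 180, 270]` and the function names are assumptions for illustration; only the thresholds (0.7, later 0.85) and the "treat as upright below threshold" rule come from the commit messages.

```typescript
// Assumed class order of the orientation classifier (not confirmed by the PR).
const ORIENTATION_ANGLES = [0, 90, 180, 270] as const;
const MIN_ORIENTATION_CONFIDENCE = 0.85;

// Numerically stable softmax.
function softmax(logits: readonly number[]): number[] {
  const max = Math.max(...logits);
  const exps = logits.map((v) => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((v) => v / sum);
}

// Returns the rotation to apply; 0 when the argmax is not confident enough.
function gatedRotation(
  logits: readonly number[],
  minConfidence: number = MIN_ORIENTATION_CONFIDENCE
): { rotationCW: number; confidence: number } {
  const probs = softmax(logits);
  let arg = 0;
  for (let i = 1; i < probs.length; i++) {
    if ((probs[i] ?? 0) > (probs[arg] ?? 0)) arg = i;
  }
  const confidence = probs[arg] ?? 0;
  const angle = ORIENTATION_ANGLES[arg] ?? 0;
  return { rotationCW: confidence >= minConfidence ? angle : 0, confidence };
}

// Softmax confidence here is about 0.75: it passes a 0.7 gate
// but is rejected by the 0.85 gate.
console.log(gatedRotation([0, 2.2, 0, 0], 0.7), gatedRotation([0, 2.2, 0, 0]));
```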
Factor a localize() helper that swaps a model spec's hosted modelPath for its downloaded local path (undefined when the optional model is absent or not yet downloaded). Replaces the nested conditional-spread localConfig with a flat object, and aggregates progress/error over just the enabled downloads. Behavior unchanged.
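
One possible shape for such a helper, as a sketch only: the real localize() signature is not shown in this conversation, so the parameters and the `HostedSpec` type below are assumptions.

```typescript
// Hypothetical minimal spec shape: anything carrying a hosted modelPath.
interface HostedSpec {
  modelPath: string;
}

// Swap the hosted modelPath for the downloaded local path. Returns undefined
// when the optional model is absent or its download has not finished.
function localize<T extends HostedSpec>(
  spec: T | undefined,
  localPath: string | undefined
): T | undefined {
  if (spec === undefined || localPath === undefined) return undefined;
  return { ...spec, modelPath: localPath };
}

// A flat config object instead of nested conditional spreads.
// URLs and paths are placeholders.
const localConfig = {
  detector: localize({ modelPath: "https://example.invalid/det.pte" }, "/cache/det.pte"),
  layout: localize(undefined, undefined), // optional model disabled
};
console.log(localConfig);
```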
OCROptions gains recognizerNorm (alpha/beta), recognizerPadValue, and an
optional decode(logits, charset) -> {text, confidence}. Defaults preserve
the SVTR/CRNN contract ((x/255-0.5)/0.5, pad 128, greedy CTC), so existing
models are unchanged; a model with different normalization or a non-CTC
head (attention/AR) now slots in as pure config. decode runs on the
worklet thread (must be a worklet).
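
To make the default contract concrete, here is a hedged sketch of the normalization and a greedy CTC decode with the `decode(logits, charset) -> {text, confidence}` shape. Several details are assumptions: that alpha/beta mean `alpha * x + beta`, that class 0 is the CTC blank, that class k maps to `charset[k - 1]`, and that logits arrive as one row per timestep. In the real pipeline a custom decode would also have to be a worklet.

```typescript
// (x/255 - 0.5)/0.5 rewritten as alpha * x + beta (assumed meaning of alpha/beta).
const DEFAULT_ALPHA = 2 / 255;
const DEFAULT_BETA = -1;

function normalizePixel(
  x: number,
  alpha: number = DEFAULT_ALPHA,
  beta: number = DEFAULT_BETA
): number {
  return alpha * x + beta;
}

// Greedy CTC: argmax per timestep, collapse repeats, drop blanks.
// Confidence is the mean softmax probability of the emitted characters.
function greedyCtcDecode(
  logits: readonly (readonly number[])[],
  charset: readonly string[],
  blank: number = 0
): { text: string; confidence: number } {
  let text = "";
  let probSum = 0;
  let emitted = 0;
  let prev = blank;
  for (const row of logits) {
    let arg = 0;
    for (let c = 1; c < row.length; c++) {
      if ((row[c] ?? -Infinity) > (row[arg] ?? -Infinity)) arg = c;
    }
    if (arg !== blank && arg !== prev) {
      const max = row[arg] ?? 0;
      const denom = row.reduce((s, v) => s + Math.exp(v - max), 0);
      text += charset[arg - 1] ?? "";
      probSum += 1 / denom; // softmax probability of the argmax class
      emitted++;
    }
    prev = arg;
  }
  return { text, confidence: emitted > 0 ? probSum / emitted : 0 };
}

const steps = [[0, 9, 0], [0, 9, 0], [9, 0, 0], [0, 9, 0], [0, 0, 9]];
console.log(greedyCtcDecode(steps, ["A", "B"]), normalizePixel(128));
```

With the default constants the pad value 128 normalizes to roughly zero, which is presumably why it pairs with this normalization.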
Drop the per-call console.warn of orientation logits added for the OOD investigation; the confidence gate is the shipping fix.
…nputs
- Extract resizeFactors() (points.ts) so scalePoint and scaleBox derive the letterbox/stretch scale+offset once instead of each recomputing it (#6).
- boundingBoxOf / bboxOfQuad / boundingQuadOf return a zero box/quad for empty input instead of Infinity bounds (#11).
- orderQuad returns a copy unchanged when not given exactly 4 corners (#12).
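
The empty-input fix is easy to see in a small sketch. The `Point` and `Box` shapes here are assumptions; only the behaviour (zero box instead of Infinity bounds) comes from the commit message.

```typescript
interface Point {
  x: number;
  y: number;
}

// Assumed box shape for illustration.
interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Axis-aligned bounds of a point set. Without the early return, an empty
// input would leave min = Infinity and max = -Infinity in the result.
function boundingBoxOf(points: readonly Point[]): Box {
  if (points.length === 0) return { x: 0, y: 0, width: 0, height: 0 };
  let minX = Infinity;
  let minY = Infinity;
  let maxX = -Infinity;
  let maxY = -Infinity;
  for (const p of points) {
    minX = Math.min(minX, p.x);
    minY = Math.min(minY, p.y);
    maxX = Math.max(maxX, p.x);
    maxY = Math.max(maxY, p.y);
  }
  return { x: minX, y: minY, width: maxX - minX, height: maxY - minY };
}

console.log(boundingBoxOf([]), boundingBoxOf([{ x: 1, y: 2 }, { x: 5, y: 4 }]));
```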
detectQuads allocated 7 tensors per call (~30MB at the 960 bucket), freed on return — wasteful on the vertical re-detect path that calls it per box. Pre-allocate the channel-independent set per detect bucket at construction (buildDetectorSets, mirroring buildRecognizerSets); detectQuads now only allocates the source-resize tensor (the lone input-channel-dependent one). Behavior unchanged; disposed alongside recSets.
The NMS suppression loop re-decoded box j via decodeToXyxy on every (i,j) pair — O(N^2) decodes. Decode each candidate to xyxy+area once up front, indexed by candidate position, and have both loops read the cached values. Same result; decode work drops to O(N).
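
A sketch of the decode-once pattern, assuming a center/size candidate encoding (the PR's actual candidate layout is not shown). `decodeToXyxy` is named after the function in the commit, but its body here is illustrative.

```typescript
// Assumed candidate encoding: box center, size and score.
interface Candidate {
  cx: number;
  cy: number;
  w: number;
  h: number;
  score: number;
}

interface Decoded {
  x1: number;
  y1: number;
  x2: number;
  y2: number;
  area: number;
}

function decodeToXyxy(c: Candidate): Decoded {
  const x1 = c.cx - c.w / 2;
  const y1 = c.cy - c.h / 2;
  return { x1, y1, x2: x1 + c.w, y2: y1 + c.h, area: c.w * c.h };
}

// Greedy NMS. Each candidate is decoded exactly once up front (O(N));
// the O(N^2) pair loop only reads the cached xyxy + area values.
function nms(candidates: readonly Candidate[], iouThreshold: number = 0.5): number[] {
  const decoded = candidates.map(decodeToXyxy);
  const order = candidates
    .map((_, i) => i)
    .sort((a, b) => candidates[b]!.score - candidates[a]!.score);
  const suppressed = new Array<boolean>(candidates.length).fill(false);
  const keep: number[] = [];
  for (let oi = 0; oi < order.length; oi++) {
    const i = order[oi]!;
    if (suppressed[i]) continue;
    keep.push(i);
    const a = decoded[i]!;
    for (let oj = oi + 1; oj < order.length; oj++) {
      const j = order[oj]!;
      if (suppressed[j]) continue;
      const b = decoded[j]!;
      const iw = Math.max(0, Math.min(a.x2, b.x2) - Math.max(a.x1, b.x1));
      const ih = Math.max(0, Math.min(a.y2, b.y2) - Math.max(a.y1, b.y1));
      const inter = iw * ih;
      const union = a.area + b.area - inter;
      if (union > 0 && inter / union > iouThreshold) suppressed[j] = true;
    }
  }
  return keep; // indices into `candidates`, highest score first
}

console.log(
  nms([
    { cx: 10, cy: 10, w: 10, h: 10, score: 0.9 },
    { cx: 11, cy: 10, w: 10, h: 10, score: 0.8 },
    { cx: 100, cy: 100, w: 10, h: 10, score: 0.7 },
  ])
);
```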
detectorKind gains 'custom': the model's raw detect_<S> outputs (shapes read from the PTE method metadata, allocated for you) are handed to an extractBoxes(outputs, s) worklet that returns quads in detector space — the pipeline maps them to image pixels and applies dropScore, exactly like the built-in craft/dbnet decoders. Pairs with the recognizer decode hook so a fully foreign architecture slots in as config. Built-in paths unchanged (DetSet now holds a tOutputs list; tOutputs[0] is the heatmap). extractBoxes must be a worklet.
…ity work
- points.ts: move scalePoint's JSDoc back onto scalePoint (the resizeFactors insertion had orphaned it).
- ocr.ts: update the baked-contract comments now that recognizer norm/pad/decode and the detector are per-model overridable; mark RECOGNIZER_* and the detectQuads scratch comments as defaults / cached.
- ocrHelpers.ts: rename the within-line sort helper cx -> xSum (it returns the edge sum, not the center; avoids clashing with the column-center cx).
- ImageViewport.tsx: boxes are in the displayed image's px, not 'original'.
Include <jsi/jsi.h> directly (65 jsi:: uses, previously transitive) and <opencv2/core/check.hpp>.
Split ocr.ts (1059 lines) so the task file holds only the public API + createOCR factory:
- Move the tensor-pipeline engine (detectQuads, recognizeQuad, recognizeGlyphStrip, readStackedColumn, readBoxVertical), the per-bucket builders, and their context/set types into a new internal ocrPipeline.ts (imported only by ocr.ts; not re-exported from the package index).
- Extract validateDetectorSchema / buildExtractOpts / disposeDetSets / disposeRecSets, removing the duplicated recognizer/detector dispose loops.
- Hoist inline helpers to module scope: pushDetection (ocr.ts), lerp + xSum (ocrHelpers.ts).
- Drop the unused DetectContext.format field.

ocr.ts 1059 -> 459 lines. Behavior-preserving; verified on-device (Android): detector localization + horizontal/vertical recognition unchanged.
…o image_ops
ocr_ops.cpp held two things that aren't OCR-specific:
- The JSI option-readers (getNumberProp/getStringProp/getBoolProp/getBoolPropOr) are generic plumbing: promote them to utils.h so image_ops/ocr_ops share one copy instead of re-rolling the same hasProperty/isX pattern.
- warpQuad is a generic perspective-crop image op (getPerspectiveTransform + warpPerspective + pad/align), no OCR in the math. Move it to image_ops.cpp next to resize/cvtColor/gridSample; update headers + install.cpp wiring.

ocr_ops.cpp 835 -> 679 lines; it now holds only OCR detector/sequence postprocessing (CRAFT grouping, DBNet contouring, CTC argmax). Verified with check-cpp-warnings.sh (clang++ -fsyntax-only vs ExecuTorch/JSI/OpenCV): clean.
…GHT rename
- models.ts: the OCR family comment claimed the recognizer profile (norm, color, padding, CTC, confidence) is 'derived from detectorKind', which is stale. detectorKind only selects the box decoder + default drop score; the rest is the shared baked contract, now overridable via recognizerNorm/recognizerPadValue/decode.
- ImageViewport: keep VIEW_HEIGHT as-is (drop the DEFAULT_VIEW_HEIGHT rename); the per-instance override already flows through viewHeight, so the rename was churn.
Member
Firstly, fix the PR description, since it is struck through, and fix clang-tidy.
barhanc
requested changes
Jul 1, 2026
Contributor
- I only went over the library implementation, didn't test apps, as there are quite some changes to be made.
- Why do we have both OCR and DocumentOCR? What are the use cases where one would want OCR without the additional processing, and if there are any, wouldn't it be better to make it configurable from a single `useOCR` hook?
- Since OCR seems to be quite a large addition, I think we can add an `ocr/` directory to `cv/` that will host all the required pre/post-processing helpers.
- Please fix the cspell warnings.
- Please fix the PR description.
- The added repo names on huggingface do not conform to snake-case convention.
Contributor
Changes to the core should first be introduced in a separate PR. Also let's wait with this one because we will be adding support for dynamic methods and as I understand this is required because of the bucketed approach.
Contributor
What is the reason this file changed? It looks to me these are purely stylistic changes.
Comment on lines +127 to +130
 * Use this to bound native memory when many distinct methods are executed
 * over a session, e.g. bucketed OCR, where each `detect_<S>`/`recognize_<W>`
 * size that is ever run would otherwise stay resident for the model's
 * lifetime.
Contributor
Please remove this comment.
…_boxes_ops, inline JSI option readers, warpQuad offset/clear
…ensor-thread document pipeline + Category-B builders
…mer native contract, generic bounds
67d86a8 to 1177254
Description
Adds a unified OCR + document-understanding pipeline to react-native-executorch. Two detector/recognizer families ship out of the box - EasyOCR (CRAFT detector + CRNN) and PaddleOCR PP-OCRv6 (DBNet detector + SVTR) - behind one createOCR task and a useOCR hook, plus a higher-level document pipeline (createDocumentOCR / useDocumentOCR) that orchestrates orientation correction, UVDoc dewarp, PP-DocLayoutV3 region layout, SLANet table-structure recognition, and reading-order assembly into HTML.
Models use a bucketed PTE contract: each model ships per-size detect_<S> / recognize_<W> methods and the pipeline snaps each image to the closest bucket, so there are no dynamic-shape recompiles. The detector/recognizer share one baked contract (RGB input, recognizer normalization/padding, greedy-CTC decode, mean-prob confidence); detectorKind selects only the box decoder and default drop score.
The pipeline is extensible without forking: a 'custom' detectorKind accepts a TypeScript extractBoxes worklet (raw logits → quads, output shapes read from the PTE), and the recognizer's normalization/padding/decode are overridable per model.
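
The snap-to-closest bucket selection described above might look like the sketch below. The helper names, the tie-breaking rule (prefer the smaller bucket) and the example bucket list are assumptions; only the detect_<S> naming and the "closest bucket" rule come from the description.

```typescript
// Pick the bucket nearest to the requested size.
// Ties resolve to the smaller bucket (an assumption; the PR does not say).
function snapToClosestBucket(size: number, buckets: readonly number[]): number {
  if (buckets.length === 0) {
    throw new Error("at least one bucket is required");
  }
  let best = buckets[0]!;
  for (const b of buckets) {
    const d = Math.abs(b - size);
    const bestD = Math.abs(best - size);
    if (d < bestD || (d === bestD && b < best)) best = b;
  }
  return best;
}

// Build the PTE method name for a given image size, following detect_<S>.
function detectMethodFor(longSide: number, buckets: readonly number[]): string {
  return `detect_${snapToClosestBucket(longSide, buckets)}`;
}

// Illustrative bucket list; the shipped models may use different sizes.
const DETECT_BUCKETS = [640, 960, 1280];
console.log(detectMethodFor(700, DETECT_BUCKETS), detectMethodFor(1200, DETECT_BUCKETS));
```

Because every input maps to one of a fixed set of method names, each method can be exported with static shapes, which is what avoids dynamic-shape recompiles.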
Introduces a breaking change?
Type of change
Tested on
Testing instructions
Checklist
Additional notes