[RNE Rewrite] OCR bucketed by benITo47 · Pull Request #1295 · software-mansion/react-native-executorch

benITo47 · 2026-06-30T15:53:42Z

Description

Adds a unified OCR + document-understanding pipeline to react-native-executorch. Two detector/recognizer families ship out of the box, EasyOCR (CRAFT detector + CRNN) and PaddleOCR PP-OCRv6 (DBNet detector + SVTR), behind one createOCR task and a useOCR hook, plus a higher-level document pipeline (createDocumentOCR / useDocumentOCR) that orchestrates orientation correction, UVDoc dewarp, PP-DocLayoutV3 region layout, SLANet table-structure recognition, and reading-order assembly into HTML.

Models use a bucketed PTE contract: each model ships per-size detect_ / recognize_ methods and the pipeline snaps each image to the closest bucket, so there are no dynamic-shape recompiles. The detector/recognizer share one baked contract (RGB input, recognizer normalization/padding, greedy-CTC decode, mean-prob confidence); detectorKind selects only the box decoder and default drop score.
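A minimal sketch of the snap-to-closest step described above, for illustration only. The bucket list, `snapToBucket`, and `detectMethodFor` are hypothetical names and values; the real sizes are whatever methods the exported PTE ships.

```typescript
// Hypothetical bucket sizes (assumption); the real list comes from the PTE.
const DETECT_BUCKETS: readonly number[] = [640, 960, 1280];

/** Return the bucket closest to `size`; ties go to the earlier list entry. */
function snapToBucket(size: number, buckets: readonly number[]): number {
  if (buckets.length === 0) throw new Error("no buckets available");
  let best = buckets[0];
  for (const b of buckets) {
    // Strict comparison keeps the first bucket seen on a tie.
    if (Math.abs(b - size) < Math.abs(best - size)) best = b;
  }
  return best;
}

/** Build the per-size method name for an image's longest side. */
function detectMethodFor(longestSide: number): string {
  return `detect_${snapToBucket(longestSide, DETECT_BUCKETS)}`;
}
```

Because every input maps to one of a fixed set of method names, no shape outside the exported set is ever requested.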

The pipeline is extensible without forking: a 'custom' detectorKind accepts a TypeScript extractBoxes worklet (raw logits → quads, output shapes read from the PTE), and the recognizer's normalization/padding/decode are overridable per model.

Introduces a breaking change?

Yes

No

Type of change

Bug fix (change which fixes an issue)

New feature (change which adds functionality)

Documentation update (improves or adds clarity to existing documentation)

Other (chores, tests, code style improvements etc.)

Tested on

iOS

Android

Testing instructions

OCR screen - pick PaddleOCR (XNNPACK/Vulkan) or EasyOCR; select an image from the gallery → Run OCR. Verify detected regions (green overlay) and the recognized text list with per-region confidence/latency.

Toggle Vertical text and run a stacked/column image (e.g. signage, container codes) - confirm horizontal reads are unaffected and columns are read top-to-bottom.

Document Pipeline screen - run a document photo; verify orientation correction, optional dewarp, layout regions, table HTML, and reading-order output.

Checklist

I have performed a self-review of my code

I have commented my code, particularly in hard-to-understand areas

I have updated the documentation accordingly

My changes generate no new warnings

Additional notes

Two-stage OCR (EasyOCR CRAFT+CRNN / PaddleOCR DBNet+SVTR) plus a document pipeline, on top of rne-rewrite.

- One fused PTE per model with bucketed detect_<S>/recognize_<W> methods and snap-to-closest sizing; a single baked contract, with only the box decoder (detectorKind: 'craft' | 'dbnet') and the drop score per architecture.
- Document pipeline: layout via createObjectDetector, native dewarp/gridSample, SLANet_plus table-structure recognition, structure-guided table HTML.
- Vertical reading (additive, opt-in): page-level column grouping for stacked signage + char-level second CRAFT pass + joint-hconcat recognition; tall lines are no longer flipped flat, and vertical reads skip the drop-score gate.
- Native ops: extractTextBoxes (CRAFT + DBNet), warpQuad, ctcGreedyDecode, gridSample.
- Models hosted on Hugging Face (EasyOCR, PP-OCRv6, PP-DocLayoutV3, PaddleHelpers), downloaded + cached on device; demo screens consume them directly.

… off by default

- ocr_ops.cpp: quantise DBNet quad y-coordinates into fixed row bands before sorting. The previous `|dy| > 10` comparator was not a strict-weak ordering (intransitive), which aborts under libc++ hardening.
- document demo: default dewarp OFF. UVDoc dewarp only helps photographed, physically-warped pages; on flat images it distorts otherwise-clean text. Updated the screen copy accordingly.
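To illustrate why row banding fixes the comparator, here is a rough TypeScript sketch (the actual fix lives in C++ in ocr_ops.cpp). With a tolerance test, boxes at y = 0, 8, 16 are pairwise "same row" for neighbours but not for the outer pair, so the relation is intransitive. Quantising y into a band index first gives a plain lexicographic key. `Box`, `ROW_BAND`, and `sortReadingOrder` are hypothetical names, and the band height is an assumption.

```typescript
type Box = { x: number; y: number };

// Hypothetical band height in pixels (assumption, not the shipped value).
const ROW_BAND = 10;

/** Sort by (row band, x). The key is a tuple of numbers, so the ordering is transitive. */
function sortReadingOrder(boxes: readonly Box[], band: number = ROW_BAND): Box[] {
  return [...boxes].sort((a, b) => {
    const rowA = Math.floor(a.y / band);
    const rowB = Math.floor(b.y / band);
    // Different bands: order by band. Same band: order left to right.
    return rowA !== rowB ? rowA - rowB : a.x - b.x;
  });
}
```

The trade-off is that two boxes straddling a band edge can land in different rows even when they are only a pixel apart, which is the price of a well-defined order.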

…run orientation/dewarp

Verified on-device (Android emulator, PaddleOCR/XNNPACK):

- Vertical OCR (ocr.ts): stacked columns were detected/placed correctly but read as garbage (a vertical "ANTIQUES" → " 1 "). Root cause: the DBNet detector emits one box per text region, not per glyph, fusing stacked letters into a few tall boxes; recognizeGlyphStrip warped each multi-letter box into a single recognizer cell (squashed → garbage), and the char-level re-detect path doesn't split for DBNet. Fix: add splitTallQuad() and split every glyph box into ~square single-letter cells (by height/width) before strip assembly. Now reads "ANTIQUES"/"PARKING" at 91-93% (both the column and tall-single paths). [Bug 2]
- Bucketed-OCR memory (model.cpp + core/model.ts + ocr.ts + documentOCR.ts): each detect_<S>/recognize_<W> method's planned-memory arena was cached for the model's lifetime, so memory grew unbounded as image/box sizes varied (worst on CoreML, one compiled graph per method). Ported main's unload-after-use via the ET API: expose Model.unloadMethod() (Module::unload_method) and free the bucket arenas after each top-level run (RunOCROptions.release, default true; the document orchestrator frees once per page). Measured: a 640→1280 two-image run holds ~693 MB native heap without unload vs ~341 MB with it. [Bug 3]
- Document orientation/dewarp (documentOCR.ts + document demo): were baked at createDocumentOCR time, so useModel never recreated the model on toggle and the switches did nothing. Made them per-run options on runDocumentOCR(input, {orientation, dewarp}) (mirroring OCR's vertical), defaulting to the config flags; toggles now take effect with no reload. [Bug 4 / in-flight]

Dewarp (Bug 1) needed no code change: on-device the gridSample [-1,1] backward-map convention is correct (near-identity grid on flat pages, correctly flattens a warped page); the mild flat-page distortion is UVDoc emitting a non-identity field and is indistinguishable from a real warp by the grid alone, so default-OFF + the per-run toggle is the right mitigation.
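The splitting idea from the vertical-OCR fix can be sketched roughly as follows. This is not the PR's splitTallQuad(); it is a hypothetical `splitTallQuadSketch` that assumes corners are ordered TL, TR, BR, BL and that the cell count is the rounded height/width ratio.

```typescript
type Point = { x: number; y: number };
type Quad = [Point, Point, Point, Point]; // assumed order: TL, TR, BR, BL

const lerp = (a: Point, b: Point, t: number): Point => ({
  x: a.x + (b.x - a.x) * t,
  y: a.y + (b.y - a.y) * t,
});
const dist = (a: Point, b: Point): number => Math.hypot(b.x - a.x, b.y - a.y);

/** Split a tall quad into roughly square cells stacked top to bottom. */
function splitTallQuadSketch(quad: Quad): Quad[] {
  const [tl, tr, br, bl] = quad;
  const width = (dist(tl, tr) + dist(bl, br)) / 2;
  const height = (dist(tl, bl) + dist(tr, br)) / 2;
  if (width <= 0) return [quad];
  // One cell per "width" of height, never fewer than one.
  const cells = Math.max(1, Math.round(height / width));
  const out: Quad[] = [];
  for (let i = 0; i < cells; i++) {
    const t0 = i / cells;
    const t1 = (i + 1) / cells;
    // Interpolate along the left and right edges so rotated quads split cleanly.
    out.push([lerp(tl, bl, t0), lerp(tr, br, t0), lerp(tr, br, t1), lerp(tl, bl, t1)]);
  }
  return out;
}
```

Each returned cell can then be warped into its own recognizer slot instead of squashing several letters into one.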

Reading order: add readingOrderIndices (column detection via x-coverage sweep, within-column line grouping by vertical overlap, left-to-right within a line, columns left-to-right). Apply to OCR detections and to each document block's lines, replacing the detector's arbitrary / y-only order so two-column pages, split titles, and label/value rows concatenate correctly.

Dewarp guard: dewarpWorklet declines a degenerate warp (one that lacks page boundaries and maps content off-canvas) by comparing sampled pixel activity before/after; if the dewarped page keeps <50% of the source's activity it returns the original, so dewarp can no longer collapse a page to zero detections.
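One way the dewarp guard could look, as a sketch under stated assumptions. The commit does not define "pixel activity"; here it is assumed to be the mean absolute difference between horizontally adjacent grayscale samples. `GrayImage`, `pixelActivity`, and `guardDewarp` are hypothetical names.

```typescript
type GrayImage = { data: Uint8Array; width: number; height: number };

/** Assumed activity measure: mean |difference| between horizontal neighbours. */
function pixelActivity(img: GrayImage, step = 1): number {
  let sum = 0;
  let count = 0;
  for (let y = 0; y < img.height; y += step) {
    for (let x = 0; x + 1 < img.width; x += step) {
      const i = y * img.width + x;
      sum += Math.abs(img.data[i + 1] - img.data[i]);
      count++;
    }
  }
  return count === 0 ? 0 : sum / count;
}

/** Keep the dewarped page only if it retains at least `minRatio` of the source activity. */
function guardDewarp(source: GrayImage, dewarped: GrayImage, minRatio = 0.5): GrayImage {
  const before = pixelActivity(source);
  // A blank source has nothing to lose; return it rather than divide by zero.
  if (before === 0) return source;
  return pixelActivity(dewarped) / before < minRatio ? source : dewarped;
}
```

A warp that pushes content off-canvas leaves a mostly uniform image, so its activity collapses and the original page is returned.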

Warn the raw orientation logits (per-class), the argmax, the decoded rotationCW and confidence from detectOrientationWorklet. console.warn so it surfaces in native logs from the worklet thread.

Only apply page-rotation when the orientation classifier's softmax confidence for its argmax class is >= 0.7 and the predicted angle is non-zero, mirroring PaddleOCR's pipeline. Out-of-distribution inputs (perspective photos, non-documents) produce low-confidence argmaxes that spuriously flip the page; below threshold the page is treated as upright.

Genuine documents score >0.95; OOD frames can land ~0.74, so 0.85 leaves margin to reject the spurious flips a 0.7 gate let through.
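The confidence gate can be sketched as below. The class order `[0, 90, 180, 270]` and the names `softmax` and `gatedRotation` are assumptions for illustration, not the pipeline's actual worklet.

```typescript
// Assumed class order of the orientation head (assumption).
const ORIENTATION_ANGLES: readonly number[] = [0, 90, 180, 270];
const MIN_CONFIDENCE = 0.85;

/** Numerically stable softmax over raw logits. */
function softmax(logits: readonly number[]): number[] {
  const max = Math.max(...logits);
  const exps = logits.map((v) => Math.exp(v - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((v) => v / total);
}

/** Return the clockwise rotation to apply, or 0 when the prediction is not trusted. */
function gatedRotation(logits: readonly number[], minConfidence = MIN_CONFIDENCE): number {
  const probs = softmax(logits);
  let best = 0;
  for (let i = 1; i < probs.length; i++) {
    if (probs[i] > probs[best]) best = i;
  }
  const angle = ORIENTATION_ANGLES[best] ?? 0;
  // Rotate only for a confident, non-zero prediction; otherwise treat the page as upright.
  return probs[best] >= minConfidence && angle !== 0 ? angle : 0;
}
```

With four classes, logits of `[0, 1.5, 0, 0]` give the argmax roughly 0.60 probability, which falls below the gate and leaves the page unrotated.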

Factor a localize() helper that swaps a model spec's hosted modelPath for its downloaded local path (undefined when the optional model is absent or not yet downloaded). Replaces the nested conditional-spread localConfig with a flat object, and aggregates progress/error over just the enabled downloads. Behavior unchanged.

OCROptions gains recognizerNorm (alpha/beta), recognizerPadValue, and an optional decode(logits, charset) -> {text, confidence}. Defaults preserve the SVTR/CRNN contract ((x/255-0.5)/0.5, pad 128, greedy CTC), so existing models are unchanged; a model with different normalization or a non-CTC head (attention/AR) now slots in as pure config. decode runs on the worklet thread (must be a worklet).
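For reference, the default contract named above (normalization `(x/255-0.5)/0.5`, greedy CTC, mean-probability confidence) might be sketched like this. The function names are hypothetical, the blank class is assumed to be index 0 with `charset[i - 1]` mapping class `i`, and the input is assumed to be per-timestep probabilities rather than raw logits.

```typescript
type Decoded = { text: string; confidence: number };

/** Default recognizer normalization: maps [0, 255] to [-1, 1]. */
function normalizePixel(x: number): number {
  return (x / 255 - 0.5) / 0.5;
}

/** Greedy CTC: argmax per step, collapse repeats, drop blanks. */
function ctcGreedyDecodeSketch(
  probs: readonly (readonly number[])[],
  charset: readonly string[],
  blank = 0, // assumed blank index
): Decoded {
  let text = "";
  let probSum = 0;
  let emitted = 0;
  let prev = blank;
  for (const step of probs) {
    let best = 0;
    for (let c = 1; c < step.length; c++) {
      if (step[c] > step[best]) best = c;
    }
    // Emit only when the class is not blank and differs from the previous step.
    if (best !== blank && best !== prev) {
      text += charset[best - 1] ?? "";
      probSum += step[best];
      emitted++;
    }
    prev = best;
  }
  // Confidence is the mean probability of the emitted characters.
  return { text, confidence: emitted === 0 ? 0 : probSum / emitted };
}
```

A custom `decode` for a non-CTC head would return the same `{text, confidence}` shape while replacing the loop body entirely.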

Drop the per-call console.warn of orientation logits added for the OOD investigation; the confidence gate is the shipping fix.

…nputs

- Extract resizeFactors() (points.ts) so scalePoint and scaleBox derive the letterbox/stretch scale+offset once instead of each recomputing it (#6).
- boundingBoxOf / bboxOfQuad / boundingQuadOf return a zero box/quad for empty input instead of Infinity bounds (#11).
- orderQuad returns a copy unchanged when not given exactly 4 corners (#12).
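The empty-input guard can be illustrated with a small sketch. The `Box` shape and the name `boundingBoxOfSketch` are assumptions; the PR's helpers may use a different representation.

```typescript
type Point = { x: number; y: number };
type Box = { x: number; y: number; width: number; height: number }; // assumed shape

/** Axis-aligned bounds of a point set; a zero box when there are no points. */
function boundingBoxOfSketch(points: readonly Point[]): Box {
  // Without this guard the min/max seeds below would leak out as +/-Infinity.
  if (points.length === 0) return { x: 0, y: 0, width: 0, height: 0 };
  let minX = Infinity;
  let minY = Infinity;
  let maxX = -Infinity;
  let maxY = -Infinity;
  for (const p of points) {
    minX = Math.min(minX, p.x);
    minY = Math.min(minY, p.y);
    maxX = Math.max(maxX, p.x);
    maxY = Math.max(maxY, p.y);
  }
  return { x: minX, y: minY, width: maxX - minX, height: maxY - minY };
}
```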

detectQuads allocated 7 tensors per call (~30MB at the 960 bucket), freed on return — wasteful on the vertical re-detect path that calls it per box. Pre-allocate the channel-independent set per detect bucket at construction (buildDetectorSets, mirroring buildRecognizerSets); detectQuads now only allocates the source-resize tensor (the lone input-channel-dependent one). Behavior unchanged; disposed alongside recSets.

The NMS suppression loop re-decoded box j via decodeToXyxy on every (i,j) pair — O(N^2) decodes. Decode each candidate to xyxy+area once up front, indexed by candidate position, and have both loops read the cached values. Same result; decode work drops to O(N).
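A sketch of the decode-once pattern, assuming a centre/size candidate encoding. `Candidate`, `decodeToXyxySketch`, and `nmsSketch` are hypothetical stand-ins; the real decodeToXyxy signature is not shown in this PR text.

```typescript
type Candidate = { cx: number; cy: number; w: number; h: number; score: number };
type Xyxy = { x1: number; y1: number; x2: number; y2: number; area: number };

/** Assumed decode: centre/size to corner form, with the area precomputed. */
function decodeToXyxySketch(c: Candidate): Xyxy {
  return { x1: c.cx - c.w / 2, y1: c.cy - c.h / 2, x2: c.cx + c.w / 2, y2: c.cy + c.h / 2, area: c.w * c.h };
}

function iou(a: Xyxy, b: Xyxy): number {
  const iw = Math.max(0, Math.min(a.x2, b.x2) - Math.max(a.x1, b.x1));
  const ih = Math.max(0, Math.min(a.y2, b.y2) - Math.max(a.y1, b.y1));
  const inter = iw * ih;
  const union = a.area + b.area - inter;
  return union <= 0 ? 0 : inter / union;
}

/** Greedy NMS; returns the indices of the kept candidates in score order. */
function nmsSketch(cands: readonly Candidate[], iouThreshold = 0.5): number[] {
  // Decode every candidate exactly once, indexed by candidate position: O(N) decodes.
  const decoded = cands.map(decodeToXyxySketch);
  const order = cands.map((_, i) => i).sort((a, b) => cands[b].score - cands[a].score);
  const suppressed = new Array<boolean>(cands.length).fill(false);
  const keep: number[] = [];
  for (let oi = 0; oi < order.length; oi++) {
    const i = order[oi];
    if (suppressed[i]) continue;
    keep.push(i);
    for (let oj = oi + 1; oj < order.length; oj++) {
      const j = order[oj];
      // Both loops read the cached boxes instead of re-decoding per pair.
      if (!suppressed[j] && iou(decoded[i], decoded[j]) > iouThreshold) suppressed[j] = true;
    }
  }
  return keep;
}
```

The pairwise IoU comparisons are still quadratic in the worst case; only the decode work becomes linear.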

detectorKind gains 'custom': the model's raw detect_<S> outputs (shapes read from the PTE method metadata, allocated for you) are handed to an extractBoxes(outputs, s) worklet that returns quads in detector space — the pipeline maps them to image pixels and applies dropScore, exactly like the built-in craft/dbnet decoders. Pairs with the recognizer decode hook so a fully foreign architecture slots in as config. Built-in paths unchanged (DetSet now holds a tOutputs list; tOutputs[0] is the heatmap). extractBoxes must be a worklet.

…ity work

- points.ts: move scalePoint's JSDoc back onto scalePoint (the resizeFactors insertion had orphaned it).
- ocr.ts: update the baked-contract comments now that recognizer norm/pad/decode and the detector are per-model overridable; mark RECOGNIZER_* and the detectQuads scratch comments as defaults / cached.
- ocrHelpers.ts: rename the within-line sort helper cx -> xSum (it returns the edge sum, not the center; avoids clashing with the column-center cx).
- ImageViewport.tsx: boxes are in the displayed image's px, not 'original'.

Include <jsi/jsi.h> directly (65 jsi:: uses, previously transitive) and <opencv2/core/check.hpp>.

Split ocr.ts (1059 lines) so the task file holds only the public API + createOCR factory:

- Move the tensor-pipeline engine (detectQuads, recognizeQuad, recognizeGlyphStrip, readStackedColumn, readBoxVertical), the per-bucket builders, and their context/set types into a new internal ocrPipeline.ts (imported only by ocr.ts; not re-exported from the package index).
- Extract validateDetectorSchema / buildExtractOpts / disposeDetSets / disposeRecSets, removing the duplicated recognizer/detector dispose loops.
- Hoist inline helpers to module scope: pushDetection (ocr.ts), lerp + xSum (ocrHelpers.ts).
- Drop the unused DetectContext.format field.

ocr.ts 1059 -> 459 lines. Behavior-preserving; verified on-device (Android): detector localization + horizontal/vertical recognition unchanged.

…o image_ops

ocr_ops.cpp held two things that aren't OCR-specific:

- The JSI option-readers (getNumberProp/getStringProp/getBoolProp/getBoolPropOr) are generic plumbing; promote them to utils.h so image_ops/ocr_ops share one copy instead of re-rolling the same hasProperty/isX pattern.
- warpQuad is a generic perspective-crop image op (getPerspectiveTransform + warpPerspective + pad/align), no OCR in the math. Move it to image_ops.cpp next to resize/cvtColor/gridSample; update headers + install.cpp wiring.

ocr_ops.cpp 835 -> 679 lines; it now holds only OCR detector/sequence postprocessing (CRAFT grouping, DBNet contouring, CTC argmax). Verified with check-cpp-warnings.sh (clang++ -fsyntax-only vs ExecuTorch/JSI/OpenCV): clean.

…GHT rename

- models.ts: the OCR family comment claimed the recognizer profile (norm, color, padding, CTC, confidence) is 'derived from detectorKind'; that is stale. detectorKind only selects the box decoder + default drop score; the rest is the shared baked contract, now overridable via recognizerNorm/recognizerPadValue/decode.
- ImageViewport: keep VIEW_HEIGHT as-is (drop the DEFAULT_VIEW_HEIGHT rename); the per-instance override already flows through viewHeight, so the rename was churn.

msluszniak · 2026-06-30T16:09:07Z

First, please fix the PR description, since it is struck through, and fix the clang-tidy warnings.

barhanc

I only went over the library implementation and didn't test the apps, as there are quite a few changes to be made.
Why do we have both OCR and DocumentOCR? What are the use cases where one would want OCR without the additional processing, and if there are any, wouldn't it be better to make it configurable from a single useOCR hook?
Since OCR seems to be quite a large addition, I think we can add an ocr/ directory to cv/ that will host all the required pre/post-processing helpers.
Please fix the cspell warnings.
Please fix the PR description.
The added repo names on Hugging Face do not conform to the snake_case convention.

barhanc · 2026-07-01T10:13:35Z

Changes to the core should first be introduced in a separate PR. Also let's wait with this one because we will be adding support for dynamic methods and as I understand this is required because of the bucketed approach.

barhanc · 2026-07-01T10:14:41Z

What is the reason this file changed? It looks to me like these are purely stylistic changes.

barhanc · 2026-07-01T10:24:40Z

+   * Use this to bound native memory when many distinct methods are executed
+   * over a session — e.g. bucketed OCR, where each `detect_<S>`/`recognize_<W>`
+   * size that is ever run would otherwise stay resident for the model's
+   * lifetime.


Please remove this comment.


benITo47 added 19 commits June 29, 2026 23:46

[RNE Rewrite] debug(ocr): log orientation head output

9a92ce3


[RNE Rewrite] fix(ocr): raise orientation confidence gate to 0.85

f6f30e6


[RNE Rewrite] chore(ocr): remove orientation debug logging

5fae2c1


[RNE Rewrite] chore(ocr): include-what-you-use in ocr_ops.cpp

6bd1d99


benITo47 requested a review from barhanc June 30, 2026 15:53

barhanc assigned benITo47 Jun 30, 2026

barhanc added the refactoring and feature labels Jun 30, 2026

barhanc changed the title ~~Rne rewrite ocr bucketed~~ [RNE Rewrite] ocr bucketed Jun 30, 2026

msluszniak linked an issue Jun 30, 2026 that may be closed by this pull request

[RNE Rewrite] CV - add OCR pipeline implementation #1240

Open

msluszniak changed the title ~~[RNE Rewrite] ocr bucketed~~ [RNE Rewrite] OCR bucketed Jun 30, 2026

[RNE Rewrite] satisfy clang-tidy on ocr_ops/image_ops

6080120

barhanc requested changes Jul 1, 2026


benITo47 marked this pull request as draft July 1, 2026 12:32

benITo47 added 9 commits July 1, 2026 19:11

[RNE Rewrite] refactor(cv): reorg native ops — add rotate, split text…

98bd0c1

…_boxes_ops, inline JSI option readers, warpQuad offset/clear

[RNE Rewrite] refactor(ocr): move into cv/ocr, split textBoxes ops, t…

3640b17

…ensor-thread document pipeline + Category-B builders

[RNE Rewrite] chore(ocr): explicit per-backend model presets, exports…

9e859a2

…, cspell

[RNE Rewrite] refactor(cv): split OCR native ops, add warpQuad/warpBy…

b31d924

…Grid

[RNE Rewrite] refactor(ocr): Ocr camelcase, pluggable extractors, lea…

b521f1f

…k-safe factories

[RNE Rewrite] chore: ocr screen cleanup, cspell words

8432dd7

[RNE Rewrite] refactor(ocr): audit cleanup — single bounds helper, na…

773eecf

…ming, JSDoc

[RNE Rewrite] fix(cv): catch std::exception — prebuilt OpenCV ships -…

507f7be

…fno-rtti

[RNE Rewrite] refactor(ocr): review cleanup — ocr family folder, slim…

1177254

…mer native contract, generic bounds

benITo47 force-pushed the rne-rewrite-ocr-bucketed branch from 67d86a8 to 1177254 July 2, 2026 11:09

benITo47 wants to merge 29 commits into rne-rewrite from rne-rewrite-ocr-bucketed

3 participants
