
[RNE Rewrite] OCR bucketed#1295

Draft
benITo47 wants to merge 29 commits into rne-rewrite from rne-rewrite-ocr-bucketed


Conversation

@benITo47

Contributor

Description

Adds a unified OCR + document-understanding pipeline to react-native-executorch. Two detector/recognizer families ship out of the box, EasyOCR (CRAFT detector + CRNN) and PaddleOCR PP-OCRv6 (DBNet detector + SVTR), behind one createOCR task and a useOCR hook. A higher-level document pipeline (createDocumentOCR / useDocumentOCR) orchestrates orientation correction, UVDoc dewarp, PP-DocLayoutV3 region layout, SLANet table-structure recognition, and reading-order assembly into HTML.

Models use a bucketed PTE contract: each model ships per-size detect_ / recognize_ methods and the pipeline snaps each image to the closest bucket, so there are no dynamic-shape recompiles. The detector/recognizer share one baked contract (RGB input, recognizer normalization/padding, greedy-CTC decode, mean-prob confidence); detectorKind selects only the box decoder and default drop score.
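
The snap-to-closest rule can be sketched as below. This is a minimal illustration, not the PR's code: the bucket list, `snapToBucket`, and `detectMethodName` are hypothetical names, and the real sizes come from the methods each PTE actually ships.

```typescript
// Hypothetical bucket sizes for illustration; the real ones are defined by the model's PTE methods.
const DETECT_BUCKETS: readonly number[] = [640, 960, 1280];

// Return the bucket whose size is closest to the requested size.
// On a tie, the earlier (smaller) bucket wins because the comparison is strict.
function snapToBucket(size: number, buckets: readonly number[]): number {
  if (buckets.length === 0) {
    throw new Error('at least one bucket is required');
  }
  let best = buckets[0];
  for (const b of buckets) {
    if (Math.abs(b - size) < Math.abs(best - size)) {
      best = b;
    }
  }
  return best;
}

// Build the per-size method name following the detect_<S> pattern described above.
function detectMethodName(longSide: number): string {
  return `detect_${snapToBucket(longSide, DETECT_BUCKETS)}`;
}
```

Because every input is mapped onto a fixed, precompiled shape, the runtime only ever sees a small, known set of method signatures.
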

The pipeline is extensible without forking: a 'custom' detectorKind accepts a TypeScript extractBoxes worklet (raw logits → quads, output shapes read from the PTE), and the recognizer's normalization/padding/decode are overridable per model.

Introduces a breaking change?

  • Yes
  • No

Type of change

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Documentation update (improves or adds clarity to existing documentation)
  • Other (chores, tests, code style improvements etc.)

Tested on

  • iOS
  • Android

Testing instructions

  1. OCR screen - pick PaddleOCR (XNNPACK/Vulkan) or EasyOCR; select an image from the gallery → Run OCR. Verify detected regions (green overlay) and the recognized text list with per-region confidence/latency.
  2. Toggle Vertical text and run a stacked/column image (e.g. signage, container codes) - confirm horizontal reads are unaffected and columns are read top-to-bottom.
  3. Document Pipeline screen - run a document photo; verify orientation correction, optional dewarp, layout regions, table HTML, and reading-order output.

Checklist

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation accordingly
  • My changes generate no new warnings

Additional notes

benITo47 added 19 commits June 29, 2026 23:46
Two-stage OCR (EasyOCR CRAFT+CRNN / PaddleOCR DBNet+SVTR) plus a document
pipeline, on top of rne-rewrite.

- One fused PTE per model with bucketed detect_<S>/recognize_<W> methods and
  snap-to-closest sizing; a single baked contract, with only the box decoder
  (detectorKind: 'craft' | 'dbnet') and the drop score per architecture.
- Document pipeline: layout via createObjectDetector, native dewarp/gridSample,
  SLANet_plus table-structure recognition, structure-guided table HTML.
- Vertical reading (additive, opt-in): page-level column grouping for stacked
  signage + char-level second CRAFT pass + joint-hconcat recognition; tall lines
  are no longer flipped flat, and vertical reads skip the drop-score gate.
- Native ops: extractTextBoxes (CRAFT + DBNet), warpQuad, ctcGreedyDecode, gridSample.
- Models hosted on Hugging Face (EasyOCR, PP-OCRv6, PP-DocLayoutV3, PaddleHelpers),
  downloaded + cached on device; demo screens consume them directly.
… off by default

- ocr_ops.cpp: quantise DBNet quad y-coordinates into fixed row bands before
  sorting. The previous `|dy| > 10` comparator was not a strict-weak ordering
  (intransitive), which aborts under libc++ hardening.
- document demo: default dewarp OFF. UVDoc dewarp only helps photographed,
  physically-warped pages; on flat images it distorts otherwise-clean text.
  Updated the screen copy accordingly.
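
To illustrate the comparator fix in the first item: with a `|dy| > 10` threshold, boxes at y = 0, 8 and 16 compare as "same row" pairwise (0 vs 8, 8 vs 16) while 0 vs 16 do not, so equivalence is not transitive. Quantising y into bands first gives every box a single integer row key. The sketch below is in TypeScript rather than the C++ of ocr_ops.cpp, and the `Box` shape, `ROW_BAND` value, and function name are assumptions made for illustration.

```typescript
// Simplified box: only the top-left corner matters for ordering in this sketch.
type Box = { x: number; y: number };

// Assumed band height in pixels; the value used in ocr_ops.cpp may differ.
const ROW_BAND = 10;

// Sort by quantised row first, then by x. Both keys are plain numbers,
// so the comparator is a valid strict weak ordering.
function sortByRowBand(boxes: readonly Box[]): Box[] {
  return [...boxes].sort((a, b) => {
    const rowA = Math.floor(a.y / ROW_BAND);
    const rowB = Math.floor(b.y / ROW_BAND);
    return rowA !== rowB ? rowA - rowB : a.x - b.x;
  });
}
```
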
…run orientation/dewarp

Verified on-device (Android emulator, PaddleOCR/XNNPACK):

- Vertical OCR (ocr.ts): stacked columns were detected/placed correctly but
  read as garbage (a vertical "ANTIQUES" → " 1 "). Root cause: the DBNet
  detector emits one box per text region, not per glyph, fusing stacked letters
  into a few tall boxes; recognizeGlyphStrip warped each multi-letter box into a
  single recognizer cell (squashed → garbage), and the char-level re-detect path
  doesn't split for DBNet. Fix: add splitTallQuad() and split every glyph box
  into ~square single-letter cells (by height/width) before strip assembly.
  Now reads "ANTIQUES"/"PARKING" at 91-93% (both the column and tall-single
  paths). [Bug 2]

- Bucketed-OCR memory (model.cpp + core/model.ts + ocr.ts + documentOCR.ts):
  each detect_<S>/recognize_<W> method's planned-memory arena was cached for the
  model's lifetime, so memory grew unbounded as image/box sizes varied (worst on
  CoreML, one compiled graph per method). Ported main's unload-after-use via the
  ET API: expose Model.unloadMethod() (Module::unload_method) and free the bucket
  arenas after each top-level run (RunOCROptions.release, default true; the
  document orchestrator frees once per page). Measured: a 640→1280 two-image run
  holds ~693 MB native heap without unload vs ~341 MB with it. [Bug 3]

- Document orientation/dewarp (documentOCR.ts + document demo): were baked at
  createDocumentOCR time, so useModel never recreated the model on toggle and the
  switches did nothing. Made them per-run options on runDocumentOCR(input,
  {orientation, dewarp}) (mirroring OCR's vertical), defaulting to the config
  flags; toggles now take effect with no reload. [Bug 4 / in-flight]
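
The tall-box split described in the Vertical OCR item can be sketched as follows. This simplified version works on axis-aligned rectangles rather than quads, and `Rect` and `splitTallRect` are hypothetical names; the PR's splitTallQuad() may differ in detail.

```typescript
type Rect = { x: number; y: number; width: number; height: number };

// Split a tall box into roughly square cells stacked top to bottom.
// The cell count is the height/width ratio rounded to the nearest integer (minimum 1).
function splitTallRect(box: Rect): Rect[] {
  if (box.width <= 0 || box.height <= 0) {
    return [box];
  }
  const cells = Math.max(1, Math.round(box.height / box.width));
  const cellHeight = box.height / cells;
  return Array.from({ length: cells }, (_, i) => ({
    x: box.x,
    y: box.y + i * cellHeight,
    width: box.width,
    height: cellHeight,
  }));
}
```

Under these assumptions, a 20×160 column yields eight 20×20 cells, while a wide 100×30 box is returned as a single cell.
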

Dewarp (Bug 1) needed no code change: on-device the gridSample [-1,1] backward-map
convention is correct (near-identity grid on flat pages, correctly flattens a
warped page); the mild flat-page distortion is UVDoc emitting a non-identity field
and is indistinguishable from a real warp by the grid alone, so default-OFF +
the per-run toggle is the right mitigation.
Reading order: add readingOrderIndices (column detection via x-coverage
sweep, within-column line grouping by vertical overlap, left-to-right
within a line, columns left-to-right). Apply to OCR detections and to each
document block's lines, replacing the detector's arbitrary / y-only order
so two-column pages, split titles, and label/value rows concatenate
correctly.

Dewarp guard: dewarpWorklet declines a degenerate warp (one that lacks
page boundaries and maps content off-canvas) by comparing sampled pixel
activity before/after; if the dewarped page keeps <50% of the source's
activity it returns the original, so dewarp can no longer collapse a page
to zero detections.
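
A rough sketch of the guard's before/after comparison follows. The commit does not say how "activity" is measured, so the metric here (summed absolute difference between sampled neighbouring grayscale pixels) is purely an assumption, as are the function names and the stride.

```typescript
// Assumed activity metric: sum of absolute differences between grayscale
// pixels sampled `stride` apart along each sampled row.
function sampledActivity(gray: Uint8Array, width: number, height: number, stride = 4): number {
  let sum = 0;
  for (let y = 0; y < height; y += stride) {
    for (let x = stride; x < width; x += stride) {
      sum += Math.abs(gray[y * width + x] - gray[y * width + x - stride]);
    }
  }
  return sum;
}

// Keep the dewarped page only if it retains at least `minRatio` of the source activity.
// A source with no activity gives no evidence either way, so the original is kept.
function keepDewarp(sourceActivity: number, dewarpedActivity: number, minRatio = 0.5): boolean {
  if (sourceActivity <= 0) {
    return false;
  }
  return dewarpedActivity / sourceActivity >= minRatio;
}
```
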
Log the raw orientation logits (per-class), the argmax, and the decoded
rotationCW and confidence from detectOrientationWorklet. Uses console.warn so
the output surfaces in native logs from the worklet thread.
Only apply page-rotation when the orientation classifier's softmax
confidence for its argmax class is >= 0.7 and the predicted angle is
non-zero, mirroring PaddleOCR's pipeline. Out-of-distribution inputs
(perspective photos, non-documents) produce low-confidence argmaxes that
spuriously flip the page; below threshold the page is treated as upright.
Genuine documents score >0.95; OOD frames can land ~0.74, so 0.85 leaves
margin to reject the spurious flips a 0.7 gate let through.
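
The confidence gate can be sketched as below. The class-to-angle mapping and the names `softmax` and `gatedRotation` are assumptions for illustration; the threshold is a parameter so the sketch covers both the 0.7 and 0.85 values mentioned above.

```typescript
// Numerically stable softmax over a small logit vector.
function softmax(logits: readonly number[]): number[] {
  const max = Math.max(...logits);
  const exps = logits.map((v) => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Assumed class order; the real classifier's mapping may differ.
const ANGLES_CW: readonly number[] = [0, 90, 180, 270];

// Return the rotation to apply, or 0 when the prediction is upright
// or the argmax confidence is below the threshold.
function gatedRotation(logits: readonly number[], minConfidence = 0.85): number {
  const probs = softmax(logits);
  let best = 0;
  for (let i = 1; i < probs.length; i++) {
    if (probs[i] > probs[best]) best = i;
  }
  const angle = ANGLES_CW[best] ?? 0;
  return probs[best] >= minConfidence && angle !== 0 ? angle : 0;
}
```
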
Factor a localize() helper that swaps a model spec's hosted modelPath for
its downloaded local path (undefined when the optional model is absent or
not yet downloaded). Replaces the nested conditional-spread localConfig
with a flat object, and aggregates progress/error over just the enabled
downloads. Behavior unchanged.
OCROptions gains recognizerNorm (alpha/beta), recognizerPadValue, and an
optional decode(logits, charset) -> {text, confidence}. Defaults preserve
the SVTR/CRNN contract ((x/255-0.5)/0.5, pad 128, greedy CTC), so existing
models are unchanged; a model with different normalization or a non-CTC
head (attention/AR) now slots in as pure config. decode runs on the
worklet thread (must be a worklet).
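
As a sketch of the default contract described above, the snippet below shows the (x/255-0.5)/0.5 normalization and a greedy CTC decode that returns {text, confidence}. Several details are assumptions: the input is taken as per-timestep probabilities in a plain `number[][]`, the blank symbol sits at index 0, class `i` maps to `charset[i - 1]`, and confidence is the mean probability over emitted characters. The real hook's tensor types and worklet annotation are not shown.

```typescript
// Default recognizer normalization from the text above: maps 0..255 to -1..1.
function normalizePixel(x: number): number {
  return (x / 255 - 0.5) / 0.5;
}

// Greedy CTC: take the argmax per timestep, collapse consecutive repeats, drop blanks.
function ctcGreedyDecode(
  probs: readonly number[][],
  charset: readonly string[],
  blank = 0
): { text: string; confidence: number } {
  let text = '';
  let confidenceSum = 0;
  let emitted = 0;
  let previous = blank;
  for (const step of probs) {
    let best = 0;
    for (let c = 1; c < step.length; c++) {
      if (step[c] > step[best]) best = c;
    }
    if (best !== blank && best !== previous) {
      text += charset[best - 1] ?? '';
      confidenceSum += step[best];
      emitted++;
    }
    previous = best;
  }
  return { text, confidence: emitted > 0 ? confidenceSum / emitted : 0 };
}
```

A non-CTC head would replace `ctcGreedyDecode` with its own function of the same return shape.
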
Drop the per-call console.warn of orientation logits added for the OOD
investigation; the confidence gate is the shipping fix.
…nputs

- Extract resizeFactors() (points.ts) so scalePoint and scaleBox derive the
  letterbox/stretch scale+offset once instead of each recomputing it (#6).
- boundingBoxOf / bboxOfQuad / boundingQuadOf return a zero box/quad for
  empty input instead of Infinity bounds (#11).
- orderQuad returns a copy unchanged when not given exactly 4 corners (#12).
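
The empty-input behavior in the second item can be sketched like this. The `Point` and `Box` shapes are assumptions; only the "zero box instead of Infinity bounds" rule comes from the commit message.

```typescript
type Point = { x: number; y: number };
type Box = { x: number; y: number; width: number; height: number };

// Axis-aligned bounding box of a point set.
// Empty input returns a zero box rather than letting Infinity/-Infinity leak out.
function boundingBoxOf(points: readonly Point[]): Box {
  if (points.length === 0) {
    return { x: 0, y: 0, width: 0, height: 0 };
  }
  let minX = Infinity;
  let minY = Infinity;
  let maxX = -Infinity;
  let maxY = -Infinity;
  for (const p of points) {
    minX = Math.min(minX, p.x);
    minY = Math.min(minY, p.y);
    maxX = Math.max(maxX, p.x);
    maxY = Math.max(maxY, p.y);
  }
  return { x: minX, y: minY, width: maxX - minX, height: maxY - minY };
}
```
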
detectQuads allocated 7 tensors per call (~30MB at the 960 bucket), freed
on return — wasteful on the vertical re-detect path that calls it per box.
Pre-allocate the channel-independent set per detect bucket at construction
(buildDetectorSets, mirroring buildRecognizerSets); detectQuads now only
allocates the source-resize tensor (the lone input-channel-dependent one).
Behavior unchanged; disposed alongside recSets.
The NMS suppression loop re-decoded box j via decodeToXyxy on every (i,j)
pair — O(N^2) decodes. Decode each candidate to xyxy+area once up front,
indexed by candidate position, and have both loops read the cached values.
Same result; decode work drops to O(N).
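
The decode-once pattern can be sketched as follows. The candidate encoding (center + size), the IoU threshold, and the helper bodies are assumptions for illustration; only the idea of caching each candidate's xyxy+area before the suppression loops comes from the commit message.

```typescript
type Candidate = { cx: number; cy: number; w: number; h: number; score: number };
type Xyxy = { x1: number; y1: number; x2: number; y2: number; area: number };

// Assumed decoding from center/size to corner form, with area precomputed.
function decodeToXyxy(c: Candidate): Xyxy {
  const x1 = c.cx - c.w / 2;
  const y1 = c.cy - c.h / 2;
  return { x1, y1, x2: x1 + c.w, y2: y1 + c.h, area: c.w * c.h };
}

function iou(a: Xyxy, b: Xyxy): number {
  const w = Math.max(0, Math.min(a.x2, b.x2) - Math.max(a.x1, b.x1));
  const h = Math.max(0, Math.min(a.y2, b.y2) - Math.max(a.y1, b.y1));
  const inter = w * h;
  const union = a.area + b.area - inter;
  return union > 0 ? inter / union : 0;
}

// Returns the indices of kept candidates, highest score first.
function nms(cands: readonly Candidate[], iouThreshold = 0.5): number[] {
  const decoded = cands.map(decodeToXyxy); // O(N) decodes, indexed by candidate position
  const order = cands.map((_, i) => i).sort((a, b) => cands[b].score - cands[a].score);
  const suppressed = new Array<boolean>(cands.length).fill(false);
  const keep: number[] = [];
  for (let oi = 0; oi < order.length; oi++) {
    const i = order[oi];
    if (suppressed[i]) continue;
    keep.push(i);
    for (let oj = oi + 1; oj < order.length; oj++) {
      const j = order[oj];
      // Both loops read the cached boxes; nothing is re-decoded per pair.
      if (!suppressed[j] && iou(decoded[i], decoded[j]) > iouThreshold) {
        suppressed[j] = true;
      }
    }
  }
  return keep;
}
```

The pairwise IoU comparisons remain quadratic in the worst case; only the decode step drops to linear.
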
detectorKind gains 'custom': the model's raw detect_<S> outputs (shapes
read from the PTE method metadata, allocated for you) are handed to an
extractBoxes(outputs, s) worklet that returns quads in detector space —
the pipeline maps them to image pixels and applies dropScore, exactly
like the built-in craft/dbnet decoders. Pairs with the recognizer decode
hook so a fully foreign architecture slots in as config. Built-in paths
unchanged (DetSet now holds a tOutputs list; tOutputs[0] is the heatmap).
extractBoxes must be a worklet.
…ity work

- points.ts: move scalePoint's JSDoc back onto scalePoint (the resizeFactors
  insertion had orphaned it).
- ocr.ts: update the baked-contract comments now that recognizer norm/pad/
  decode and the detector are per-model overridable; mark RECOGNIZER_* and the
  detectQuads scratch comments as defaults / cached.
- ocrHelpers.ts: rename the within-line sort helper cx -> xSum (it returns the
  edge sum, not the center; avoids clashing with the column-center cx).
- ImageViewport.tsx: boxes are in the displayed image's px, not 'original'.
Include <jsi/jsi.h> directly (65 jsi:: uses, previously transitive) and
<opencv2/core/check.hpp>.
Split ocr.ts (1059 lines) so the task file holds only the public API +
createOCR factory:

- Move the tensor-pipeline engine (detectQuads, recognizeQuad,
  recognizeGlyphStrip, readStackedColumn, readBoxVertical), the per-bucket
  builders, and their context/set types into a new internal ocrPipeline.ts
  (imported only by ocr.ts; not re-exported from the package index).
- Extract validateDetectorSchema / buildExtractOpts / disposeDetSets /
  disposeRecSets, removing the duplicated recognizer/detector dispose loops.
- Hoist inline helpers to module scope: pushDetection (ocr.ts), lerp + xSum
  (ocrHelpers.ts).
- Drop the unused DetectContext.format field.

ocr.ts 1059 -> 459 lines. Behavior-preserving; verified on-device (Android):
detector localization + horizontal/vertical recognition unchanged.
…o image_ops

ocr_ops.cpp held two things that aren't OCR-specific:

- The JSI option-readers (getNumberProp/getStringProp/getBoolProp/getBoolPropOr)
  are generic plumbing — promote them to utils.h so image_ops/ocr_ops share one
  copy instead of re-rolling the same hasProperty/isX pattern.
- warpQuad is a generic perspective-crop image op (getPerspectiveTransform +
  warpPerspective + pad/align), no OCR in the math. Move it to image_ops.cpp next
  to resize/cvtColor/gridSample; update headers + install.cpp wiring.

ocr_ops.cpp 835 -> 679 lines; it now holds only OCR detector/sequence
postprocessing (CRAFT grouping, DBNet contouring, CTC argmax). Verified with
check-cpp-warnings.sh (clang++ -fsyntax-only vs ExecuTorch/JSI/OpenCV): clean.
…GHT rename

- models.ts: the OCR family comment claimed the recognizer profile (norm, color,
  padding, CTC, confidence) is 'derived from detectorKind' — stale. detectorKind
  only selects the box decoder + default drop score; the rest is the shared baked
  contract, now overridable via recognizerNorm/recognizerPadValue/decode.
- ImageViewport: keep VIEW_HEIGHT as-is (drop the DEFAULT_VIEW_HEIGHT rename) — the
  per-instance override already flows through viewHeight, so the rename was churn.
@benITo47 benITo47 requested a review from barhanc June 30, 2026 15:53
@barhanc barhanc added the refactoring and feature labels Jun 30, 2026
@barhanc barhanc changed the title Rne rewrite ocr bucketed [RNE Rewrite] ocr bucketed Jun 30, 2026
@msluszniak

Member

First, please fix the PR description, since it is struck through, and fix the clang-tidy warnings.

@msluszniak msluszniak linked an issue Jun 30, 2026 that may be closed by this pull request
@msluszniak msluszniak changed the title [RNE Rewrite] ocr bucketed [RNE Rewrite] OCR bucketed Jun 30, 2026

@barhanc barhanc left a comment

Contributor


  • I only went over the library implementation, didn't test apps, as there are quite some changes to be made.
  • Why do we have both OCR and DocumentOCR? What are the use cases where one would want OCR without the additional processing, and if there are any, wouldn't it be better to make this configurable from a single useOCR hook?
  • Since it seems that OCR is quite a large addition I think we can add ocr/ directory to cv/ that will host all the required pre/post processing helpers.
  • Please fix the cspell warnings.
  • Please fix the PR description.
  • The added repo names on huggingface do not conform to snake-case convention.

Contributor


Changes to the core should first be introduced in a separate PR. Also let's wait with this one because we will be adding support for dynamic methods and as I understand this is required because of the bucketed approach.

Contributor


What is the reason this file changed? It looks to me these are purely stylistic changes.

Comment thread packages/react-native-executorch/cpp/extensions/cv/image_ops.h Outdated
Comment thread packages/react-native-executorch/cpp/extensions/cv/utils.h Outdated
Comment on lines +127 to +130
* Use this to bound native memory when many distinct methods are executed
* over a session — e.g. bucketed OCR, where each `detect_<S>`/`recognize_<W>`
* size that is ever run would otherwise stay resident for the model's
* lifetime.

Contributor


Please remove this comment.

Comment thread packages/react-native-executorch/src/index.ts Outdated
Comment thread packages/react-native-executorch/src/models.ts
Comment thread packages/react-native-executorch/src/extensions/cv/tasks/ocrHelpers.ts Outdated
Comment thread packages/react-native-executorch/src/extensions/cv/tasks/ocr.ts Outdated
@benITo47 benITo47 marked this pull request as draft July 1, 2026 12:32
@benITo47 benITo47 force-pushed the rne-rewrite-ocr-bucketed branch from 67d86a8 to 1177254 Compare July 2, 2026 11:09

Labels

feature, refactoring

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[RNE Rewrite] CV - add OCR pipeline implementation

3 participants