OPENNLP-1850: Layered Term model — Term, TermAnalyzer (2b/7) by krickert · Pull Request #1111 · apache/opennlp

krickert · 2026-06-23T15:18:43Z

Part 2b of the OPENNLP-1850 stack: the token-analysis layer, split out of the former tokenizer PR (#1104) on review request.

A Term is one token projected through the ordered Dimension stack (original, NFC, NFKC, whitespace, dash, case fold, accent fold, confusable fold, stem, lemma), keeping its source Span and every intermediate form. TermAnalyzer segments with the UAX #29 WordTokenizer (from 2a) and applies the configured dimension prefix. Restores Dimension's {@link Term}/{@link TermAnalyzer} javadoc now that those types exist.
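To make the layering concrete, here is a minimal sketch of projecting a token through an ordered prefix of layers while keeping every intermediate form. It is not the code in this PR: `LayeredTermSketch`, `Layer`, `SketchTerm` and `project` are hypothetical names, only five of the ten dimensions are modelled, and `toLowerCase(Locale.ROOT)` is a stand-in approximation for real case folding.

```java
import java.text.Normalizer;
import java.util.EnumMap;
import java.util.Locale;
import java.util.Map;

final class LayeredTermSketch {

    // Hypothetical subset of the dimension stack, declared in application order.
    enum Layer { ORIGINAL, NFC, NFKC, CASE_FOLD, ACCENT_FOLD }

    // One token: its source offsets plus the form produced at each applied layer.
    record SketchTerm(int start, int end, Map<Layer, String> forms) {
        String form(Layer layer) {
            return forms.get(layer); // null if the layer lies beyond the configured prefix
        }
    }

    // Applies every layer up to and including upTo; each layer consumes the previous output.
    static SketchTerm project(String token, int start, Layer upTo) {
        Map<Layer, String> forms = new EnumMap<>(Layer.class);
        String current = token;
        for (Layer layer : Layer.values()) {
            current = switch (layer) {
                case ORIGINAL -> current;
                case NFC -> Normalizer.normalize(current, Normalizer.Form.NFC);
                case NFKC -> Normalizer.normalize(current, Normalizer.Form.NFKC);
                // Assumption: lower-casing approximates case folding for this sketch.
                case CASE_FOLD -> current.toLowerCase(Locale.ROOT);
                // Decompose, then drop combining marks.
                case ACCENT_FOLD -> Normalizer.normalize(current, Normalizer.Form.NFD)
                        .replaceAll("\\p{M}+", "");
            };
            forms.put(layer, current);
            if (layer == upTo) {
                break;
            }
        }
        return new SketchTerm(start, start + token.length(), forms);
    }

    public static void main(String[] args) {
        SketchTerm term = project("Caf\u00e9", 0, Layer.ACCENT_FOLD);
        for (Layer layer : Layer.values()) {
            System.out.println(layer + " -> " + term.form(layer));
        }
    }
}
```

The point of the sketch is that a caller can ask for any intermediate form (for example the case-folded but still accented one) without re-running the pipeline, and that the span always refers to the original text.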

Base: OPENNLP-1850-2a-tokenizer (#1110). Stack: 1a → 1b → 2a → 2b (this) → 2c → DL → docs.

rzo1 · 2026-06-25T12:08:11Z

Builder.dashes() is plural while every other layer-enable method is singular (nfc, whitespace, caseFold, accentFold) and the enum constant is DASH. Suggest dash() for consistency.
Class javadoc @links NormalizationProfile#searchAnalyzer(), which is introduced only in #1112 (OPENNLP-1850: Per-language NormalizationProfile registry, 2c/7), so the link dangles until that lands.
analyze(CharSequence) on a lemmatize()-configured analyzer throws at Term construction (eager LEMMA, null posTag), which reads against the javadoc phrasing that LEMMA "is not available from them." Either clarify the doc or skip unsatisfiable eager layers. Currently untested.

The token analysis layer split out of the former tokenizer PR (#1104) on review request. A Term is one token projected through the ordered Dimension stack (original, NFC, NFKC, whitespace, dash, case fold, accent fold, confusable fold, stem, lemma), keeping its source Span and every intermediate form; TermAnalyzer segments with the UAX #29 WordTokenizer (from 2a) and applies the configured dimension prefix. Restores Dimension's {@link Term}/{@link TermAnalyzer} javadoc now that they exist. Builds on the tokenizer in 2a.

…ften forward-link (Term)

Rename TermAnalyzer.Builder.dashes() -> dash() for consistency with the singular layer-enable methods (nfc/whitespace/caseFold/accentFold) and the DASH enum. Clarify that analyze(CharSequence) fails loud when a lemmatizer is configured (no POS tags) and add a test for it. Soften the NormalizationProfile forward-link to {@code}.

krickert · 2026-06-25T18:18:07Z

@rzo1 All three addressed (tip a23a5135).

dashes() → dash(). Renamed both overloads on TermAnalyzer.Builder for consistency with the singular layer-enable methods and the DASH enum. (TextNormalizer.Builder.dashes() keeps the plural — there it sits among quotes()/digits()/bullets(), so plural is the consistent choice in that class.) No callers needed updating.

Dangling searchAnalyzer() link. Softened the class-javadoc {@link NormalizationProfile#searchAnalyzer()} to {@code NormalizationProfile.searchAnalyzer()}, so 2b carries no dangling link to a type that lands in #1112.

analyze(CharSequence) + eager LEMMA. Clarified the doc and kept the fail-loud throw rather than silently skipping the layer. analyze(CharSequence) has no POS tags, so a configured LEMMA layer genuinely can't be computed from that entry point; silently dropping it would hide a misconfiguration (a lemmatizer was configured but never runs). The javadoc now states it throws when a lemmatizer is configured and points to analyze(tokens, tags) for the lemma path. Added testAnalyzeCharSequenceFailsLoudlyWhenLemmaConfigured covering it.
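For readers following along, the fail-loud choice can be sketched as below. This is an illustration under stated assumptions, not TermAnalyzer itself: `FailLoudAnalyzerSketch`, the `BinaryOperator<String>` lemmatizer shape, the `String[]` parameter types, the whitespace split (standing in for the UAX #29 tokenizer) and the use of `IllegalStateException` are all hypothetical. The PR only states that the call throws and that analyze(tokens, tags) is the lemma path.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

final class FailLoudAnalyzerSketch {

    // Hypothetical lemmatizer shape: (token, posTag) -> lemma. Null means no LEMMA layer.
    private final BinaryOperator<String> lemmatizer;

    private FailLoudAnalyzerSketch(BinaryOperator<String> lemmatizer) {
        this.lemmatizer = lemmatizer;
    }

    static Builder builder() {
        return new Builder();
    }

    static final class Builder {
        private BinaryOperator<String> lemmatizer;

        Builder lemmatize(BinaryOperator<String> lemmatizer) {
            this.lemmatizer = lemmatizer;
            return this;
        }

        FailLoudAnalyzerSketch build() {
            return new FailLoudAnalyzerSketch(lemmatizer);
        }
    }

    // Text entry point: no POS tags exist here, so a configured lemma layer cannot run.
    List<String> analyze(CharSequence text) {
        if (lemmatizer != null) {
            // Fail loudly instead of silently dropping the layer.
            throw new IllegalStateException(
                    "lemmatizer configured but no POS tags; use analyze(tokens, tags)");
        }
        // Assumption: whitespace split stands in for real word segmentation.
        return List.of(text.toString().trim().split("\\s+"));
    }

    // Tagged entry point: the lemma layer can be computed per token.
    List<String> analyze(String[] tokens, String[] tags) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("tokens and tags differ in length");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            out.add(lemmatizer == null ? tokens[i] : lemmatizer.apply(tokens[i], tags[i]));
        }
        return out;
    }
}
```

With silent skipping, the first overload would return un-lemmatized terms and the misconfiguration would only surface later as poor matching; throwing at the entry point surfaces it on the first call.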

This was referenced Jun 23, 2026

OPENNLP-1850: Per-language NormalizationProfile registry (2c/7) #1112 (Open)

OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104 (Closed)

krickert force-pushed the OPENNLP-1850-2a-tokenizer branch from a450069 to dc02b9e · June 24, 2026 11:20

krickert force-pushed the OPENNLP-1850-2b-term branch from 57e2b58 to 82cb041 · June 24, 2026 11:20

krickert force-pushed the OPENNLP-1850-2a-tokenizer branch from dc02b9e to dd1906d · June 24, 2026 11:54

krickert force-pushed the OPENNLP-1850-2b-term branch from 82cb041 to e35e859 · June 24, 2026 11:54

krickert force-pushed the OPENNLP-1850-2a-tokenizer branch from dd1906d to 3fae8aa · June 25, 2026 08:26

krickert force-pushed the OPENNLP-1850-2b-term branch from e35e859 to 55dbeb4 · June 25, 2026 08:26

krickert mentioned this pull request Jun 25, 2026

OPENNLP-1850: UAX #29 word tokenizer — WordSegmenter, WordTokenizer, WordType (2a/7) #1110 (Open)

krickert marked this pull request as ready for review June 25, 2026 11:28

krickert added 2 commits June 25, 2026 13:16

krickert force-pushed the OPENNLP-1850-2a-tokenizer branch from 3fae8aa to f2d1d8c · June 25, 2026 17:25

krickert force-pushed the OPENNLP-1850-2b-term branch from 55dbeb4 to a23a513 · June 25, 2026 17:25

rzo1 requested review from atarora, jzonthemtn and mawiesne June 26, 2026 13:08

krickert wants to merge 2 commits into OPENNLP-1850-2a-tokenizer from OPENNLP-1850-2b-term

2 participants
