OPENNLP-1850: Layered Term model — Term, TermAnalyzer (2b/7)#1111

Open. krickert wants to merge 2 commits into OPENNLP-1850-2a-tokenizer from OPENNLP-1850-2b-term



@krickert

Contributor

Part 2b of the OPENNLP-1850 stack: the token-analysis layer, split out of the former tokenizer PR (#1104) on review request.

A Term is one token projected through the ordered Dimension stack (original, NFC, NFKC, whitespace, dash, case fold, accent fold, confusable fold, stem, lemma), keeping its source Span and every intermediate form. TermAnalyzer segments with the UAX #29 WordTokenizer (from 2a) and applies the configured dimension prefix. Restores Dimension's {@link Term}/{@link TermAnalyzer} javadoc now that those types exist.
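The layering described above can be illustrated with a minimal sketch. It is not the PR's code: `LayeredTermSketch`, `Layer`, and `project` are hypothetical names, the stack is reduced to five layers, and the source Span is left out. It only shows the idea of applying an ordered prefix of layers, each to the previous layer's output, while keeping every intermediate form.

```java
import java.text.Normalizer;
import java.util.EnumMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

final class LayeredTermSketch {

  /** Hypothetical, reduced layer stack; the Dimension enum in the PR lists more layers. */
  enum Layer {
    ORIGINAL(s -> s),
    NFC(s -> Normalizer.normalize(s, Normalizer.Form.NFC)),
    NFKC(s -> Normalizer.normalize(s, Normalizer.Form.NFKC)),
    CASE_FOLD(s -> s.toLowerCase(Locale.ROOT)),
    // Decompose, then drop combining marks (a common approximation of accent folding).
    ACCENT_FOLD(s -> Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", ""));

    final UnaryOperator<String> step;

    Layer(UnaryOperator<String> step) {
      this.step = step;
    }
  }

  /** Applies every layer up to and including {@code last}, each on the previous output. */
  static Map<Layer, String> project(String token, Layer last) {
    Map<Layer, String> forms = new EnumMap<>(Layer.class);
    String current = token;
    for (Layer layer : Layer.values()) {
      current = layer.step.apply(current);
      forms.put(layer, current); // keep the intermediate form for this layer
      if (layer == last) {
        break; // stop at the end of the configured prefix
      }
    }
    return forms;
  }

  public static void main(String[] args) {
    // Fullwidth "Caf" followed by a precomposed e-acute.
    String token = "\uFF23\uFF41\uFF46\u00E9";
    project(token, Layer.ACCENT_FOLD)
        .forEach((layer, form) -> System.out.println(layer + " -> " + form));
  }
}
```

In this sketch the fullwidth letters survive NFC and are only folded to ASCII at the NFKC layer, which is why keeping each intermediate form can matter to a caller that wants to match at a specific layer.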

Base: OPENNLP-1850-2a-tokenizer (#1110). Stack: 1a → 1b → 2a → 2b (this) → 2c → DL → docs.

@rzo1

rzo1 commented Jun 25, 2026

Contributor
  • Builder.dashes() is plural while every other layer-enable method is singular (nfc, whitespace, caseFold, accentFold) and the enum constant is DASH. Suggest dash() for consistency.
  • Class javadoc {@link}s NormalizationProfile#searchAnalyzer(), which is only introduced in #1112 (OPENNLP-1850: Per-language NormalizationProfile registry (2c/7)), so the link dangles until that lands.
  • analyze(CharSequence) on a lemmatize()-configured analyzer throws at Term construction (eager LEMMA, null posTag), which reads against the javadoc phrasing that LEMMA "is not available from them." Either clarify the doc or skip unsatisfiable eager layers. Currently untested.

krickert added 2 commits June 25, 2026 13:16
The token analysis layer split out of the former tokenizer PR (#1104) on review request. A Term is
one token projected through the ordered Dimension stack (original, NFC, NFKC, whitespace, dash, case
fold, accent fold, confusable fold, stem, lemma), keeping its source Span and every intermediate
form; TermAnalyzer segments with the UAX #29 WordTokenizer (from 2a) and applies the configured
dimension prefix. Restores Dimension's {@link Term}/{@link TermAnalyzer} javadoc now that they exist.
Builds on the tokenizer in 2a.
…ften forward-link (Term)

Rename TermAnalyzer.Builder.dashes() -> dash() for consistency with the singular layer-enable methods
(nfc/whitespace/caseFold/accentFold) and the DASH enum. Clarify that analyze(CharSequence) fails loud
when a lemmatizer is configured (no POS tags) and add a test for it. Soften the NormalizationProfile
forward-link to {@code}.
@krickert krickert force-pushed the OPENNLP-1850-2a-tokenizer branch from 3fae8aa to f2d1d8c Compare June 25, 2026 17:25
@krickert krickert force-pushed the OPENNLP-1850-2b-term branch from 55dbeb4 to a23a513 Compare June 25, 2026 17:25
@krickert

Contributor Author

@rzo1 All three addressed (tip a23a5135).

dashes() -> dash(). Renamed both overloads on TermAnalyzer.Builder for consistency with the singular layer-enable methods and the DASH enum. (TextNormalizer.Builder.dashes() keeps the plural: there it sits among quotes()/digits()/bullets(), so plural is the consistent choice in that class.) No callers needed updating.

Dangling searchAnalyzer() link. Softened the class-javadoc {@link NormalizationProfile#searchAnalyzer()} to {@code NormalizationProfile.searchAnalyzer()}, so 2b carries no dangling link to a type that lands in #1112.
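For readers less familiar with the two Javadoc tags, here is a minimal sketch of the difference. `AnalyzerDocSketch` and `renderedReference` are hypothetical stand-ins, not the PR's TermAnalyzer; only the tag usage mirrors the change described above.

```java
/**
 * Hypothetical stand-in for a class whose Javadoc mentions a type that does not exist yet.
 *
 * <p>Preconfigured instances are expected to come from
 * {@code NormalizationProfile.searchAnalyzer()} once that type is available.
 */
final class AnalyzerDocSketch {

  // The earlier form was an inline link tag targeting NormalizationProfile#searchAnalyzer().
  // A link tag asks the javadoc tool to resolve its target, which cannot succeed while the
  // type is missing from the source tree. A code tag is rendered as literal text in code
  // font and resolves nothing, so it is safe to use before the type lands.

  /** Returns the literal text that the code tag in the class comment renders. */
  static String renderedReference() {
    return "NormalizationProfile.searchAnalyzer()";
  }
}
```

Once the referenced type is on the same branch, the code tag can be switched back to a link tag without touching anything else.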

analyze(CharSequence) + eager LEMMA. Clarified the doc and kept the fail-loud throw rather than silently skipping the layer. analyze(CharSequence) has no POS tags, so a configured LEMMA layer genuinely can't be computed from that entry point; silently dropping it would hide a misconfiguration (a lemmatizer was configured but never runs). The javadoc now states it throws when a lemmatizer is configured and points to analyze(tokens, tags) for the lemma path. Added testAnalyzeCharSequenceFailsLoudlyWhenLemmaConfigured covering it.
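The fail-loud choice can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `FailLoudAnalyzerSketch` is a hypothetical class, the lemmatizer is modelled as a plain `(token, posTag) -> lemma` function, whitespace splitting stands in for the UAX #29 tokenizer, and `IllegalStateException` is an assumed exception type (the thread does not say which one the real code throws).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

final class FailLoudAnalyzerSketch {

  // Hypothetical lemmatizer shape: (token, posTag) -> lemma. Null means no lemma layer.
  private final BinaryOperator<String> lemmatizer;

  FailLoudAnalyzerSketch(BinaryOperator<String> lemmatizer) {
    this.lemmatizer = lemmatizer;
  }

  /** Text-only entry point: no POS tags exist here, so a configured lemmatizer cannot run. */
  List<String> analyze(CharSequence text) {
    if (lemmatizer != null) {
      // Fail loudly instead of silently dropping the configured layer.
      throw new IllegalStateException(
          "A lemmatizer is configured but analyze(CharSequence) has no POS tags; "
              + "use analyze(tokens, tags) instead");
    }
    // Whitespace split stands in for the real word tokenizer.
    String trimmed = text.toString().trim();
    return trimmed.isEmpty() ? List.of() : List.of(trimmed.split("\\s+"));
  }

  /** Tagged entry point: the only path on which the lemma layer can be computed. */
  List<String> analyze(String[] tokens, String[] tags) {
    if (tokens.length != tags.length) {
      throw new IllegalArgumentException("tokens and tags must have the same length");
    }
    List<String> out = new ArrayList<>();
    for (int i = 0; i < tokens.length; i++) {
      out.add(lemmatizer == null ? tokens[i] : lemmatizer.apply(tokens[i], tags[i]));
    }
    return out;
  }

  public static void main(String[] args) {
    // Toy lemmatizer: strips a trailing "s" from tokens tagged as verbs.
    FailLoudAnalyzerSketch withLemma = new FailLoudAnalyzerSketch(
        (tok, tag) -> tag.startsWith("V") && tok.endsWith("s")
            ? tok.substring(0, tok.length() - 1)
            : tok);
    System.out.println(withLemma.analyze(new String[] {"she", "runs"}, new String[] {"PRP", "VBZ"}));
    try {
      withLemma.analyze("she runs");
    } catch (IllegalStateException e) {
      System.out.println("failed loudly: " + e.getMessage());
    }
  }
}
```

The alternative raised in review, skipping the unsatisfiable layer, would make the text-only call succeed but return output that looks analyzed while the lemmatizer never ran, which is the misconfiguration the throw is meant to surface.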

@rzo1 rzo1 requested review from atarora, jzonthemtn and mawiesne June 26, 2026 13:08

2 participants