OPENNLP-1850: Layered Term model — Term, TermAnalyzer (2b/7)#1111
krickert wants to merge 2 commits into
The token analysis layer split out of the former tokenizer PR (#1104) on review request. A Term is one token projected through the ordered Dimension stack (original, NFC, NFKC, whitespace, dash, case fold, accent fold, confusable fold, stem, lemma), keeping its source Span and every intermediate form; TermAnalyzer segments with the UAX #29 WordTokenizer (from 2a) and applies the configured dimension prefix. Restores Dimension's {@link Term}/{@link TermAnalyzer} javadoc now that they exist. Builds on the tokenizer in 2a.
…ften forward-link (Term)
Rename TermAnalyzer.Builder.dashes() -> dash() for consistency with the singular layer-enable methods
(nfc/whitespace/caseFold/accentFold) and the DASH enum. Clarify that analyze(CharSequence) fails loud
when a lemmatizer is configured (no POS tags) and add a test for it. Soften the NormalizationProfile
forward-link to {@code}.
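
The "fails loud" behaviour described in this commit can be pictured with a small stand-alone sketch. It is not the code in this PR: the class name FailLoudSketch, the nested Analyzer and Lemmatizer types, the two-argument analyze overload, and the choice of IllegalStateException are all assumptions made for illustration, and the whitespace split only stands in for the real tokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FailLoudSketch {

    /** Hypothetical lemmatizer contract: it needs a POS tag for every token. */
    interface Lemmatizer {
        String lemma(String token, String posTag);
    }

    /** Hypothetical analyzer; NOT the TermAnalyzer from this PR. */
    static final class Analyzer {
        private final Lemmatizer lemmatizer; // null means no lemma layer is configured

        Analyzer(Lemmatizer lemmatizer) {
            this.lemmatizer = lemmatizer;
        }

        /** Raw-text entry point: there are no POS tags here, so a lemmatizer cannot run. */
        List<String> analyze(CharSequence text) {
            if (lemmatizer != null) {
                // Fail loudly instead of silently skipping the lemma layer.
                throw new IllegalStateException(
                        "lemmatizer configured but analyze(CharSequence) has no POS tags");
            }
            // Whitespace split is a stand-in for the real tokenizer (assumption).
            return Arrays.asList(text.toString().trim().split("\\s+"));
        }

        /** Tagged entry point: POS tags are supplied, so the lemma layer can run. */
        List<String> analyze(String[] tokens, String[] posTags) {
            if (tokens.length != posTags.length) {
                throw new IllegalArgumentException("tokens and posTags differ in length");
            }
            List<String> out = new ArrayList<>();
            for (int i = 0; i < tokens.length; i++) {
                out.add(lemmatizer == null ? tokens[i] : lemmatizer.lemma(tokens[i], posTags[i]));
            }
            return out;
        }
    }

    public static void main(String[] args) {
        Analyzer withLemma = new Analyzer((token, tag) -> token.toLowerCase());
        try {
            withLemma.analyze("Running fast");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The point of the design is that a misconfiguration surfaces at the call site rather than producing terms that quietly lack a lemma form.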
@rzo1 All three addressed (tip
Dangling
Part 2b of the OPENNLP-1850 stack: the token-analysis layer, split out of the former tokenizer PR (#1104) on review request.
A Term is one token projected through the ordered Dimension stack (original, NFC, NFKC, whitespace, dash, case fold, accent fold, confusable fold, stem, lemma), keeping its source Span and every intermediate form. TermAnalyzer segments with the UAX #29 WordTokenizer (from 2a) and applies the configured dimension prefix. Restores Dimension's {@link Term}/{@link TermAnalyzer} javadoc now that those types exist.

Base: OPENNLP-1850-2a-tokenizer (#1110). Stack: 1a → 1b → 2a → 2b (this) → 2c → DL → docs.
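
As a rough illustration of the layered idea, the sketch below projects one token through a prefix of an ordered layer list and keeps every intermediate form alongside the source offsets. It is a minimal sketch under stated assumptions, not the API added by this PR: LayeredTermSketch, Layer, TermSketch, and project are hypothetical names, only five of the ten dimensions are modelled, String.toLowerCase(Locale.ROOT) stands in for real case folding, and stripping combining marks after NFD stands in for accent folding.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class LayeredTermSketch {

    /** Hypothetical subset of the dimension stack, declared in application order. */
    enum Layer { ORIGINAL, NFC, NFKC, CASE_FOLD, ACCENT_FOLD }

    /** One token: source offsets plus the form produced at each applied layer. */
    static final class TermSketch {
        final int start;
        final int end;
        final List<String> forms; // forms.get(i) is the output of Layer.values()[i]

        TermSketch(int start, int end, List<String> forms) {
            this.start = start;
            this.end = end;
            this.forms = Collections.unmodifiableList(forms);
        }

        /** Form at a given layer; throws if that layer lies beyond the applied prefix. */
        String form(Layer layer) {
            return forms.get(layer.ordinal());
        }

        /** Form after the deepest applied layer. */
        String finalForm() {
            return forms.get(forms.size() - 1);
        }
    }

    /** Applies a single layer to the output of the previous one. */
    static String apply(Layer layer, String in) {
        switch (layer) {
            case NFC:
                return Normalizer.normalize(in, Normalizer.Form.NFC);
            case NFKC:
                return Normalizer.normalize(in, Normalizer.Form.NFKC);
            case CASE_FOLD:
                return in.toLowerCase(Locale.ROOT); // simplification of Unicode case folding
            case ACCENT_FOLD:
                // Decompose, then drop combining marks (simplified accent folding).
                return Normalizer.normalize(in, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
            default:
                return in; // ORIGINAL: the surface form as found in the text
        }
    }

    /** Projects text[start, end) through every layer up to and including 'deepest'. */
    static TermSketch project(String text, int start, int end, Layer deepest) {
        List<String> forms = new ArrayList<>();
        String current = text.substring(start, end);
        for (Layer layer : Layer.values()) {
            current = apply(layer, current);
            forms.add(current);
            if (layer == deepest) {
                break; // only the configured prefix of the stack is applied
            }
        }
        return new TermSketch(start, end, forms);
    }

    public static void main(String[] args) {
        // "Un Caf\u00e9": the token at offsets [3, 7) is "Caf\u00e9".
        TermSketch term = project("Un Caf\u00e9", 3, 7, Layer.ACCENT_FOLD);
        System.out.println(term.start + ".." + term.end + " final form: " + term.finalForm());
    }
}
```

Because each layer consumes the previous layer's output and every result is retained, a caller can match on whichever form suits the task while still mapping back to the original offsets.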