HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled [LUCENE-8183] #9231
Closed
Description
The HyphenationCompoundWordTokenFilter creates overlapping tokens even if onlyLongestMatch is enabled.
Example:
Dictionary: gesellschaft, schaft
Hyphenator: de_DR.xml // from Apache OFFO
onlyLongestMatch: true
text | gesellschaft | gesellschaft | schaft |
---|---|---|---|
raw_bytes | [67 65 73 65 6c 6c 73 63 68 61 66 74] | [67 65 73 65 6c 6c 73 63 68 61 66 74] | [73 63 68 61 66 74] |
start | 0 | 0 | 0 |
end | 12 | 12 | 12 |
positionLength | 1 | 1 | 1 |
type | word | word | word |
position | 1 | 1 | 1 |
IMHO this includes two unexpected tokens:
- the second 'gesellschaft', as it duplicates the original token
- the 'schaft', as it is a sub-token of 'gesellschaft', which is itself present in the dictionary
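To make the expectation concrete, here is a minimal sketch of the selection rule described above. It is not Lucene code and not the attached patch: the class `LongestMatchSketch` and the method `expectedTokens` are hypothetical names, and a plain substring search stands in for the hyphenation points the real filter uses. The assumption is that the original token is always emitted once, dictionary matches are taken longest first, and any match overlapping an already kept match is dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

/** Hypothetical illustration of the expected onlyLongestMatch behaviour; not Lucene code. */
class LongestMatchSketch {

    /** Returns the original token followed by the non-overlapping longest dictionary matches. */
    static List<String> expectedTokens(String token, Collection<String> dictionary) {
        // Collect every dictionary occurrence as a [start, end) span.
        // Assumption: substring search replaces the hyphenation-based splitting.
        List<int[]> spans = new ArrayList<>();
        for (String word : dictionary) {
            if (word.isEmpty()) {
                continue; // an empty entry would match at every offset
            }
            for (int i = token.indexOf(word); i >= 0; i = token.indexOf(word, i + 1)) {
                spans.add(new int[] {i, i + word.length()});
            }
        }
        // Longest spans first; ties are broken by start offset.
        spans.sort((a, b) -> {
            int byLength = Integer.compare(b[1] - b[0], a[1] - a[0]);
            return byLength != 0 ? byLength : Integer.compare(a[0], b[0]);
        });
        // Greedily keep a span only if it overlaps nothing kept so far.
        List<int[]> kept = new ArrayList<>();
        for (int[] span : spans) {
            boolean overlaps = false;
            for (int[] k : kept) {
                if (span[0] < k[1] && k[0] < span[1]) {
                    overlaps = true;
                    break;
                }
            }
            if (!overlaps) {
                kept.add(span);
            }
        }
        kept.sort((a, b) -> Integer.compare(a[0], b[0]));
        // Emit the original token once, then the kept sub-tokens.
        List<String> result = new ArrayList<>();
        result.add(token);
        for (int[] k : kept) {
            boolean wholeToken = k[0] == 0 && k[1] == token.length();
            if (!wholeToken) { // a whole-token match would only duplicate the original
                result.add(token.substring(k[0], k[1]));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [gesellschaft]: no duplicate and no 'schaft'.
        System.out.println(expectedTokens("gesellschaft", Arrays.asList("gesellschaft", "schaft")));
    }
}
```

Under these assumptions the example from this issue yields only the original 'gesellschaft', while a made-up input such as 'fussballpumpe' with the dictionary fussball, ball, pumpe would yield 'fussballpumpe', 'fussball', 'pumpe' and drop 'ball'.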
Migrated from LUCENE-8183 by Rupert Westenthaler, 1 vote, updated Jan 14 2021
Environment:
Configuration of the analyzer:
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.HyphenationCompoundWordTokenFilterFactory"
hyphenator="lang/hyph_de_DR.xml" encoding="iso-8859-1"
dictionary="lang/wordlist_de.txt"
onlyLongestMatch="true"/>
Attachments: LUCENE-8183_20180223_rwesten.diff, LUCENE-8183_20180227_rwesten.diff, lucene-8183.zip