
HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled [LUCENE-8183] #9231

Closed
@asfimport

Description

The HyphenationCompoundWordTokenFilter creates overlapping tokens even if onlyLongestMatch is enabled. 

Example:

Dictionary: gesellschaft, schaft
Hyphenator: de_DR.xml // from Apache OFFO
onlyLongestMatch: true

 

| text | gesellschaft | gesellschaft | schaft |
|---|---|---|---|
| raw_bytes | [67 65 73 65 6c 6c 73 63 68 61 66 74] | [67 65 73 65 6c 6c 73 63 68 61 66 74] | [73 63 68 61 66 74] |
| start | 0 | 0 | 0 |
| end | 12 | 12 | 12 |
| positionLength | 1 | 1 | 1 |
| type | word | word | word |
| position | 1 | 1 | 1 |

IMHO this output includes two unexpected tokens:

  1. the 2nd 'gesellschaft', as it duplicates the original token
  2. 'schaft', as it is a sub-token of 'gesellschaft', which is itself present in the dictionary
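
To make the expectation above concrete, here is a minimal, self-contained sketch of the filtering I would expect when onlyLongestMatch is enabled. It is not the Lucene implementation and does not use the Lucene API: the class `LongestMatchSketch` and the method `expectedSubwords` are hypothetical names, and candidates are found by plain substring search instead of hyphenation points, which is a simplifying assumption. The second input ("fussballpumpe") is an illustrative example of mine, not taken from the analysis output above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, NOT Lucene code: models the expected onlyLongestMatch result.
class LongestMatchSketch {

    /** A dictionary word found inside the token, with its character offsets. */
    static final class Match {
        final String word;
        final int start;
        final int end;

        Match(String word, int start, int end) {
            this.word = word;
            this.start = start;
            this.end = end;
        }

        /** True if the other match lies completely inside this one. */
        boolean covers(Match other) {
            return start <= other.start && other.end <= end;
        }
    }

    static List<String> expectedSubwords(String token, List<String> dictionary) {
        // Assumption: every substring occurrence is a candidate (no hyphenator involved).
        List<Match> candidates = new ArrayList<>();
        for (String word : dictionary) {
            if (word.isEmpty()) {
                continue;
            }
            for (int i = token.indexOf(word); i >= 0; i = token.indexOf(word, i + 1)) {
                candidates.add(new Match(word, i, i + word.length()));
            }
        }

        // Visit the longest candidates first so they win over their sub-tokens.
        candidates.sort(Comparator.comparingInt((Match m) -> m.end - m.start).reversed());

        List<Match> kept = new ArrayList<>();
        for (Match candidate : candidates) {
            boolean covered = false;
            for (Match longer : kept) {
                if (longer.covers(candidate)) {
                    covered = true; // sub-token of a longer match, e.g. 'schaft'
                    break;
                }
            }
            if (!covered) {
                kept.add(candidate);
            }
        }

        // A match spanning the whole token only duplicates the original token.
        kept.removeIf(m -> m.start == 0 && m.end == token.length());

        // Report the remaining subwords in order of appearance.
        kept.sort(Comparator.comparingInt((Match m) -> m.start));
        List<String> result = new ArrayList<>();
        for (Match m : kept) {
            result.add(m.word);
        }
        return result;
    }

    public static void main(String[] args) {
        // The case from this issue: no additional tokens expected.
        System.out.println(expectedSubwords("gesellschaft",
                Arrays.asList("gesellschaft", "schaft"))); // prints []
        // Illustrative case: 'ball' is dropped as a sub-token of 'fussball'.
        System.out.println(expectedSubwords("fussballpumpe",
                Arrays.asList("fussball", "ball", "pumpe"))); // prints [fussball, pumpe]
    }
}
```

Note the order of the two filter steps: the full-length match is removed only after it has suppressed its sub-tokens. Otherwise 'schaft' would survive once the duplicate 'gesellschaft' is gone.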

Migrated from LUCENE-8183 by Rupert Westenthaler, 1 vote, updated Jan 14 2021
Environment:

Configuration of the analyzer:

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.HyphenationCompoundWordTokenFilterFactory" 
        hyphenator="lang/hyph_de_DR.xml" encoding="iso-8859-1"
        dictionary="lang/wordlist_de.txt"
        onlyLongestMatch="true"/>

 

Attachments: LUCENE-8183_20180223_rwesten.diff, LUCENE-8183_20180227_rwesten.diff, lucene-8183.zip