Description
There is unconvenience when we use kuromoji_tokenizer
's search
mode and synonym_token_filter
toggether.
Problem Description
Now I have following analyzer setting,
{
"settings": {
"analysis": {
"tokenizer": {
"ja_tokenizer": {
"type": "kuromoji_tokenizer",
"mode": "search"
},
"filter": {
"my_synonym": {
"type": "synonym",
"synonyms": [
'焼きぎょうざ, 焼き餃子, 焼餃子'
]
}
},
"analyzer": {
"ja_analyzer": {
"type": "custom",
"tokenizer": "ja_tokenizer",
"filter": [
"my_synonym"
]
}
}
}
}
}
When I was going to create index from documents with above analyzer, following error is appeared.
/var/log/elasticearch/searcher.log
``` [2019-09-21T16:05:10,498][DEBUG][o.e.a.a.i.t.p.TransportPutIndexTemplateAction] [n01] failed to put template [products_template] java.lang.IllegalArgumentException: failed to build synonyms at org.elasticsearch.analysis.common.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:138) ~[?:?] at org.elasticsearch.analysis.common.SynonymTokenFilterFactory.getChainAwareTokenFilterFactory(SynonymTokenFilterFactory.java:90) ~[?:?] at org.elasticsearch.index.analysis.AnalyzerComponents.createComponents(AnalyzerComponents.java:84) ~[elasticsearch-7.3.2.jar:7.3.2] at org.elasticsearch.index.analysis.CustomAnalyzerProvider.create(CustomAnalyzerProvider.java:63) ~[elasticsearch-7.3.2.jar:7.3.2] at org.elasticsearch.index.analysis.CustomAnalyzerProvider.build(CustomAnalyzerProvider.java:50) ~[elasticsearch-7.3.2.jar:7.3.2] at org.elasticsearch.index.analysis.AnalysisRegistry.produceAnalyzer(AnalysisRegistry.java:584) ~[elasticsearch-7.3.2.jar:7.3.2] at org.elasticsearch.index.analysis.AnalysisRegistry.build(AnalysisRegistry.java:534) ~[elasticsearch-7.3.2.jar:7.3.2] at org.elasticsearch.index.analysis.AnalysisRegistry.build(AnalysisRegistry.java:216) ~[elasticsearch-7.3.2.jar:7.3.2] at org.elasticsearch.index.IndexService.(IndexService.java:180) ~[elasticsearch-7.3.2.jar:7.3.2] at org.elasticsearch.index.IndexModule.newIndexService(IndexModule.java:411) ~[elasticsearch-7.3.2.jar:7.3.2] at org.elasticsearch.indices.IndicesService.createIndexService(IndicesService.java:563) ~[elasticsearch-7.3.2.jar:7.3.2] at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:512) ~[elasticsearch-7.3.2.jar:7.3.2] at org.elasticsearch.cluster.metadata.MetaDataIndexTemplateService.validateAndAddTemplate(MetaDataIndexTemplateService.java:235) ~[elasticsearch-7.3.2.jar:7.3.2] at org.elasticsearch.cluster.metadata.MetaDataIndexTemplateService.access$300(MetaDataIndexTemplateService.java:65) ~[elasticsearch-7.3.2.jar:7.3.2] at org.elasticsearch.cluster.metadata.MetaDataIndexTemplateService$2.execute(MetaDataIndexTemplateService.java:176) ~[elasticsearch-7.3.2.jar:7.3.2] at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47) ~[elasticsearch-7.3.2.jar:7.3.2] at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:687) ~[elasticsearch-7.3.2.jar:7.3.2] at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:310) ~[elasticsearch-7.3.2.jar:7.3.2] at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:210) [elasticsearch-7.3.2.jar:7.3.2] at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:142) [elasticsearch-7.3.2.jar:7.3.2] at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.3.2.jar:7.3.2] at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.3.2.jar:7.3.2] at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [elasticsearch-7.3.2.jar:7.3.2] at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.3.2.jar:7.3.2] at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.3.2.jar:7.3.2] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.lang.Thread.run(Thread.java:835) [?:?] Caused by: java.text.ParseException: Invalid synonym rule at line 4 at org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:325) ~[lucene-analyzers-common-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:35:41] at org.elasticsearch.analysis.common.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57) ~[?:?] at org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114) ~[lucene-analyzers-common-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:35:41] at org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70) ~[lucene-analyzers-common-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:35:41] at org.elasticsearch.analysis.common.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:134) ~[?:?] ... 27 more Caused by: java.lang.IllegalArgumentException: term: 焼餃子 analyzed to a token (焼餃子) with position increment != 1 (got: 0) at org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:325) ~[lucene-analyzers-common-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:35:41] at org.elasticsearch.analysis.common.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57) ~[?:?] at org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114) ~[lucene-analyzers-common-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:35:41] at org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70) ~[lucene-analyzers-common-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:35:41] at org.elasticsearch.analysis.common.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:134) ~[?:?] ... 27 more ```I researched about this error and I understand following points.
1. kuromoji_tokenizer
's search
mode outputs these three tokens for 焼餃子
.
curl -XPOST 'http://localhost:9200/my_index/_analyze?pretty' -H "Content-type: application/json" -d '{"analyzer": "ja_analyzer", "text": "焼餃子"}'
{
"tokens" : [
{
"token" : "焼",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "焼餃子",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0,
"positionLength" : 2
},
{
"token" : "餃子",
"start_offset" : 1,
"end_offset" : 3,
"type" : "word",
"position" : 1
}
]
}
2. ESSolrSynonymParser
checks position incrementation.
ESSolrSynonymParser.java#L55
SynonymMap.java
In such case above tokenized words "焼" and "焼餃子" have same position number, so ESSolrSynonymParser
fails because of position incrementation checking.
So I could create index with updating synonym content.
from:
"synonyms": [
'焼きぎょうざ, 焼き餃子, 焼餃子'
]
to:
"synonyms": [
'焼 きぎ ょうざ, 焼き餃子, 焼 餃子'
]
We need to tokenize synonym words and put space between each tokens.
My Opinion
I think many Japanese user want to use kuromoji_tokenizer
's search
mode and synonym_token_filter
toggether.
So I'm glad if analyzer is improved not to fail with them.
Or at least, writing document about it in the Elasticssearch document.
Because at now, I couldn't find good description about this problem and I could understand the reason with only after reading source code.
Elasticsearch version (bin/elasticsearch --version
):
7.2
Plugins installed: []
JVM version (java -version
):
openjdk 11.0.4 2019-07-16
OpenJDK Runtime Environment (build 11.0.4+11-post-Ubuntu-1ubuntu218.04.3)
OpenJDK 64-Bit Server VM (build 11.0.4+11-post-Ubuntu-1ubuntu218.04.3, mixed mode, sharing)
OS version (uname -a
if on a Unix-like system):
Ubuntu 18.04.3 LTS