Description
Benchmarks on real data have steered me towards the `minimal_english` stemmer token filter, as other stemmers are generally too aggressive for ecommerce (e.g. "loafers" is stemmed to "loaf").
Good plural stemming is what is required, because most user searches are plural and yet product descriptions are singular (e.g. a search for "dresses" should match the product "red dress").
Good examples of plural stemming by this existing filter include:
| Search string | Good stemmed form |
|---|---|
| cases | case |
| shades | shade |
| bottles | bottle |
However, these terms fail to match because of bad stemming:
| Search string | Bad stemmed form |
|---|---|
| dresses | dresse |
| watches | watche |
| brushes | brushe |
| boxes | boxe |
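
To make the pattern easier to see outside a cluster, here is a minimal Python sketch of an "S-stemmer"-style rule set. It is my approximation of what `minimal_english` appears to do, not the filter's actual source, and the function name `minimal_english_stem` is hypothetical. The rules only ever strip a trailing "s" (or rewrite "ies" to "y"), which would explain why plurals formed with "es" after a sibilant are left with a dangling "e".

```python
def minimal_english_stem(term: str) -> str:
    """Approximate S-stemmer rules (assumed to resemble `minimal_english`)."""
    n = len(term)
    # Only terms of three or more characters ending in "s" are candidates.
    if n < 3 or not term.endswith("s"):
        return term
    prev = term[-2]
    # "...us" and "...ss" are left alone (e.g. "bus", "dress").
    if prev in ("u", "s"):
        return term
    if prev == "e":
        # "...ies" becomes "...y" unless preceded by "a" or "e".
        if n > 3 and term[-3] == "i" and term[-4] not in ("a", "e"):
            return term[:-3] + "y"
        # "...ies", "...aes", "...oes" and "...ees" are otherwise left alone.
        if term[-3] in ("i", "a", "o", "e"):
            return term
    # Everything else just loses the trailing "s".
    return term[:-1]


if __name__ == "__main__":
    for word in ["cases", "shades", "bottles", "dresses", "watches", "brushes", "boxes"]:
        print(word, "->", minimal_english_stem(word))
```

Under these assumed rules the indexed token for "red dress" stays "dress" while the query term becomes "dresse", so the two never meet.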
Example reproduction:
DELETE test

PUT test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "filter_english_minimal"
          ]
        }
      },
      "filter": {
        "filter_english_minimal": {
          "type": "stemmer",
          "name": "minimal_english"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}

POST test/_doc/1
{
  "name": "red dress"
}

# Does not match (search stems to "dresse")
GET test/_search
{
  "query": {
    "match": {
      "name": "dresses"
    }
  }
}
Solution
It would be good to fix these poor stemming cases, but any fix would need to take backwards compatibility into account.
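
As a rough illustration of the kind of rule that could help, here is a Python sketch that strips "es" after common sibilant endings before falling back to plain "s" removal. Everything in it is an assumption for discussion: the name `stem_plural`, the `SIBILANT_ES_ENDINGS` list, and the fallback rules are hypothetical and are not taken from the existing filter.

```python
# Hypothetical endings where the plural is formed with "es" rather than "s".
SIBILANT_ES_ENDINGS = ("sses", "ches", "shes", "xes", "zzes")


def stem_plural(term: str) -> str:
    """Sketch of plural stemming with an extra sibilant rule (assumed, untested on real data)."""
    # Proposed extra rule: "dresses" -> "dress", "boxes" -> "box".
    for ending in SIBILANT_ES_ENDINGS:
        if term.endswith(ending) and len(term) > len(ending):
            return term[:-2]
    # Fallback: conservative removal of a trailing "s".
    if len(term) < 3 or not term.endswith("s"):
        return term
    if term[-2] in ("u", "s"):
        return term
    return term[:-1]


if __name__ == "__main__":
    for word in ["dresses", "watches", "brushes", "boxes", "cases", "shades", "bottles"]:
        print(word, "->", stem_plural(word))
```

A rule like this is not free of false positives: "caches" would become "cach", for example. It also changes the tokens produced at index time, so documents indexed with the old behaviour would presumably need reindexing, which is why it may be safer to expose it as a new stemmer name rather than change the existing one.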