English-minimal analyzer has bad plural stemming #42892

Open
@markharwood

Description

Benchmarks on real data have steered me towards this token filter, because other stemmers are generally too aggressive for ecommerce (e.g. "loafers" is conflated with "loaf").
Good plural stemming is what is really needed here: most user searches are plural, yet product descriptions are singular (e.g. a search for "dresses" should match the product "red dress").

Good examples of plural stemming by this existing filter include:

| Search string | Good stemmed form |
| --- | --- |
| cases | case |
| shades | shade |
| bottles | bottle |

However, these terms fail to match because of bad stemming:

| Search string | Bad stemmed form |
| --- | --- |
| dresses | dresse |
| watches | watche |
| brushes | brushe |
| boxes | boxe |
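
The bad forms above are consistent with a simple "S-stemmer" rule set, which is how I understand the Lucene stemmer behind `minimal_english` to work. The following is a minimal Python sketch of those rules, written from memory and not verified against any specific Lucene version; the function name `s_stem` is made up for illustration. It shows that a word ending in "es" only loses its final "s", so sibilant plurals such as "dresses" keep a stray "e".

```python
def s_stem(word: str) -> str:
    """Sketch of S-stemmer style plural stripping (assumed rules, see lead-in)."""
    n = len(word)
    # Only words of 3+ characters ending in "s" are candidates.
    if n < 3 or word[-1] != "s":
        return word
    prev = word[-2]
    # "-us" and "-ss" endings are left alone ("bonus", "dress").
    if prev in ("u", "s"):
        return word
    if prev == "e":
        # "-ies" becomes "-y" unless preceded by "a" or "e" ("ponies" -> "pony").
        if n > 3 and word[-3] == "i" and word[-4] not in ("a", "e"):
            return word[:-3] + "y"
        # "-ies" (remaining cases), "-aes", "-oes", "-ees" are left unchanged.
        if word[-3] in ("i", "a", "o", "e"):
            return word
    # Everything else, including "-ses", "-hes", "-xes", just drops the "s".
    return word[:-1]


if __name__ == "__main__":
    for w in ["cases", "shades", "bottles", "dresses", "watches", "brushes", "boxes"]:
        print(w, "->", s_stem(w))
```

Under these assumed rules there is no way to tell "cases" (singular "case") from "boxes" (singular "box"), since both simply lose the trailing "s".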

Example reproduction:

DELETE test
PUT test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "filter_english_minimal"
          ]
        }
      },
      "filter": {
        "filter_english_minimal": {
          "type": "stemmer",
          "name": "minimal_english"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}


POST test/_doc/1
{
  "name": "red dress"
}

# Does not match (search stems to "dresse")
GET test/_search
{
  "query": {
    "match": {
      "name": "dresses"
    }
  }
}

Solution

It would be good to fix these cases of poor stemming, but any fix would obviously need to take backwards compatibility into account.
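
One possible direction, offered only as a sketch and not as a proposed patch, is an extra rule that strips "es" after sibilant endings before the normal plural rule runs. The suffix list and the name `strip_sibilant_plural` below are hypothetical choices for illustration, not part of any existing filter.

```python
# Hypothetical suffix list: plurals formed by adding "es" after a sibilant sound.
SIBILANT_PLURALS = ("sses", "ches", "shes", "xes", "zes")


def strip_sibilant_plural(word: str) -> str:
    """Drop the "es" from sibilant plurals; leave every other word unchanged."""
    if len(word) > 4 and word.endswith(SIBILANT_PLURALS):
        return word[:-2]
    return word


if __name__ == "__main__":
    for w in ["dresses", "watches", "brushes", "boxes", "cases"]:
        print(w, "->", strip_sibilant_plural(w))
```

A rule like this is not free of mistakes: "caches" would become "cach", for example. It would also change the terms produced for existing indices, which is the backwards compatibility concern noted above.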

Labels

:Search Relevance/Analysis (How text is split into tokens)
>bug
Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)
priority:normal (A label for assessing bug priority to be used by ES engineers)