predicate_token_filter: the token.getPosition() method returns the wrong value #47197

Closed
@pierremalletneo9

Elasticsearch version :
7.3.2 dockerized
image: docker.elastic.co/elasticsearch/elasticsearch:7.3.2

Plugins installed: none

JVM version (java -version):

openjdk version "12.0.2" 2019-07-16
OpenJDK Runtime Environment (build 12.0.2+10)
OpenJDK 64-Bit Server VM (build 12.0.2+10, mixed mode, sharing)

OS version (uname -a if on a Unix-like system):

Linux 7f94601adc38 4.20.7-042007-generic #201902061234 SMP Wed Feb 6 17:36:40 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
I'm using a predicate_token_filter to keep only the first X tokens of a stream. To do this, I use the following filter configuration:

{ "myPredicatefilter": { "type": "predicate_token_filter", "script": { "source": "token.getPosition() <= 1" } } }

But every time I use this analyzer, the positions of the tokens seem to increase, and after a few calls the filter no longer produces any tokens.

Steps to reproduce:

Complete index settings:

PUT issue-predicate-token-filter
{
  "settings": {
    "analysis": {
      "filter": {
        "myPredicatefilter": {
          "type": "predicate_token_filter",
          "script": {
            "source": "token.getPosition() <= 1"
          }
        }
      },
      "analyzer": {
        "myPredicateAnalyzer": {
          "filter": [
            "myPredicatefilter"
          ],
          "type": "custom",
          "tokenizer": "whitespace"
        }
      }
    }
  }
}

Analyze request:

POST issue-predicate-token-filter/_analyze
{
  "analyzer": "myPredicateAnalyzer",
  "text": "pain grillé"
}

First result:

{
  "tokens" : [
    {
      "token" : "pain",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "grillé",
      "start_offset" : 5,
      "end_offset" : 11,
      "type" : "word",
      "position" : 1
    }
  ]
}


Second result, and every call afterward:

{
  "tokens" : [ ]
}
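
Because the problem only shows up on repeated calls, it can help to script the same analyze request several times in a row. The following is a minimal sketch, not part of the original report: it assumes an unsecured node reachable at http://localhost:9200, and the helper names `build_analyze_request` and `token_counts` are made up for illustration.

```python
import json
import urllib.request

# Assumption: a local node without authentication; adjust for your setup.
BASE_URL = "http://localhost:9200"


def build_analyze_request(index, analyzer, text, explain=False):
    """Build a POST <index>/_analyze request without sending it."""
    body = {"analyzer": analyzer, "text": text}
    if explain:
        body["explain"] = True
    return urllib.request.Request(
        url=f"{BASE_URL}/{index}/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_counts(index, analyzer, text, calls=3):
    """Send the same analyze request several times; return tokens per call."""
    counts = []
    for _ in range(calls):
        request = build_analyze_request(index, analyzer, text)
        with urllib.request.urlopen(request) as response:
            result = json.load(response)
        counts.append(len(result["tokens"]))
    return counts
```

Going by the results reported above, calling `token_counts("issue-predicate-token-filter", "myPredicateAnalyzer", "pain grillé")` right after creating the index would be expected to show two tokens for the first call and none for the later ones on an affected node.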

After a _close / _open of the index, the analyzer works again, but only for one call. Also, if I set explain: true in the analyze request, the analyzer works without any problem.

POST issue-predicate-token-filter/_analyze
{
  "analyzer": "myPredicateAnalyzer",
  "text": "pain grillé",
  "explain": true
}

You can see the weird behavior by adding a Debug.explain call to the filter script:

PUT issue-predicate-token-filter-with-debug
{
  "settings": {
    "analysis": {
      "filter": {
        "myPredicatefilter": {
          "type": "predicate_token_filter",
          "script": {
            "source": "Debug.explain(token.getPosition())"
          }
        }
      },
      "analyzer": {
        "myPredicateAnalyzer": {
          "filter": [
            "myPredicatefilter"
          ],
          "type": "custom",
          "tokenizer": "whitespace"
        }
      }
    }
  }
}

POST issue-predicate-token-filter-with-debug/_analyze
{
  "analyzer": "myPredicateAnalyzer",
  "text": "pain grillé",
  "explain": false
}

You will see the token.getPosition() value increasing after each call.
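
The symptoms described here (positions that keep growing, empty output from the second call on, and correct output again after the index is reopened) would be consistent with a position counter that is not reset when the analyzer is reused. That root cause is an assumption, not something this report confirms. The toy model below is plain Python, not Elasticsearch code, and the class `PositionFilter` is hypothetical; it only illustrates how such leftover state would reproduce the observed results.

```python
# Toy model only: NOT Elasticsearch code. It assumes (unconfirmed) that the
# filter tracks positions by accumulating increments and is reused across calls.

class PositionFilter:
    """Keeps tokens whose tracked position is <= max_position."""

    def __init__(self, max_position, reset_between_streams):
        self.max_position = max_position
        self.reset_between_streams = reset_between_streams
        self.position = -1  # position before the first token

    def analyze(self, text):
        if self.reset_between_streams:
            self.position = -1  # start every stream from scratch
        kept = []
        for token in text.split():  # stands in for the whitespace tokenizer
            self.position += 1      # each token advances the position by 1
            if self.position <= self.max_position:
                kept.append((token, self.position))
        return kept


# Without a reset, the second call sees positions 2 and 3, so nothing passes.
buggy = PositionFilter(max_position=1, reset_between_streams=False)
first = buggy.analyze("pain grillé")
second = buggy.analyze("pain grillé")

# With a reset, every call behaves like the first one.
fixed = PositionFilter(max_position=1, reset_between_streams=True)
fixed_first = fixed.analyze("pain grillé")
fixed_second = fixed.analyze("pain grillé")
```

In this model, `first` holds both tokens at positions 0 and 1 and `second` is empty, which mirrors the first and second results shown in the reproduction steps, while the resetting variant returns the same tokens on every call.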
