CJK analyzer tokenization issues #34285

Closed
@Trey314159

Description

It makes sense to me to report these all together, but I can split these into separate bugs if that's better.

Elasticsearch version (curl -XGET 'localhost:9200'):

{
  "name" : "adOS8gy",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "GVS7gpVBQDGwtHl3xnJbLw",
  "version" : {
    "number" : "6.4.0",
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "595516e",
    "build_date" : "2018-08-17T23:18:47.308994Z",
    "build_snapshot" : false,
    "lucene_version" : "7.4.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

Plugins installed: [analysis-icu, analysis-nori]

JVM version (java -version):
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1~deb9u1-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)

OS version (uname -a if on a Unix-like system):
Linux vagrantes6 4.9.0-6-amd64 #1 SMP Debian 4.9.82-1+deb9u3 (2018-03-02) x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

I've uncovered a number of oddities in tokenization in the CJK analyzer. All examples are from Korean Wikipedia or Korean Wiktionary (including non-CJK examples). In rough order of importance:

A. Mixed-script tokens (Korean plus non-CJK characters, such as numbers or Latin letters) are treated as one long token rather than being broken up into bigrams. For example, 안녕은하철도999극장판2.1981년8월8일.일본개봉작1999년재더빙video판 is tokenized as a single token.

B. Middle dots (·, U+00B7) can be used as list separators in Korean. When they are, the text is not broken up into bigrams. For example, 경승지·산악·협곡·해협·곶·심연·폭포·호수·급류 is tokenized as one token. I'm not sure whether this is a special case of (A) or not.

Workaround: use a character filter to convert middle dots to spaces before the CJK analyzer runs; a sketch follows.
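
For reference, here is a minimal sketch of that workaround. The index name (cjk_middledot) and char filter name (middle_dot_to_space) are just illustrative. As far as I know the built-in cjk analyzer doesn't accept a char_filter parameter, so the sketch rebuilds its documented chain (standard tokenizer, then the cjk_width, lowercase, cjk_bigram, and stop token filters) as a custom analyzer:

curl -X PUT "localhost:9200/cjk_middledot?pretty" -H 'Content-Type: application/json' -d'
{
  "settings" : {
    "index": {
      "analysis": {
        "char_filter": {
          "middle_dot_to_space": {
            "type": "mapping",
            "mappings": [ "\\u00B7 => \\u0020" ]
          }
        },
        "analyzer": {
          "text": {
            "type": "custom",
            "tokenizer": "standard",
            "char_filter": [ "middle_dot_to_space" ],
            "filter": [ "cjk_width", "lowercase", "cjk_bigram", "stop" ]
          }
        }
      }
    }
  }
}
'

With this in place, the list example in (B) should come back as CJK bigrams (경승, 승지, 산악, …) rather than one long token. (The default stop filter uses English stopwords, which is close to, but not identical to, the cjk analyzer's built-in list.)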

C. The CJK analyzer eats encircled numbers (①②③), "dingbat" circled numbers (➀➁➂), parenthesized numbers (⑴⑵⑶), fractions (¼ ⅓ ⅜ ½ ⅔ ¾), superscript numbers (¹²³), and subscript numbers (₁₂₃). They just disappear.

Workaround: run the icu_normalizer character filter before the CJK analyzer to convert these to ASCII numbers; a sketch follows.
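
A sketch of that workaround (the index name cjk_icu is illustrative; the icu_normalizer char filter comes from the analysis-icu plugin listed above, and its default nfkc_cf normalization folds most of these characters to ordinary digits):

curl -X PUT "localhost:9200/cjk_icu?pretty" -H 'Content-Type: application/json' -d'
{
  "settings" : {
    "index": {
      "analysis": {
        "analyzer": {
          "text": {
            "type": "custom",
            "tokenizer": "standard",
            "char_filter": [ "icu_normalizer" ],
            "filter": [ "cjk_width", "lowercase", "cjk_bigram", "stop" ]
          }
        }
      }
    }
  }
}
'

Note that NFKC turns fractions like ¼ into a digit sequence with a fraction slash (1⁄4) rather than plain ASCII 1/4, so "ASCII numbers" is a slight approximation.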

D. Soft hyphens (U+00AD), zero-width non-joiners (U+200C), and left-to-right and right-to-left marks (U+200E and U+200F) are left in tokens. They should be stripped out. Examples: hyphen­ation (soft hyphen), بازی‌های (zero-width non-joiner), and הארץ‎ (left-to-right mark).

Workaround: use a character filter to strip these characters before the CJK analyzer runs; a sketch follows.
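
A sketch of that workaround using a pattern_replace char filter (the index and filter names are illustrative; the character class covers exactly the four characters above):

curl -X PUT "localhost:9200/cjk_strip?pretty" -H 'Content-Type: application/json' -d'
{
  "settings" : {
    "index": {
      "analysis": {
        "char_filter": {
          "strip_invisible": {
            "type": "pattern_replace",
            "pattern": "[\\u00AD\\u200C\\u200E\\u200F]",
            "replacement": ""
          }
        },
        "analyzer": {
          "text": {
            "type": "custom",
            "tokenizer": "standard",
            "char_filter": [ "strip_invisible" ],
            "filter": [ "cjk_width", "lowercase", "cjk_bigram", "stop" ]
          }
        }
      }
    }
  }
}
'

One caveat: deleting U+200C outright joins Persian morphemes (بازی‌های becomes بازیهای), which is presumably the desired behavior here.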

Steps to reproduce:

  1. Set up CJK analyzer:
curl -X PUT "localhost:9200/cjk?pretty" -H 'Content-Type: application/json' -d'
{
  "settings" : {
    "index": {
      "analysis": {
        "analyzer": {
          "text": {
            "type": "cjk"
          }
        }
      }
    }
  }
}
'
  2. Analyze example tokens:

A. Mixed Korean and non-CJK characters

curl -sk localhost:9200/cjk/_analyze?pretty -H 'Content-Type: application/json' -d '{"analyzer": "text", "text" : "안녕은하철도999극장판2.1981년8월8일.일본개봉작1999년재더빙video판"}'

{
  "tokens" : [
    {
      "token" : "안녕은하철도999극장판2.1981년8월8일.일본개봉작1999년재더빙video판",
      "start_offset" : 0,
      "end_offset" : 43,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

B. Middle dots as lists

curl -sk localhost:9200/cjk/_analyze?pretty -H 'Content-Type: application/json' -d '{"analyzer": "text", "text" : "경승지·산악·협곡·해협·곶·심연·폭포·호수·급류"}'

{
  "tokens" : [
    {
      "token" : "경승지·산악·협곡·해협·곶·심연·폭포·호수·급류",
      "start_offset" : 0,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

C. Unicode numerical characters disappear

curl -sk localhost:9200/cjk/_analyze?pretty -H 'Content-Type: application/json' -d '{"analyzer": "text", "text" : "① ② ③ ➀ ➁ ➂ ⑴ ⑵ ⑶ ¼ ⅓ ⅜ ½ ⅔ ¾ ¹ ² ³ ₁ ₂ ₃"}'

{
  "tokens" : [ ]
}

D. Soft hyphens, zero-width non-joiners, and left-to-right and right-to-left marks (note that these characters are usually invisible)

curl -sk localhost:9200/cjk/_analyze?pretty -H 'Content-Type: application/json' -d '{"analyzer": "text", "text" : "hyphen­ation"}'

{
  "tokens" : [
    {
      "token" : "hyphen­ation",
      "start_offset" : 0,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

curl -sk localhost:9200/cjk/_analyze?pretty -H 'Content-Type: application/json' -d '{"analyzer": "text", "text" : "بازی‌های"}'

{
  "tokens" : [
    {
      "token" : "بازی‌های",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

curl -sk localhost:9200/cjk/_analyze?pretty -H 'Content-Type: application/json' -d '{"analyzer": "text", "text" : "הארץ‎"}'

{
  "tokens" : [
    {
      "token" : "הארץ‎",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
