charset: incorrect encoding for latin1 character set #18955

Closed
@bb7133

Description

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

tidb> create table t(a varchar(10)) charset latin1;
Query OK, 0 rows affected (0.16 sec)

case 1:

tidb> insert into t values ('¥');
Query OK, 1 row affected (0.04 sec)
tidb> select hex(a) from t;

case 2:

tidb> insert into t values ('中');

2. What did you expect to see? (Required)

case 1:

tidb> select hex(a) from t;
+--------+
| hex(a) |
+--------+
| A5     |
+--------+
1 row in set (0.00 sec)

case 2:

mysql> insert into t values ('中');
ERROR 1366 (HY000): Incorrect string value: '\xE4\xB8\xAD' for column 'a' at row 1
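
The rejection expected in case 2 can be illustrated outside the database. The following is a minimal Python sketch, not MySQL or TiDB code: the helper name `check_latin1` is hypothetical, and it only imitates the format of the error message by printing the UTF-8 bytes of the rejected value. It assumes Python's `latin-1` codec (ISO 8859-1) as a stand-in for the column character set.

```python
def check_latin1(value: str, column: str, row: int) -> bytes:
    """Return the latin1 bytes of `value`, or raise if it cannot be represented.

    Hypothetical helper: it imitates the wording of MySQL error 1366,
    but it is not how MySQL or TiDB implement the check.
    """
    try:
        return value.encode("latin-1")
    except UnicodeEncodeError:
        # Show the offending value as its UTF-8 bytes, e.g. \xE4\xB8\xAD.
        raw = value.encode("utf-8")
        shown = "".join(f"\\x{b:02X}" for b in raw)
        raise ValueError(
            f"Incorrect string value: '{shown}' for column '{column}' at row {row}"
        ) from None


YEN = "\u00a5"     # the character from case 1
ZHONG = "\u4e2d"   # the character from case 2

print(check_latin1(YEN, "a", 1))  # representable in latin1
try:
    check_latin1(ZHONG, "a", 1)
except ValueError as err:
    print(err)  # not representable in latin1
```

A character outside the single-byte range has no latin1 encoding at all, so a strict check has to reject it rather than store its UTF-8 bytes.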

3. What did you see instead (Required)

case 1:

tidb> select hex(a) from t;
+--------+
| hex(a) |
+--------+
| C2A5   |
+--------+
1 row in set (0.00 sec)

The encoding of ¥ in latin1 should be A5.
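
Both hex values for case 1 can be reproduced with a short Python sketch. The function name `stored_hex` is an assumption made for illustration, and Python's `latin-1` codec is used as an approximation of the latin1 character set (the two agree for ¥).

```python
def stored_hex(text: str, encoding: str) -> str:
    """Return the uppercase hex of `text` in `encoding`, similar to SQL HEX()."""
    return text.encode(encoding).hex().upper()


YEN = "\u00a5"  # the yen sign used in case 1

print(stored_hex(YEN, "latin-1"))  # A5, the expected result
print(stored_hex(YEN, "utf-8"))    # C2A5, what TiDB returned
```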

case 2:

tidb> insert into t values ('中');
Query OK, 1 row affected (0.01 sec)

4. Affected version (Required)

All versions of TiDB

5. Root Cause Analysis

In TiDB, we treat latin1 as a subset of utf8/utf8mb4 and encode its characters as UTF-8, just as we do for ascii.

But it is NOT: latin1 is a single-byte character set:

  1. It supports at most 256 characters.
  2. For characters with code points in 128-255, the encoding differs from UTF-8.

More details can be found here: https://en.wikipedia.org/wiki/ISO/IEC_8859-1
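
The difference between the two halves of the byte range can be checked directly. The sketch below is illustrative only and assumes Python's `latin-1` codec (ISO 8859-1); MySQL's latin1 is based on cp1252 and differs from it for a few code points, but it is likewise single-byte. The function name `compare_encodings` is hypothetical.

```python
def compare_encodings():
    """Split code points 0-255 by whether latin1 and UTF-8 give the same bytes."""
    same, different = [], []
    for cp in range(256):
        ch = chr(cp)
        if ch.encode("latin-1") == ch.encode("utf-8"):
            same.append(cp)
        else:
            different.append(cp)
    return same, different


same, different = compare_encodings()
print(len(same), min(same), max(same))                 # 128 0 127
print(len(different), min(different), max(different))  # 128 128 255
# Every code point in the upper half takes two bytes in UTF-8.
print({len(chr(cp).encode("utf-8")) for cp in different})  # {2}
```

Only the ascii range (0-127) is shared between the two encodings, which is why treating ascii as a subset of UTF-8 works but the same shortcut for latin1 does not.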
