charset: incorrect encoding for latin1 character set #18955
Closed
opened on Aug 3, 2020
Bug Report
Please answer these questions before submitting your issue. Thanks!
1. Minimal reproduce step (Required)
tidb> create table t(a varchar(10));
Query OK, 0 rows affected (0.16 sec)
case 1:
tidb> insert into t values ('¥');
Query OK, 1 row affected (0.04 sec)
tidb> select hex(a) from t;
case 2:
tidb> insert into t values ('中');
2. What did you expect to see? (Required)
case 1:
tidb> select hex(a) from t;
+--------+
| hex(a) |
+--------+
| A5 |
+--------+
1 row in set (0.00 sec)
case 2:
mysql> insert into t values ('中');
ERROR 1366 (HY000): Incorrect string value: '\xE4\xB8\xAD' for column 'a' at row 1
3. What did you see instead (Required)
case 1:
tidb> select hex(a) from t;
+--------+
| hex(a) |
+--------+
| C2A5 |
+--------+
1 row in set (0.00 sec)
The encoding of ¥ in latin1 should be A5.
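The byte values quoted above can be checked outside the database. The following is a minimal sketch that uses Python's built-in codecs and assumes that Python's "latin-1" codec (ISO/IEC 8859-1) is an acceptable stand-in for the latin1 character set discussed here. The variable names are illustrative only.

```python
# Compare the bytes produced for the two characters used in this report.
yen = "¥"     # U+00A5
zhong = "中"  # U+4E2D

latin1_yen = yen.encode("latin-1").hex().upper()
utf8_yen = yen.encode("utf-8").hex().upper()
utf8_zhong = zhong.encode("utf-8").hex().upper()

print(latin1_yen)  # A5, the expected result of hex(a)
print(utf8_yen)    # C2A5, the value TiDB actually stored
print(utf8_zhong)  # E4B8AD, the bytes quoted in the MySQL error message

# The second character has no single-byte representation at all.
try:
    zhong.encode("latin-1")
    zhong_fits_latin1 = True
except UnicodeEncodeError:
    zhong_fits_latin1 = False

print(zhong_fits_latin1)  # False
```

C2A5 is simply the UTF-8 form of the same code point, which is consistent with the column being stored as UTF-8 rather than latin1.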
case 2:
tidb> insert into t values ('中');
Query OK, 1 row affected (0.01 sec)
4. Affected version (Required)
All versions of TiDB
5. Root Cause Analysis
In TiDB, we treat latin1 as a subset of utf8/utf8mb4 and encode its characters as UTF-8, just as we do for ascii.
But it is NOT: latin1 is a single-byte character set:
- It supports at most 256 characters, because every character occupies exactly one byte.
- For characters with code points in 128-255, the encoding differs from UTF-8: latin1 uses one byte, while UTF-8 uses two.
More details can be found here: https://en.wikipedia.org/wiki/ISO/IEC_8859-1
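
The two points above can be illustrated with a small strict encoder. This is only a sketch of the idea, not TiDB's implementation: the helper name encode_latin1_strict is hypothetical, and it follows ISO/IEC 8859-1 as described on the linked page, where each code point from 0 to 255 maps to the byte with the same value. MySQL's own latin1 may treat some bytes in the 0x80-0x9F range differently, which this sketch ignores.

```python
def encode_latin1_strict(text: str) -> bytes:
    """Hypothetical helper: encode text as ISO/IEC 8859-1, one byte per character.

    Raises ValueError for any character outside the single-byte range,
    mirroring the "Incorrect string value" rejection expected in case 2.
    """
    out = bytearray()
    for index, ch in enumerate(text):
        code_point = ord(ch)
        if code_point > 0xFF:
            raise ValueError(
                f"character {ch!r} at index {index} is not representable in latin1"
            )
        out.append(code_point)
    return bytes(out)


# Code points 0-127: latin1 and UTF-8 produce identical bytes (the ascii case).
ascii_matches = encode_latin1_strict("abc") == "abc".encode("utf-8")

# Code points 128-255: latin1 needs one byte, UTF-8 needs two.
high_range_utf8_lengths = {len(chr(cp).encode("utf-8")) for cp in range(128, 256)}
high_range_latin1_lengths = {len(encode_latin1_strict(chr(cp))) for cp in range(128, 256)}

# Anything above 255 must be rejected rather than stored.
try:
    encode_latin1_strict("中")
    rejected = False
except ValueError:
    rejected = True

print(ascii_matches, high_range_utf8_lengths, high_range_latin1_lengths, rejected)
```

Treating latin1 like ascii is therefore safe only for the lower half of the range. The upper half needs a real conversion step, and characters beyond it need an error.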