title | summary | category | aliases | |||
---|---|---|---|---|---|---|
Character Set and Collation |
Learn about the supported character sets and collations in TiDB. |
reference |
|
A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set.
Currently, TiDB supports the following character sets:
{{< copyable "sql" >}}
SHOW CHARACTER SET;
+---------|---------------|-------------------|--------+
| Charset | Description | Default collation | Maxlen |
+---------|---------------|-------------------|--------+
| utf8 | UTF-8 Unicode | utf8_bin | 3 |
| utf8mb4 | UTF-8 Unicode | utf8mb4_bin | 4 |
| ascii | US ASCII | ascii_bin | 1 |
| latin1 | Latin1 | latin1_bin | 1 |
| binary | binary | binary | 1 |
+---------|---------------|-------------------|--------+
5 rows in set (0.00 sec)
Note:
Each character set might correspond to multiple collations, but by default each character set corresponds to only one collation.
You can use the following statement to view the collations (under the new framework for collations) that corresponds to the character set.
{{< copyable "sql" >}}
SHOW COLLATION WHERE Charset = 'utf8mb4';
+--------------------+---------+------+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+--------------------+---------+------+---------+----------+---------+
| utf8mb4_bin | utf8mb4 | 46 | Yes | Yes | 1 |
| utf8mb4_general_ci | utf8mb4 | 45 | | Yes | 1 |
+--------------------+---------+------+---------+----------+---------+
2 rows in set (0.00 sec)
Not supported yet.
Each database has a character set and a collation. You can use the following statements to specify the database character set and collation:
CREATE DATABASE db_name
[[DEFAULT] CHARACTER SET charset_name]
[[DEFAULT] COLLATE collation_name]
ALTER DATABASE db_name
[[DEFAULT] CHARACTER SET charset_name]
[[DEFAULT] COLLATE collation_name]
DATABASE
can be replaced with SCHEMA
here.
Different databases can use different character sets and collations. Use the character_set_database
and collation_database
to see the character set and collation of the current database:
{{< copyable "sql" >}}
create schema test1 character set utf8mb4 COLLATE uft8mb4_general_ci;
Query OK, 0 rows affected (0.09 sec)
{{< copyable "sql" >}}
use test1;
Database changed
{{< copyable "sql" >}}
SELECT @@character_set_database, @@collation_database;
+--------------------------|----------------------+
| @@character_set_database | @@collation_database |
+--------------------------|----------------------+
| utf8mb4 | uft8mb4_general_ci |
+--------------------------|----------------------+
1 row in set (0.00 sec)
{{< copyable "sql" >}}
create schema test2 character set latin1 COLLATE latin1_bin;
Query OK, 0 rows affected (0.09 sec)
{{< copyable "sql" >}}
use test2;
Database changed
{{< copyable "sql" >}}
SELECT @@character_set_database, @@collation_database;
+--------------------------|----------------------+
| @@character_set_database | @@collation_database |
+--------------------------|----------------------+
| latin1 | latin1_bin |
+--------------------------|----------------------+
1 row in set (0.00 sec)
You can also see the two values in INFORMATION_SCHEMA
:
{{< copyable "sql" >}}
SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME = 'db_name';
You can use the following statement to specify the character set and collation for tables:
CREATE TABLE tbl_name (column_list)
[[DEFAULT] CHARACTER SET charset_name]
[COLLATE collation_name]]
ALTER TABLE tbl_name
[[DEFAULT] CHARACTER SET charset_name]
[COLLATE collation_name]
For example:
{{< copyable "sql" >}}
CREATE TABLE t1(a int) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
Query OK, 0 rows affected (0.08 sec)
If the table character set and collation are not specified, the database character set and collation are used as their default values.
You can use the following statement to specify the character set and collation for columns:
col_name {CHAR | VARCHAR | TEXT} (col_length)
[CHARACTER SET charset_name]
[COLLATE collation_name]
col_name {ENUM | SET} (val_list)
[CHARACTER SET charset_name]
[COLLATE collation_name]
If the column character set and collation are not specified, the table character set and collation are used as their default values.
Each string corresponds to a character set and a collation. When you use a string, this option is available:
{{< copyable "sql" >}}
[_charset_name]'string' [COLLATE collation_name]
Example:
{{< copyable "sql" >}}
SELECT 'string';
SELECT _utf8mb4'string';
SELECT _utf8mb4'string' COLLATE utf8mb4_general_ci;
Rules:
- Rule 1: If you specify
CHARACTER SET charset_name
andCOLLATE collation_name
, then thecharset_name
character set and thecollation_name
collation are used directly. - Rule 2: If you specify
CHARACTER SET charset_name
but do not specifyCOLLATE collation_name
, thecharset_name
character set and the default collation ofcharset_name
are used. - Rule 3: If you specify neither
CHARACTER SET charset_name
norCOLLATE collation_name
, the character set and collation given by the system variablescharacter_set_connection
andcollation_connection
are used.
-
The server character set and collation are the values of the
character_set_server
andcollation_server
system variables. -
The character set and collation of the default database are the values of the
character_set_database
andcollation_database
system variables.
You can use character_set_connection
and collation_connection
to specify the character set and collation for each connection. The character_set_client
variable is to set the client character set.
Before returning the result, the character_set_results
system variable indicates the character set in which the server returns query results to the client, including the metadata of the result.
You can use the following statement to set the character set and collation that is related to the client:
-
SET NAMES 'charset_name' [COLLATE 'collation_name']
SET NAMES
indicates what character set the client will use to send SQL statements to the server.SET NAMES utf8mb4
indicates that all the requests from the client use utf8mb4, as well as the results from the server.The
SET NAMES 'charset_name'
statement is equivalent to the following statement combination:SET character_set_client = charset_name; SET character_set_results = charset_name; SET character_set_connection = charset_name;
COLLATE
is optional, if absent, the default collation of thecharset_name
is used. -
SET CHARACTER SET 'charset_name'
Similar to
SET NAMES
, theSET NAMES 'charset_name'
statement is equivalent to the following statement combination:SET character_set_client = charset_name; SET character_set_results = charset_name; SET collation_connection = @@collation_database;
String > Column > Table > Database > Server > Cluster
- Rule 1: If you specify
CHARACTER SET charset_name
andCOLLATE collation_name
, then thecharset_name
character set and thecollation_name
collation are used directly. - Rule 2: If you specify
CHARACTER SET charset_name
and do not specifyCOLLATE collation_name
, then thecharset_name
character set and the default collation ofcharset_name
are used. - Rule 3: If you specify neither
CHARACTER SET charset_name
norCOLLATE collation_name
, the character set and collation with higher optimization levels are used.
If the specified character set is utf8
or utf8mb4
, TiDB only supports the valid utf8
characters. For invalid characters, TiDB reports the incorrect utf8 value
error. This validity check of characters in TiDB is compatible with MySQL 8.0 but incompatible with MySQL 5.7 or earlier versions.
To disable this error reporting, use set @@tidb_skip_utf8_check=1;
to skip the character check.
The syntax support and semantic support for the collation are influenced by the new_collations_enabled_on_first_bootstrap
configuration item. The syntax support and semantic support are different. The former indicates that TiDB can parse and set collations. The latter indicates that TiDB can correctly use collations when comparing strings.
Before v4.0, TiDB provides only the old framework for collations. In this framework, TiDB supports syntactically parsing most of the MySQL collations but semantically takes all collations as binary collations.
Since v4.0, TiDB supports a new framework for collations. In this framework, TiDB semantically parses different collations and strictly follows the collations when comparing strings.
Before v4.0, you can specify most of the MySQL collations in TiDB, and these collations are processed according to the default collations, which means that the byte order determines the character order. Different from MySQL, TiDB deletes the space at the end of the character according to the PADDING
attribute of the collation before comparing characters, which causes the following behavior differences:
{{< copyable "sql" >}}
create table t(a varchar(20) charset utf8mb4 collate utf8mb4_general_ci primary key);
Query OK, 0 rows affected
insert into t values ('A');
Query OK, 1 row affected
insert into t values ('a');
Query OK, 1 row affected # In MySQL, because utf8mb4_general_ci is case-insensitive, the `Duplicate entry 'a'` error is reported.
insert into t1 values ('a ');
Query OK, 1 row affected # In MySQL, because comparison is performed after the spaces are filled in, the `Duplicate entry 'a '` error is returned.
In TiDB 4.0, a complete framework for collations is introduced. This new framework supports semantically parsing collations and introduces the new_collations_enabled_on_first_bootstrap
configuration item to decide whether to enable the new framework when a cluster is first initialized. If you initialize the cluster after the configuration item is enabled, you can check whether the new collation is enabled through the new_collation_enabled
variable in the mysql
.tidb
table:
{{< copyable "sql" >}}
select VARIABLE_VALUE from mysql.tidb where VARIABLE_NAME='new_collation_enabled';
+----------------+
| VARIABLE_VALUE |
+----------------+
| True |
+----------------+
1 row in set (0.00 sec)
Under the new framework, TiDB support the utf8_general_ci
and utf8mb4_general_ci
collations which are compatible with MySQL.
When utf8_general_ci
or utf8mb4_general_ci
is used, the string comparison is case-insensitive and accent-insensitive. At the same time, TiDB also corrects the collation's PADDING
behavior:
{{< copyable "sql" >}}
create table t(a varchar(20) charset utf8mb4 collate utf8mb4_general_ci primary key);
Query OK, 0 rows affected (0.00 sec)
insert into t values ('A');
Query OK, 1 row affected (0.00 sec)
insert into t values ('a');
ERROR 1062 (23000): Duplicate entry 'a' for key 'PRIMARY'
insert into t values ('a ');
ERROR 1062 (23000): Duplicate entry 'a ' for key 'PRIMARY'
Note:
The implementation of padding in TiDB is different from that in MySQL. In MySQL, padding is implemented by filling in spaces. In TiDB, padding is implemented by cutting out the spaces at the end. The two approaches are the same in most cases. The only exception is when the end of the string contains characters that are less than spaces (0x20). For example, the result of
'a' < 'a\t'
in TiDB is1
, but in MySQL,'a' < 'a\t'
is equivalent to'a ' < 'a\t'
, and the result is0
.
If an expression involves multiple clauses of different collations, you need to infer the collation used in the calculation. The rules are as follows:
- The coercibility value of the explicit
COLLATE
clause is0
. - If the collations of two strings are incompatible, the coercibility value of the concatenation of two strings with different collations is
1
. Currently, all implemented collations are compatible with each other. - The column's collation has a coercibility value of
2
. - The system constant (the string returned by
USER ()
orVERSION ()
) has a coercibility value of3
. - The coercibility value of constants is
4
. - The coercibility value of numbers or intermediate variables is
5
. NULL
or expressions derived fromNULL
has a coercibility value of6
.
When inferring collations, TiDB prefers using the collation of expressions with lower coercibility values (the same as MySQL). If the coercibility values of two clauses are the same, the collation is determined according to the following priority:
binary > utf8mb4_bin > utf8mb4_general_ci > utf8_bin > utf8_general_ci > latin1_bin > ascii_bin
If the collations of two clauses are different and the coercibility value of both clauses is 0
, TiDB cannot infer the collation and reports an error.
TiDB supports using the COLLATE
clause to specify the collation of an expression. The coercibility value of this expression is 0
, which has the highest priority. See the following example:
{{< copyable "sql" >}}
select 'a' = 'A' collate utf8mb4_general_ci;
+--------------------------------------+
| 'a' = 'A' collate utf8mb4_general_ci |
+--------------------------------------+
| 1 |
+--------------------------------------+
1 row in set (0.00 sec)
For more details, see Connection Character Sets and Collations.