charset: implement utf8_unicode_ci and utf8mb4_unicode_ci collation (#18776) #22558

Merged
merged 9 commits into from
Jan 28, 2021


Contributor

@ti-srebot ti-srebot commented Jan 27, 2021

cherry-pick #18776 to release-4.0
You can switch your code base to this Pull Request by using git-extras:

# In tidb repo:
git pr https://github.com/pingcap/tidb/pull/22558

After applying your modifications, you can push your changes to this PR via:

git push git@github.com:ti-srebot/tidb.git pr/22558:release-4.0-ba60cf5a69cf

What problem does this PR solve?

This is the second PR for #17596. It aims to implement the utf8_unicode_ci and utf8mb4_unicode_ci collations.

What is changed and how it works?

What is changed:

  • Add a large lookup table generated from DUCET (the Default Unicode Collation Element Table)

How it Works:

  • Implement UCA with Unicode version 4.0.0 (the same as MySQL; see charset-unicode-sets-uca)
  • Mapping: each Unicode character is converted to one or more collation weights at several levels. The mapping may be one-to-one, one-to-many (expansion), many-to-one (contraction), or many-to-many (contraction)
  • UCA uses four levels to compare two Unicode strings; xxx_unicode_ci uses only the primary level (L1), which makes it accent-insensitive (ai) and case-insensitive (ci)
  • MySQL's utf8mb4_unicode_ci does not support contractions (see charset-unicode-sets-uca)
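
The primary-level comparison described in the list above can be sketched as follows. This is a minimal illustration, not the code in this PR: the `primaryWeights` table, its numeric values, the `weights` and `comparePrimary` helpers, and the fallback for unmapped characters are all hypothetical stand-ins for the real DUCET-derived table.

```go
package main

import "fmt"

// primaryWeights is a tiny, made-up stand-in for the DUCET-derived table.
// The numeric values are illustrative only and are NOT real UCA 4.0.0 weights.
var primaryWeights = map[rune][]uint32{
	'a': {1}, 'A': {1}, '\u00e1': {1}, // a, A, and a-acute share one primary weight
	'b': {2}, 'B': {2},
	'e': {3}, 'E': {3},
	'r': {4}, 'R': {4},
	's': {5}, 'S': {5},
	't': {6}, 'T': {6},
	'\u00df': {5, 5}, // sharp s: one-to-many expansion, weighs like "ss"
	'\u0301': {},     // combining acute accent: no primary weight at all
}

// weights maps a string to its sequence of primary weights.
// Assumption: unmapped runes fall back to a weight derived from the code
// point so that they sort after every mapped rune.
func weights(s string) []uint32 {
	var out []uint32
	for _, r := range s {
		if w, ok := primaryWeights[r]; ok {
			out = append(out, w...)
		} else {
			out = append(out, 0x10000+uint32(r))
		}
	}
	return out
}

// comparePrimary compares two strings by primary weights only,
// returning -1, 0, or 1.
func comparePrimary(a, b string) int {
	wa, wb := weights(a), weights(b)
	for i := 0; i < len(wa) && i < len(wb); i++ {
		if wa[i] < wb[i] {
			return -1
		}
		if wa[i] > wb[i] {
			return 1
		}
	}
	switch {
	case len(wa) < len(wb):
		return -1
	case len(wa) > len(wb):
		return 1
	}
	return 0
}

func main() {
	fmt.Println(comparePrimary("a", "A"))                 // case is ignored
	fmt.Println(comparePrimary("a\u0301", "a"))           // accent is ignored
	fmt.Println(comparePrimary("stra\u00dfe", "STRASSE")) // expansion
	fmt.Println(comparePrimary("a", "b"))
}
```

Because only one weight level is consulted, differences in case and accents never reach the comparison, which is where the ai and ci behavior comes from. Contractions are left out of the sketch on purpose, in line with the last bullet.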

benchmark

Here are the benchmarks for the Compare and Key functions:

goos: linux
goarch: amd64
pkg: github.com/pingcap/tidb/util/collate
BenchmarkUtf8mb4Bin_CompareShort-8              175546567                7.12 ns/op
BenchmarkUtf8mb4GeneralCI_CompareShort-8         2656393               453 ns/op
BenchmarkUtf8mb4UnicodeCI_CompareShort-8         2521083               409 ns/op
BenchmarkUtf8mb4Bin_CompareMid-8                160179999                7.36 ns/op
BenchmarkUtf8mb4GeneralCI_CompareMid-8             31842             36862 ns/op
BenchmarkUtf8mb4UnicodeCI_CompareMid-8             32714             37254 ns/op
BenchmarkUtf8mb4Bin_CompareLong-8               159598683                7.38 ns/op
BenchmarkUtf8mb4GeneralCI_CompareLong-8               28          40149168 ns/op
BenchmarkUtf8mb4UnicodeCI_CompareLong-8               30          39463787 ns/op
BenchmarkUtf8mb4Bin_KeyShort-8                  25229853                45.0 ns/op
BenchmarkUtf8mb4GeneralCI_KeyShort-8             3883850               266 ns/op
BenchmarkUtf8mb4UnicodeCI_KeyShort-8             4620993               254 ns/op
BenchmarkUtf8mb4Bin_KeyMid-8                     1491789               791 ns/op
BenchmarkUtf8mb4GeneralCI_KeyMid-8                 52681             22532 ns/op
BenchmarkUtf8mb4UnicodeCI_KeyMid-8                 55576             21164 ns/op
BenchmarkUtf8mb4Bin_KeyLong-8                       3147            505012 ns/op
BenchmarkUtf8mb4GeneralCI_KeyLong-8                   42          28282321 ns/op
BenchmarkUtf8mb4UnicodeCI_KeyLong-8                   57          21806637 ns/op

String lengths: Short = 32, Mid = 2048, Long = 2097152.

Tests

  • Unit test
  • Integration test

Release note

  • Implement utf8_unicode_ci and utf8mb4_unicode_ci collation

Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
@ti-srebot
Contributor Author

/run-all-tests

@ti-srebot
Contributor Author

@xiongjiwei you're already a collaborator in bot's repo.

@bb7133
Member

bb7133 commented Jan 27, 2021

Please fix the conflict.

@xiongjiwei
Contributor

/run-all-tests

@xiongjiwei
Contributor

/run-all-tests

Member

@wjhuang2016 wjhuang2016 left a comment

LGTM

@ti-srebot ti-srebot added the status/LGT1 Indicates that a PR has LGTM 1. label Jan 28, 2021
@bb7133
Member

bb7133 commented Jan 28, 2021

LGTM

@ti-srebot ti-srebot removed the status/LGT1 Indicates that a PR has LGTM 1. label Jan 28, 2021
@ti-srebot ti-srebot added the status/LGT2 Indicates that a PR has LGTM 2. label Jan 28, 2021
@bb7133
Member

bb7133 commented Jan 28, 2021

/merge

@ti-srebot ti-srebot added the status/can-merge Indicates a PR has been approved by a committer. label Jan 28, 2021
@ti-srebot
Contributor Author

/run-all-tests

@ti-srebot
Contributor Author

@ti-srebot merge failed.

@qw4990
Contributor

qw4990 commented Jan 28, 2021

Wait for #21877.

@qw4990
Contributor

qw4990 commented Jan 28, 2021

/run-all-tests

@qw4990
Contributor

qw4990 commented Jan 28, 2021

/run-all-tests

@qw4990
Contributor

qw4990 commented Jan 28, 2021

/run-integration-copr-test

@xiongjiwei
Contributor

/run-unit-test

@qw4990 qw4990 merged commit 1f5b303 into pingcap:release-4.0 Jan 28, 2021
Labels
component/executor component/expression contribution This PR is from a community contributor. sig/execution SIG execution sig/sql-infra SIG: SQL Infra status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2. type/4.0-cherry-pick
5 participants