Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[mysql][polardb-x] Support all charsets for MySQL CDC Connector #1188

Merged
merged 0 commits into from
Aug 10, 2022

Conversation

ruanhang1993
Copy link
Contributor

This PR fixed the encoding bug when a mysql table uses a different charset encoding in the column level.

Changes:

  • Debezium MySqlValueConverters#charsetFor(Column) is reused to get the column charset.
  • For String type in mysql, first encoding with the jdbc connection charset(UTF_8). Then use the column charset.

@ruanhang1993
Copy link
Contributor Author

@fsk119 Please take a look. Thanks.
BTW, I only add four charset tests in this PR. Do we need to add all of the charsets in the mysql ?

@fsk119
Copy link
Member

fsk119 commented May 17, 2022

Thanks for your contribution. I find the MySqlConnectorITCase#testColumnOptionalWithDefaultValue fails. Could you verify whether the change break the test?

Do we need to add all of the charsets in the mysql ?

Yes. You should add all of the charsets.

@ruanhang1993
Copy link
Contributor Author

There a NullPointerException in my implementation. Now it should be fixed. @fsk119

Copy link
Member

@fsk119 fsk119 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your great work. I leave some comments.

@ruanhang1993
Copy link
Contributor Author

I find that the charset properties are not using in the jdbc connection rightly.

This issue should wait #674 to be fixed. After setting the properties rightly, we could pass the new tests that use different charset.

@ruanhang1993
Copy link
Contributor Author

ruanhang1993 commented Jun 13, 2022

I find that the charset properties are not using in the jdbc connection rightly.

This issue should wait #674 to be fixed. After setting the properties rightly, we could pass the new tests that use different charset.

It seems that this PR is independent of #674, sorry for confusions.

Copy link
Member

@fsk119 fsk119 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for update. I left some comments.

Copy link
Member

@fsk119 fsk119 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@leonardBang leonardBang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ruanhang1993 for the contribution, I left some comments

@qidian99
Copy link
Contributor

qidian99 commented Aug 10, 2022

@ruanhang1993 Hi Ruan, I pm you some question regarding the details of this PR. I'll copy my question below -- could you also check your email?

IMHO, there are two charset transformations happening if the charset used in JDBC connection is different from that of table columns. For instance if the column uses latin1 and jdbc uses utf8, when we read the records the following transformations will occur:
latin1 -> utf8 -> bytes

So if we directly call getBytes and let the use convert it to latin1, the first utf8 transformation is missing. Therefore we should let jdbc handle charset conversion for us by calling getObject and underlyingly these transformations will happen:
bytes -> utf8 -> latin1

Please correct me if there's any misunderstanding.

@ruanhang1993
Copy link
Contributor Author

ruanhang1993 commented Aug 10, 2022

@ruanhang1993 Hi Ruan, I pm you some question regarding the details of this PR. I'll copy my question below -- could you also check your email?

IMHO, there are two charset transformations happening if the charset used in JDBC connection is different from that of table columns. For instance if the column uses latin1 and jdbc uses utf8, when we read the records the following transformations will occur: latin1 -> utf8 -> bytes

So if we directly call getBytes and let the use convert it to latin1, the first utf8 transformation is missing. Therefore we should let jdbc handle charset conversion for us by calling getObject and underlyingly these transformations will happen: bytes -> utf8 -> latin1

Please correct me if there's any misunderstanding.

I think there is something you misunderstand.

getObject only helps to complete the job that converts the value by a right connection charset(default utf-8) which is set by characterSetResults.The column charset latin1 only means the charset in the table column. The encoding of the returned result is depended by the characterSetResults.

This bug is that if we return a byte[] object which is UTF-8 encoding, the debezium will convert this value to String by a latin1 encoding. So we could return a String and the debezium will not do these processing.getObject will invoke getString for these char types finally. So we do not need the if statement by using it.

@leonardBang leonardBang changed the title [mysql] Use the right column charset in the snapshot phase. [mysql][polardb-x] Support all charsets for MySQL CDC Connector Aug 10, 2022
@leonardBang leonardBang merged commit 638474d into apache:master Aug 10, 2022
@leonardBang
Copy link
Contributor

@ruanhang1993 I open a new PR #1468 to fix the issue base on your work, please do not use you master branch as a develop branch which others can not append commits.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants