[mysql][polardb-x] Support all charsets for MySQL CDC Connector #1188

ruanhang1993 · 2022-05-17T07:20:44Z

This PR fixes an encoding bug that occurs when a MySQL table uses a different charset at the column level.

Changes:

Debezium's MySqlValueConverters#charsetFor(Column) is reused to get the column charset.
For string types in MySQL, the value is first handled with the JDBC connection charset (UTF_8), and the column charset is then applied.
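
To illustrate the second change, here is a minimal sketch and not the code in this PR. It assumes the value arrives as bytes in the connection charset (UTF-8) and must be re-encoded into the column charset before a charset-aware converter decodes it. The class ColumnCharsetSketch, both of its methods, and the tiny name mapping are hypothetical; Debezium's charsetFor(Column) does the real lookup.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class ColumnCharsetSketch {
    // Hypothetical, deliberately small mapping from MySQL charset names to Java charsets.
    private static final Map<String, Charset> MYSQL_TO_JAVA = new HashMap<>();

    static {
        MYSQL_TO_JAVA.put("utf8", StandardCharsets.UTF_8);
        MYSQL_TO_JAVA.put("utf8mb4", StandardCharsets.UTF_8);
        // Simplification for this sketch: treat MySQL latin1 as ISO-8859-1.
        MYSQL_TO_JAVA.put("latin1", StandardCharsets.ISO_8859_1);
        MYSQL_TO_JAVA.put("ascii", StandardCharsets.US_ASCII);
    }

    /** Returns the Java charset for a MySQL charset name, falling back to UTF-8 (an assumption). */
    public static Charset charsetForColumn(String mysqlCharsetName) {
        return MYSQL_TO_JAVA.getOrDefault(mysqlCharsetName, StandardCharsets.UTF_8);
    }

    /** Decodes bytes with the connection charset (UTF-8), then re-encodes them with the column charset. */
    public static byte[] toColumnCharsetBytes(byte[] connectionBytes, String mysqlCharsetName) {
        String text = new String(connectionBytes, StandardCharsets.UTF_8);
        return text.getBytes(charsetForColumn(mysqlCharsetName));
    }

    public static void main(String[] args) {
        byte[] fromConnection = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] forColumn = toColumnCharsetBytes(fromConnection, "latin1");
        // The UTF-8 form uses two bytes for the accented letter, the latin1 form uses one.
        System.out.println(fromConnection.length + " bytes -> " + forColumn.length + " bytes");
    }
}
```

After this step, decoding the bytes with the column charset gives back the original text, which is the property the converter relies on.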

ruanhang1993 · 2022-05-17T07:22:03Z

@fsk119 Please take a look. Thanks.
BTW, I only added four charset tests in this PR. Do we need to add tests for all of the MySQL charsets?

fsk119 · 2022-05-17T09:15:42Z

Thanks for your contribution. I found that MySqlConnectorITCase#testColumnOptionalWithDefaultValue fails. Could you verify whether the change breaks the test?

> Do we need to add tests for all of the MySQL charsets?

Yes. You should add all of the charsets.

ruanhang1993 · 2022-05-18T11:18:22Z

There was a NullPointerException in my implementation. It should be fixed now. @fsk119

fsk119

Thanks for your great work. I left some comments.

...c/main/java/com/ververica/cdc/connectors/mysql/debezium/task/MySqlSnapshotSplitReadTask.java

...r-mysql-cdc/src/test/java/com/ververica/cdc/connectors/mysql/table/MySqlConnectorITCase.java

ruanhang1993 · 2022-05-23T02:51:29Z

I found that the charset properties are not being applied correctly to the JDBC connection.

This should wait for #674 to be fixed. Once the properties are set correctly, the new tests that use different charsets should pass.

ruanhang1993 · 2022-06-13T09:49:04Z

It seems that this PR is independent of #674. Sorry for the confusion.

fsk119

Thanks for the update. I left some comments.

...c/main/java/com/ververica/cdc/connectors/mysql/debezium/task/MySqlSnapshotSplitReadTask.java

...-cdc/src/test/java/com/ververica/cdc/connectors/mysql/table/MysqlConnectorCharsetITCase.java

fsk119

LGTM

leonardBang

Thanks @ruanhang1993 for the contribution. I left some comments.

flink-connector-mysql-cdc/src/test/resources/ddl/charset_test.sql

...-cdc/src/test/java/com/ververica/cdc/connectors/mysql/table/MysqlConnectorCharsetITCase.java

qidian99 · 2022-08-10T02:55:35Z

@ruanhang1993 Hi Ruan, I sent you a PM with some questions about the details of this PR. I'll copy my questions below. Could you also check your email?

IMHO, two charset transformations happen if the charset used in the JDBC connection is different from that of the table columns. For instance, if the column uses latin1 and JDBC uses utf8, the following transformations will occur when we read the records:
latin1 -> utf8 -> bytes

So if we call getBytes directly and let the user convert it to latin1, the first utf8 transformation is missing. Therefore, we should let JDBC handle the charset conversion for us by calling getObject. Under the hood, these transformations will happen:
bytes -> utf8 -> latin1

Please correct me if there's any misunderstanding.

ruanhang1993 · 2022-08-10T04:57:38Z

I think there is a misunderstanding here.

getObject only converts the value using the connection charset (UTF-8 by default), which is set by characterSetResults. The column charset latin1 only describes the charset of the table column. The encoding of the returned result depends on characterSetResults.

The bug is that if we return a byte[] that is UTF-8 encoded, Debezium will convert the value to a String using the latin1 encoding. So we can return a String instead, and Debezium will not do this processing. getObject eventually invokes getString for these char types, so by using it we do not need the if statement.
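
The mismatch described above can be reproduced with plain Java, without a database. This is only a sketch of the effect: the class CharsetMismatchDemo and its method names are made up for illustration, and ISO-8859-1 stands in for MySQL's latin1.

```java
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {
    /** Simulates the bug: UTF-8 bytes from the connection are decoded with the column charset (latin1). */
    public static String decodeWithColumnCharset(String original) {
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    /** Simulates the fix: the bytes are decoded with the charset they were actually encoded in. */
    public static String decodeWithConnectionCharset(String original) {
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9";
        // The accented letter is two bytes in UTF-8, so the latin1 decode turns it into two characters.
        System.out.println("garbled length: " + decodeWithColumnCharset(original).length());
        System.out.println("correct length: " + decodeWithConnectionCharset(original).length());
    }
}
```

For the input café, the mismatched decode yields the five-character string cafÃ©, while decoding with the connection charset returns the original four characters. Returning a String from the snapshot read avoids the second, mismatched decode.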

leonardBang · 2022-08-10T14:59:53Z

@ruanhang1993 I opened a new PR #1468 to fix the issue based on your work. Please do not use your master branch as a development branch, because others cannot append commits to it.

ruanhang1993 force-pushed the master branch from 782c907 to 34d81be May 18, 2022 08:34

fsk119 reviewed May 20, 2022

...c/main/java/com/ververica/cdc/connectors/mysql/debezium/task/MySqlSnapshotSplitReadTask.java (Outdated)

...r-mysql-cdc/src/test/java/com/ververica/cdc/connectors/mysql/table/MySqlConnectorITCase.java (Outdated)

ruanhang1993 mentioned this pull request May 23, 2022

Some parameters cannot be set on the MySQL connection #674

Closed

ruanhang1993 force-pushed the master branch from 34d81be to 3a3f2a0 June 13, 2022 09:47

fsk119 reviewed Jun 15, 2022

fsk119 approved these changes Jun 15, 2022

ruanhang1993 force-pushed the master branch from d87af39 to e4475dc July 5, 2022 11:45

ruanhang1993 requested a review from fsk119 July 6, 2022 02:29

ruanhang1993 mentioned this pull request Jul 15, 2022

[mysql] fix mysql gbk garbled #1376

Closed

ruanhang1993 force-pushed the master branch from e4475dc to c33db91 July 29, 2022 08:33

leonardBang reviewed Aug 3, 2022

ruanhang1993 force-pushed the master branch 4 times, most recently from 6724de2 to 348cac0 August 10, 2022 02:24

leonardBang changed the title ~~[mysql] Use the right column charset in the snapshot phase.~~ [mysql][polardb-x] Support all charsets for MySQL CDC Connector Aug 10, 2022

leonardBang merged commit 638474d into apache:master Aug 10, 2022

leonardBang force-pushed the master branch from 311bf35 to 638474d August 10, 2022 14:53
