Lightning: parse bit string exported as Parquet from Aurora error #37774

Open
buchuitoudegou opened this issue Sep 13, 2022 · 5 comments
Labels: affects-6.3, component/lightning, may-affects-4.0, may-affects-5.0, may-affects-5.1, may-affects-5.2, may-affects-5.3, may-affects-5.4, may-affects-6.0, may-affects-6.1, may-affects-6.2, severity/moderate, type/bug

Comments

@buchuitoudegou
Contributor

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. Export a table with the following schema from Aurora as Parquet:
CREATE TABLE jxtest(
    `id` char(36) NOT NULL,
    `a` bigint unsigned NOT NULL,
    `aa` bigint signed NOT NULL,
    `b` int(11) unsigned NOT NULL,
    `bb` int(11) signed NOT NULL,
    `c` smallint signed NOT NULL,
    `cc` smallint unsigned NOT NULL,
    `d` tinyint signed NOT NULL,
    `dd` tinyint unsigned NOT NULL,
    `e` float unsigned NOT NULL,
    `ee` float signed NOT NULL,
    `f` VARCHAR(30) NOT NULL,
    `ff` TEXT NOT NULL,
    `h` MEDIUMTEXT NOT NULL,
    `hh` LONGTEXT NOT NULL,
    `ii` TINYTEXT NOT NULL,
    `j` DECIMAL NOT NULL,
    `jj` DECIMAL(8,0) NOT NULL,
    `k` DECIMAL(8,8) NOT NULL,
    `kk` DECIMAL(20,0) NOT NULL,
    `l` DECIMAL(20,8) NOT NULL,
    `ll` DECIMAL(36,0) NOT NULL,
    `m` DECIMAL(36,8) NOT NULL,
    `mm` DATE NOT NULL,
    `n` TIME NOT NULL,
    `nn` YEAR NOT NULL,
    `o` DATETIME NOT NULL,
    `oo` BINARY NOT NULL,
    `p` BLOB NOT NULL,
    `pp` LONGBLOB NOT NULL,
    `q` MEDIUMBLOB NOT NULL,
    `qq` TINYBLOB NOT NULL,
    `rr` BIT NOT NULL,
    `s` BOOLEAN NOT NULL,
    `ss` DOUBLE signed NOT NULL,
    `t` DOUBLE unsigned NOT NULL,
    PRIMARY KEY ( `id` ),
    KEY `index_a` (`a`) );
  2. Import the Parquet files using Lightning.

2. What did you expect to see? (Required)

All imported data should be identical to the data in Aurora.

3. What did you see instead (Required)

The values differ for binary-type columns:
In Aurora: b'111111111'
In TiDB: 0x31313131313131313131

4. What is your TiDB version? (Required)

TiDB: v6.2.0
Lightning: v6.2.0

@buchuitoudegou buchuitoudegou added the type/bug The issue is confirmed as a bug. label Sep 13, 2022
@buchuitoudegou
Contributor Author

/component lightning

@ti-chi-bot ti-chi-bot added the component/lightning This issue is related to Lightning of TiDB. label Sep 13, 2022
@buchuitoudegou
Contributor Author

Parsing the Parquet file with pyarrow:
[screenshot of the pyarrow output]
In TiDB:

mysql> select p,pp,q from jxtest;
+--------------------------------------------------------------------------+------------------+----------------------------------------------------------------+
| p                                                                        | pp               | q                                                              |
+--------------------------------------------------------------------------+------------------+----------------------------------------------------------------+
| 0x313131313131313131                                                     | 0x31313131313131 | 0x313131313131313131                                           |
| 0x3131313131313131313131313131313131313131313131313131313131313131313131 | 0x31313131313131 | 0x313131313131313131313131313131313131313131313131313131313131 |
| 0x30                                                                     | 0x31313131313131 | 0x30                                                           |
+--------------------------------------------------------------------------+------------------+----------------------------------------------------------------+
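To make the hex values above easier to read, here is a minimal sketch that decodes them as ASCII. The helper name `decode_hex_column` is hypothetical and only for illustration; it is not part of Lightning or parquet-go.

```python
def decode_hex_column(value: str) -> str:
    """Decode a hex literal such as '0x3131' into the ASCII text it encodes.

    Hypothetical helper for reading the query output above; it assumes the
    bytes are plain ASCII, which holds for the values in this issue.
    """
    # Strip the optional "0x" prefix printed by the mysql client.
    hex_digits = value[2:] if value.lower().startswith("0x") else value
    return bytes.fromhex(hex_digits).decode("ascii")


if __name__ == "__main__":
    # Values copied from the first and third rows of the result set above.
    for raw in ("0x313131313131313131", "0x31313131313131", "0x30"):
        print(raw, "->", decode_hex_column(raw))
```

Every byte decodes to the character '1' or '0', so the columns hold the digits as text rather than the packed bit value.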

@buchuitoudegou
Contributor Author

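The root cause: parquet-go parses the bit string as a plain text string, i.e. b'1111111' => '1111111'. Each digit is then stored as its ASCII byte (0x31 for '1', 0x30 for '0'), which produces the 0x3131... values shown above.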

When the column is cast to a string, the digits match the original value:

mysql> select cast(p as char) from jxtest;
+-------------------------------------+
| cast(p as char)                     |
+-------------------------------------+
| 111111111                           |
| 11111111111111111111111111111111111 |
| 0                                   |
+-------------------------------------+
3 rows in set (0.01 sec)
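Since the digits survive as text, the two representations can be compared directly. The sketch below is only an illustration of the mismatch, not the fix used by Lightning or parquet-go; `text_bytes` and `bit_literal_bytes` are hypothetical names, and the packing follows how MySQL stores a bit-value literal (big-endian, padded to whole bytes).

```python
def text_bytes(digits: str) -> bytes:
    """Bytes stored when the digits are treated as plain text."""
    return digits.encode("ascii")


def bit_literal_bytes(digits: str) -> bytes:
    """Bytes of the bit-value literal b'<digits>', packed big-endian.

    Assumption: the value is padded on the left to a whole number of bytes,
    as MySQL does for bit-value literals used as binary strings.
    """
    if not digits or set(digits) - {"0", "1"}:
        raise ValueError("expected a non-empty string of 0 and 1 characters")
    n_bytes = (len(digits) + 7) // 8
    return int(digits, 2).to_bytes(n_bytes, "big")


if __name__ == "__main__":
    digits = "111111111"
    print("as text:       0x" + text_bytes(digits).hex())
    print("as bit string: 0x" + bit_literal_bytes(digits).hex())
```

For `111111111`, the text form is nine 0x31 bytes while the packed form is the two bytes 0x01ff.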

The output of parquet-go:

schema element: SchemaElement({Type:BYTE_ARRAY TypeLength:<nil> RepetitionType:OPTIONAL Name:P NumChildren:<nil> ConvertedType:<nil> Scale:<nil> Precision:<nil> FieldID:<nil> LogicalType:<nil>}), string: 111111111
schema element: SchemaElement({Type:BYTE_ARRAY TypeLength:<nil> RepetitionType:OPTIONAL Name:Pp NumChildren:<nil> ConvertedType:<nil> Scale:<nil> Precision:<nil> FieldID:<nil> LogicalType:<nil>}), string: 1111111111
schema element: SchemaElement({Type:BYTE_ARRAY TypeLength:<nil> RepetitionType:OPTIONAL Name:Q NumChildren:<nil> ConvertedType:<nil> Scale:<nil> Precision:<nil> FieldID:<nil> LogicalType:<nil>}), string: 111111111
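Note that the schema elements above carry no ConvertedType or LogicalType, so nothing in the file marks these columns as bit strings. As a rough sketch of how suspect columns could be flagged, the heuristic below checks for an untyped BYTE_ARRAY whose value consists only of '0' and '1' characters. Both function names and the dict layout are assumptions made for illustration; this is not how Lightning or parquet-go behaves, and a legitimate text value such as "1010" would be a false positive.

```python
def is_untyped_byte_array(element: dict) -> bool:
    """True if a schema element is a BYTE_ARRAY with no type annotation.

    `element` is a hypothetical dict mirroring the fields printed above.
    """
    return (
        element.get("Type") == "BYTE_ARRAY"
        and element.get("ConvertedType") is None
        and element.get("LogicalType") is None
    )


def looks_like_bit_text(value: bytes) -> bool:
    """True if the value is non-empty and holds only ASCII '0'/'1' bytes."""
    return len(value) > 0 and all(b in b"01" for b in value)


if __name__ == "__main__":
    element = {"Type": "BYTE_ARRAY", "ConvertedType": None, "LogicalType": None}
    value = b"111111111"
    if is_untyped_byte_array(element) and looks_like_bit_text(value):
        print("column may hold a bit string rendered as text")
```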

@lance6716
Contributor

Should we fix it in parquet-go instead?

@buchuitoudegou
Contributor Author

> Should we fix it in parquet-go instead?

Filed an issue in parquet-go: xitongsys/parquet-go#496
But I don't have much time to help fix it in the upstream repo 😭

@ti-chi-bot ti-chi-bot added may-affects-4.0 This bug maybe affects 4.0.x versions. may-affects-5.0 This bug maybe affects 5.0.x versions. may-affects-5.1 This bug maybe affects 5.1.x versions. may-affects-5.2 This bug maybe affects 5.2.x versions. may-affects-5.3 This bug maybe affects 5.3.x versions. may-affects-5.4 This bug maybe affects 5.4.x versions. may-affects-6.0 may-affects-6.1 may-affects-6.2 may-affects-6.3 labels Sep 14, 2022