Support export parquet format file #33633
Comments
@IANTHEREAL PTAL |
I've done some research on this, and before going deeper, I hope we can reach a conclusion about which parquet library to use. There are several parquet libraries for Go, for example xitongsys/parquet-go, segmentio/parquet-go, and Apache Parquet. I'm not familiar with any of them. Is there a recommendation? |
@DCjanus thanks for your research. Sorry for replying so late. |
Previously, Dumpling only needed to support text formats such as CSV and SQL, but parquet is a strongly typed file format with a schema, which requires explicitly declaring whether a field can be null and whether it is signed. The existing MySQL driver cannot meet this requirement (support for getting sign information has been added in [...]). Therefore, I think we need to call [...]. This is the only solution I can think of, but the change seems relatively large. Do you have any better suggestions? |
@DCjanus Dumpling will use
mysql> show create table `table`;
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| table | CREATE TABLE `table` (
`id` int(11) DEFAULT NULL,
`a1` varchar(11) DEFAULT NULL,
`a2` varchar(11) DEFAULT NULL,
`a3` varchar(11) DEFAULT NULL,
`a4` varchar(11) DEFAULT NULL,
`a5` varchar(11) DEFAULT NULL,
`a6` varchar(11) DEFAULT NULL,
`a7` varchar(11) DEFAULT NULL,
`a8` varchar(11) DEFAULT NULL,
`a9` varchar(11) DEFAULT NULL,
`a10` varchar(11) DEFAULT NULL,
UNIQUE KEY `id` (`id`),
KEY `a` (`a1`,`a8`),
KEY `b` (`a2`,`a3`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> show columns from `table`;
+-------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| id | int(11) | YES | UNI | NULL | |
| a1 | varchar(11) | YES | MUL | NULL | |
| a2 | varchar(11) | YES | MUL | NULL | |
| a3 | varchar(11) | YES | | NULL | |
| a4 | varchar(11) | YES | | NULL | |
| a5 | varchar(11) | YES | | NULL | |
| a6 | varchar(11) | YES | | NULL | |
| a7 | varchar(11) | YES | | NULL | |
| a8 | varchar(11) | YES | | NULL | |
| a9 | varchar(11) | YES | | NULL | |
| a10 | varchar(11) | YES | | NULL | |
+-------+-------------+------+-----+---------+-------+
11 rows in set (0.01 sec) |
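For illustration, the nullability and sign information that parquet needs could be derived from `SHOW COLUMNS` rows like the ones above. This is only a minimal sketch, not Dumpling's actual code: `ColumnMeta`, `FieldSpec`, and `toFieldSpec` are hypothetical names, the type mapping is deliberately simplified, and it is not tied to any particular parquet library.

```go
package main

import "strings"

// ColumnMeta is a hypothetical holder for one row of SHOW COLUMNS output.
type ColumnMeta struct {
	Field string // column name
	Type  string // e.g. "int(11)", "bigint(20) unsigned", "varchar(11)"
	Null  string // "YES" or "NO"
}

// FieldSpec is a hypothetical, library-neutral description of a parquet field.
type FieldSpec struct {
	Name     string
	Physical string // parquet physical type name, e.g. "INT32"
	Optional bool   // true when the column accepts NULL
	Unsigned bool   // only meaningful for integer columns
}

// toFieldSpec maps one column description to a field description.
// Assumption: any type not listed here falls back to BYTE_ARRAY (string-like).
func toFieldSpec(c ColumnMeta) FieldSpec {
	t := strings.ToLower(strings.TrimSpace(c.Type))

	// Strip the display width and modifiers: "int(11) unsigned" -> "int".
	base := t
	if i := strings.IndexAny(t, "( "); i >= 0 {
		base = t[:i]
	}

	spec := FieldSpec{
		Name:     c.Field,
		Optional: strings.EqualFold(c.Null, "YES"),
	}

	switch base {
	case "tinyint", "smallint", "mediumint", "int", "integer":
		spec.Physical = "INT32"
		spec.Unsigned = strings.Contains(t, "unsigned")
	case "bigint":
		spec.Physical = "INT64"
		spec.Unsigned = strings.Contains(t, "unsigned")
	case "float":
		spec.Physical = "FLOAT"
	case "double":
		spec.Physical = "DOUBLE"
	default:
		spec.Physical = "BYTE_ARRAY"
	}
	return spec
}
```

With the table above, `id` (`int(11)`, `Null = YES`) would become an optional, signed INT32 field, and each `varchar(11)` column an optional BYTE_ARRAY field.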
To prevent users from misreading interrupted parquet files, we need to do some extra work. Two potential methods are:
Do you have any suggestions? |
@DCjanus, sorry for replying so late. I have been on leave for the last few days.
This is definitely not okay. The cached parquet file might be too large, which may cause Dumpling to run out of memory (OOM). I ran a simple test of this situation. I think we can't read the parquet file successfully if we don't correctly close the parquet file writer. I think this is enough. |
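The reason an unclosed file is unreadable is that the parquet format puts its metadata in a footer at the end of the file: a complete file starts with the 4-byte magic `PAR1` and ends with the footer metadata, a 4-byte little-endian footer length, and `PAR1` again. A writer that is interrupted before it is closed never emits that footer. As a rough sketch of how a reader could detect this cheaply (the function name `looksComplete` is hypothetical, and the check only validates the framing, not the metadata itself):

```go
package main

import (
	"bytes"
	"encoding/binary"
)

// parquetMagic is the 4-byte marker at the start and end of a parquet file.
var parquetMagic = []byte("PAR1")

// looksComplete reports whether data has the framing of a finished parquet
// file: leading magic, trailing magic, and a footer length that fits inside
// the file. It does not parse or validate the footer metadata.
func looksComplete(data []byte) bool {
	// Smallest possible framing: header magic + footer length + footer magic.
	if len(data) < 12 {
		return false
	}
	if !bytes.HasPrefix(data, parquetMagic) || !bytes.HasSuffix(data, parquetMagic) {
		return false
	}
	// The 4 bytes before the trailing magic hold the footer length.
	footerLen := binary.LittleEndian.Uint32(data[len(data)-8 : len(data)-4])
	return int64(footerLen) <= int64(len(data)-12)
}
```

A file cut off in the middle of a row group would fail the trailing-magic check, which matches the observation that such a file cannot be read.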
Feature Request
Is your feature request related to a problem? Please describe:
Parquet is a compressed, efficient columnar data format. Lightning already supports loading parquet files into TiDB as of pingcap/tidb-lightning#373.
In our DBaaS tests, we found that Aurora snapshot export is unexpectedly slow, so users may want to use Dumpling to export data from Aurora or other data sources into parquet format, since parquet files are much smaller than SQL/CSV files.
Describe the feature you'd like:
Dumpling should support exporting data in parquet format, just as it does for SQL and CSV.
Describe alternatives you've considered:
Teachability, Documentation, Adoption, Optimization: