Support export parquet format file #33633

Open
glorv opened this issue Jan 8, 2021 · 8 comments
Labels
component/dumpling: This is related to Dumpling of TiDB.
feature/discussing: This feature request is being discussed among product managers.
priority/P2: The issue has P2 priority.
type/feature-request: Categorizes issue or PR as related to a new feature.

Comments

glorv commented Jan 8, 2021

Feature Request

Is your feature request related to a problem? Please describe:

Parquet is a compressed, efficient columnar data format. Lightning already supports loading parquet files into TiDB (pingcap/tidb-lightning#373).

In our DBaaS tests, we found that Aurora snapshot export is unexpectedly slow, so users may want to use Dumpling to export data from Aurora or other data sources into parquet format, since parquet files are much smaller than their SQL/CSV equivalents.

Describe the feature you'd like:

Dumpling should support exporting data in parquet format, just as it does for SQL and CSV.

Describe alternatives you've considered:

Teachability, Documentation, Adoption, Optimization:

glorv commented Jan 8, 2021

@IANTHEREAL PTAL

@kennytm kennytm added the priority/P2 label Jan 8, 2021
@lichunzhu lichunzhu transferred this issue from pingcap/dumpling Mar 31, 2022
@lichunzhu lichunzhu added the component/dumpling, priority/P2, type/feature-request, and feature/discussing labels and removed the priority/P2 label Mar 31, 2022

DCjanus commented Nov 22, 2022

I've done some research on this, and before going any deeper, I'd like us to reach a conclusion about which parquet library we should use.

There are several parquet libraries for Go, for example xitongsys/parquet-go, segmentio/parquet-go, and Apache Arrow's parquet package.

I'm not familiar with any of them; is there a recommendation?

lichunzhu commented Nov 24, 2022

> I've done some research on this, and before going any deeper, I'd like us to reach a conclusion about which parquet library we should use.
>
> There are several parquet libraries for Go, for example xitongsys/parquet-go, segmentio/parquet-go, and Apache Arrow's parquet package.
>
> I'm not familiar with any of them; is there a recommendation?

@DCjanus thanks for your research, and sorry for replying so late.
I have checked these three libraries and prefer Apache Arrow's parquet package. It's more popular and has a complete writer implementation.
Here is an example: https://github.com/apache/arrow/blob/9beb930/go/parquet/pqarrow/encode_arrow_test.go#L1332.
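For reference, a minimal sketch of writing a parquet file with the Arrow Go packages. The v12 module path, the two-column schema, and the output file name are illustrative assumptions, not Dumpling code:

```go
package main

import (
	"os"

	"github.com/apache/arrow/go/v12/arrow"
	"github.com/apache/arrow/go/v12/arrow/array"
	"github.com/apache/arrow/go/v12/arrow/memory"
	"github.com/apache/arrow/go/v12/parquet"
	"github.com/apache/arrow/go/v12/parquet/compress"
	"github.com/apache/arrow/go/v12/parquet/pqarrow"
)

func main() {
	// Nullability is declared per field in the Arrow schema, which is
	// exactly the extra metadata parquet needs compared to CSV/SQL.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "id", Type: arrow.PrimitiveTypes.Int64, Nullable: true},
		{Name: "a1", Type: arrow.BinaryTypes.String, Nullable: true},
	}, nil)

	bldr := array.NewRecordBuilder(memory.DefaultAllocator, schema)
	defer bldr.Release()
	bldr.Field(0).(*array.Int64Builder).AppendValues([]int64{1, 2}, nil)
	bldr.Field(1).(*array.StringBuilder).AppendValues([]string{"a", "b"}, nil)
	rec := bldr.NewRecord()
	defer rec.Release()

	tbl := array.NewTableFromRecords(schema, []arrow.Record{rec})
	defer tbl.Release()

	f, err := os.Create("out.parquet")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Snappy is a common default codec for parquet files.
	props := parquet.NewWriterProperties(parquet.WithCompression(compress.Codecs.Snappy))
	if err := pqarrow.WriteTable(tbl, f, 1024, props, pqarrow.DefaultWriterProps()); err != nil {
		panic(err)
	}
}
```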

DCjanus commented Dec 9, 2022

Previously, Dumpling only needed to support text formats such as CSV and SQL, but parquet is a strongly typed file format with a schema, which requires explicitly declaring whether each field can be null and whether it is signed.

The existing MySQL driver cannot meet this requirement (support for retrieving sign information was added in github.com/go-sql-driver/mysql v1.7, but it still cannot detect whether a field can be null).

Therefore, I think we need to call github.com/pingcap/tidb/parser to parse the create table statement and get more detailed field information in func MakeRowReceiver.

This is the only solution I can think of, but it seems to be a relatively large change. Do you have any better suggestions?
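For illustration, a minimal sketch of that approach using the standalone parser module. The sample DDL is made up, and depending on the parser version the type flag may be read via Tp.Flag instead of Tp.GetFlag():

```go
package main

import (
	"fmt"

	"github.com/pingcap/tidb/parser"
	"github.com/pingcap/tidb/parser/ast"
	"github.com/pingcap/tidb/parser/mysql"
	_ "github.com/pingcap/tidb/parser/test_driver" // value-expression driver required by the parser
)

func main() {
	ddl := "CREATE TABLE t (id INT UNSIGNED NOT NULL, a1 VARCHAR(11))"
	stmt, err := parser.New().ParseOneStmt(ddl, "", "")
	if err != nil {
		panic(err)
	}
	ct := stmt.(*ast.CreateTableStmt)
	for _, col := range ct.Cols {
		notNull := false
		for _, opt := range col.Options {
			if opt.Tp == ast.ColumnOptionNotNull {
				notNull = true
			}
		}
		// The unsigned flag is carried in the column's field type.
		unsigned := mysql.HasUnsignedFlag(col.Tp.GetFlag())
		fmt.Printf("%s: nullable=%v unsigned=%v\n", col.Name.Name.O, !notNull, unsigned)
	}
}
```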

lichunzhu commented Dec 9, 2022

@DCjanus Dumpling runs show columns from xx when it dumps a table and collects its selected fields. Can the Null or Default columns of that result satisfy the demands? (See the sketch after the sample output below.)

mysql> show create table `table`;
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| table | CREATE TABLE `table` (
  `id` int(11) DEFAULT NULL,
  `a1` varchar(11) DEFAULT NULL,
  `a2` varchar(11) DEFAULT NULL,
  `a3` varchar(11) DEFAULT NULL,
  `a4` varchar(11) DEFAULT NULL,
  `a5` varchar(11) DEFAULT NULL,
  `a6` varchar(11) DEFAULT NULL,
  `a7` varchar(11) DEFAULT NULL,
  `a8` varchar(11) DEFAULT NULL,
  `a9` varchar(11) DEFAULT NULL,
  `a10` varchar(11) DEFAULT NULL,
  UNIQUE KEY `id` (`id`),
  KEY `a` (`a1`,`a8`),
  KEY `b` (`a2`,`a3`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> show columns from `table`;
+-------+-------------+------+-----+---------+-------+
| Field | Type        | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| id    | int(11)     | YES  | UNI | NULL    |       |
| a1    | varchar(11) | YES  | MUL | NULL    |       |
| a2    | varchar(11) | YES  | MUL | NULL    |       |
| a3    | varchar(11) | YES  |     | NULL    |       |
| a4    | varchar(11) | YES  |     | NULL    |       |
| a5    | varchar(11) | YES  |     | NULL    |       |
| a6    | varchar(11) | YES  |     | NULL    |       |
| a7    | varchar(11) | YES  |     | NULL    |       |
| a8    | varchar(11) | YES  |     | NULL    |       |
| a9    | varchar(11) | YES  |     | NULL    |       |
| a10   | varchar(11) | YES  |     | NULL    |       |
+-------+-------------+------+-----+---------+-------+
11 rows in set (0.01 sec)
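If the Null and Type columns are sufficient, the schema can be derived without parsing any DDL. A minimal sketch of consuming SHOW COLUMNS through database/sql, assuming go-sql-driver/mysql and a placeholder DSN; signedness is inferred from the Type string:

```go
package main

import (
	"database/sql"
	"fmt"
	"strings"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/test") // placeholder DSN
	if err != nil {
		panic(err)
	}
	defer db.Close()

	rows, err := db.Query("SHOW COLUMNS FROM `table`")
	if err != nil {
		panic(err)
	}
	defer rows.Close()

	// SHOW COLUMNS yields: Field, Type, Null, Key, Default, Extra.
	for rows.Next() {
		var field, ftype, null, key, extra string
		var def sql.NullString // Default may be NULL
		if err := rows.Scan(&field, &ftype, &null, &key, &def, &extra); err != nil {
			panic(err)
		}
		nullable := null == "YES"
		unsigned := strings.Contains(strings.ToLower(ftype), "unsigned")
		fmt.Printf("%s: nullable=%v unsigned=%v\n", field, nullable, unsigned)
	}
	if err := rows.Err(); err != nil {
		panic(err)
	}
}
```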

DCjanus commented Dec 9, 2022

To prevent users from misreading interrupted parquet files, we need to do some extra work. Two potential methods are:

  1. Allocate a large buffer in memory, and only write complete parquet files to external storage.
  2. Write to an external storage stream under a temporary name like xxx.parquet.tmp, then rename it to xxx.parquet once it is complete (sketched below). For LocalFS this is relatively cheap and reduces memory usage, but for storage like S3 the rename has additional overhead (the object must be read back, written to the new key, and the old object deleted), which may not be acceptable for large files or memory-constrained scenarios.

Do you have any suggestions?
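For illustration, a minimal sketch of option 2 on a local filesystem. writeParquetAtomic is a hypothetical helper, not existing Dumpling code:

```go
package main

import (
	"io"
	"os"
)

// writeParquetAtomic is a hypothetical helper: the write callback streams a
// complete parquet file to w; readers never observe a partial xxx.parquet
// because the rename happens only after a successful close.
func writeParquetAtomic(path string, write func(w io.Writer) error) error {
	tmp := path + ".tmp"
	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	if err := write(f); err != nil {
		f.Close()
		os.Remove(tmp)
		return err
	}
	if err := f.Close(); err != nil {
		os.Remove(tmp)
		return err
	}
	// Rename is atomic on POSIX local filesystems; on S3-like storage this
	// step becomes a copy plus delete, which is the overhead noted above.
	return os.Rename(tmp, path)
}

func main() {
	err := writeParquetAtomic("out.parquet", func(w io.Writer) error {
		_, err := w.Write([]byte("stand-in for a full parquet stream"))
		return err
	})
	if err != nil {
		panic(err)
	}
}
```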

lichunzhu commented

@DCjanus, sorry for replying so late. I was on leave for the last few days.

> Allocate a large buffer in memory, and only write complete parquet files to external storage.

This is definitely not okay: the cached parquet file might be too large, which could cause Dumpling to OOM.

I did a simple test of this situation: if the parquet file writer is not closed correctly, the resulting file cannot be read successfully. I think this is enough.
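For context on why that works: a valid parquet file ends with a footer trailer (metadata, a 4-byte footer length, then the magic bytes PAR1), and an interrupted writer never emits that trailer, so readers reject the file. A minimal probe sketch; hasParquetFooter is a hypothetical helper:

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// hasParquetFooter reports whether the file ends with the parquet magic
// bytes "PAR1", which are only written when the writer is closed properly.
func hasParquetFooter(path string) (bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()
	st, err := f.Stat()
	if err != nil {
		return false, err
	}
	if st.Size() < 8 { // too small to hold the magic header plus trailer
		return false, nil
	}
	magic := make([]byte, 4)
	if _, err := f.ReadAt(magic, st.Size()-4); err != nil {
		return false, err
	}
	return bytes.Equal(magic, []byte("PAR1")), nil
}

func main() {
	ok, err := hasParquetFooter("out.parquet")
	fmt.Println(ok, err)
}
```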
