Support export parquet format file #33633

Open
glorv opened this issue Jan 8, 2021 · 8 comments
Labels
component/dumpling: This is related to Dumpling of TiDB.
feature/discussing: This feature request is being discussed among product managers.
priority/P2: The issue has P2 priority.
type/feature-request: Categorizes issue or PR as related to a new feature.

Comments

glorv commented Jan 8, 2021

Feature Request

Is your feature request related to a problem? Please describe:

Parquet is a compressed, efficient columnar data format. Lightning already supports loading parquet files into TiDB (pingcap/tidb-lightning#373).

In our DBaaS tests, we found that Aurora snapshot export is unexpectedly slow, so users may want to use Dumpling to export data from Aurora or other data sources into parquet format, since parquet files are much smaller than their SQL/CSV equivalents.

Describe the feature you'd like:

Dumpling should support exporting data in parquet format, just as it does for SQL and CSV.

Describe alternatives you've considered:

Teachability, Documentation, Adoption, Optimization:

glorv commented Jan 8, 2021

@IANTHEREAL PTAL

@kennytm kennytm added the priority/P2 label Jan 8, 2021
@lichunzhu lichunzhu transferred this issue from pingcap/dumpling Mar 31, 2022
@lichunzhu lichunzhu added the component/dumpling, priority/P2, type/feature-request, and feature/discussing labels and removed the priority/P2 label Mar 31, 2022

DCjanus commented Nov 22, 2022

I've done some research on this, and before going any deeper, I'd like us to reach a conclusion about which parquet library we should use.

There are several parquet libraries for Go, for example xitongsys/parquet-go, segmentio/parquet-go, and Apache Arrow's parquet package.

I'm not familiar with any of them; is there a recommendation?

lichunzhu commented Nov 24, 2022

> I've done some research on this, and before going any deeper, I'd like us to reach a conclusion about which parquet library we should use.
>
> There are several parquet libraries for Go, for example xitongsys/parquet-go, segmentio/parquet-go, and Apache Arrow's parquet package.
>
> I'm not familiar with any of them; is there a recommendation?

@DCjanus thanks for your research, and sorry for replying so late.
I have checked these three libraries and prefer Apache Arrow's parquet package. It's more popular and has a complete writer implementation.
Here is an example: https://github.com/apache/arrow/blob/9beb930/go/parquet/pqarrow/encode_arrow_test.go#L1332.
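For reference, a minimal sketch of writing a parquet file with the Arrow Go packages. The v12 module path, the two-column schema, and the output file name are illustrative assumptions, not Dumpling code:

```go
package main

import (
	"os"

	"github.com/apache/arrow/go/v12/arrow"
	"github.com/apache/arrow/go/v12/arrow/array"
	"github.com/apache/arrow/go/v12/arrow/memory"
	"github.com/apache/arrow/go/v12/parquet"
	"github.com/apache/arrow/go/v12/parquet/compress"
	"github.com/apache/arrow/go/v12/parquet/pqarrow"
)

func main() {
	// Nullability is declared per field in the Arrow schema, which is
	// exactly the extra metadata parquet needs compared to CSV/SQL.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "id", Type: arrow.PrimitiveTypes.Int64, Nullable: true},
		{Name: "a1", Type: arrow.BinaryTypes.String, Nullable: true},
	}, nil)

	bldr := array.NewRecordBuilder(memory.DefaultAllocator, schema)
	defer bldr.Release()
	bldr.Field(0).(*array.Int64Builder).AppendValues([]int64{1, 2}, nil)
	bldr.Field(1).(*array.StringBuilder).AppendValues([]string{"a", "b"}, nil)
	rec := bldr.NewRecord()
	defer rec.Release()

	tbl := array.NewTableFromRecords(schema, []arrow.Record{rec})
	defer tbl.Release()

	f, err := os.Create("out.parquet")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Snappy is a common default codec for parquet files.
	props := parquet.NewWriterProperties(parquet.WithCompression(compress.Codecs.Snappy))
	if err := pqarrow.WriteTable(tbl, f, 1024, props, pqarrow.DefaultWriterProps()); err != nil {
		panic(err)
	}
}
```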

DCjanus commented Dec 9, 2022

Previously, Dumpling only needed to support text formats such as CSV and SQL, but parquet is a strongly typed file format with a schema, which requires explicitly declaring whether each field can be null and whether it is signed.

The existing MySQL driver cannot meet this requirement (support for retrieving sign information was added in github.com/go-sql-driver/mysql v1.7, but it still cannot detect whether a field can be null).

Therefore, I think we need to call github.com/pingcap/tidb/parser to parse the create table statement and get more detailed field information in func MakeRowReceiver.

This is the only solution I can think of, but it seems to be a relatively large change. Do you have any better suggestions?
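For illustration, a minimal sketch of that approach using the standalone parser module. The sample DDL is made up, and depending on the parser version the type flag may be read via Tp.Flag instead of Tp.GetFlag():

```go
package main

import (
	"fmt"

	"github.com/pingcap/tidb/parser"
	"github.com/pingcap/tidb/parser/ast"
	"github.com/pingcap/tidb/parser/mysql"
	_ "github.com/pingcap/tidb/parser/test_driver" // value-expression driver required by the parser
)

func main() {
	ddl := "CREATE TABLE t (id INT UNSIGNED NOT NULL, a1 VARCHAR(11))"
	stmt, err := parser.New().ParseOneStmt(ddl, "", "")
	if err != nil {
		panic(err)
	}
	ct := stmt.(*ast.CreateTableStmt)
	for _, col := range ct.Cols {
		notNull := false
		for _, opt := range col.Options {
			if opt.Tp == ast.ColumnOptionNotNull {
				notNull = true
			}
		}
		// The unsigned flag is carried in the column's field type.
		unsigned := mysql.HasUnsignedFlag(col.Tp.GetFlag())
		fmt.Printf("%s: nullable=%v unsigned=%v\n", col.Name.Name.O, !notNull, unsigned)
	}
}
```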

lichunzhu commented Dec 9, 2022

@DCjanus Dumpling runs show columns from xx when it dumps a table and collects its selected fields. Can the Null or Default columns of that result satisfy the demands? (See the sketch after the sample output below.)

mysql> show create table `table`;
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| table | CREATE TABLE `table` (
  `id` int(11) DEFAULT NULL,
  `a1` varchar(11) DEFAULT NULL,
  `a2` varchar(11) DEFAULT NULL,
  `a3` varchar(11) DEFAULT NULL,
  `a4` varchar(11) DEFAULT NULL,
  `a5` varchar(11) DEFAULT NULL,
  `a6` varchar(11) DEFAULT NULL,
  `a7` varchar(11) DEFAULT NULL,
  `a8` varchar(11) DEFAULT NULL,
  `a9` varchar(11) DEFAULT NULL,
  `a10` varchar(11) DEFAULT NULL,
  UNIQUE KEY `id` (`id`),
  KEY `a` (`a1`,`a8`),
  KEY `b` (`a2`,`a3`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> show columns from `table`;
+-------+-------------+------+-----+---------+-------+
| Field | Type        | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| id    | int(11)     | YES  | UNI | NULL    |       |
| a1    | varchar(11) | YES  | MUL | NULL    |       |
| a2    | varchar(11) | YES  | MUL | NULL    |       |
| a3    | varchar(11) | YES  |     | NULL    |       |
| a4    | varchar(11) | YES  |     | NULL    |       |
| a5    | varchar(11) | YES  |     | NULL    |       |
| a6    | varchar(11) | YES  |     | NULL    |       |
| a7    | varchar(11) | YES  |     | NULL    |       |
| a8    | varchar(11) | YES  |     | NULL    |       |
| a9    | varchar(11) | YES  |     | NULL    |       |
| a10   | varchar(11) | YES  |     | NULL    |       |
+-------+-------------+------+-----+---------+-------+
11 rows in set (0.01 sec)
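If the Null and Type columns are sufficient, the schema can be derived without parsing any DDL. A minimal sketch of consuming SHOW COLUMNS through database/sql, assuming go-sql-driver/mysql and a placeholder DSN; signedness is inferred from the Type string:

```go
package main

import (
	"database/sql"
	"fmt"
	"strings"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/test") // placeholder DSN
	if err != nil {
		panic(err)
	}
	defer db.Close()

	rows, err := db.Query("SHOW COLUMNS FROM `table`")
	if err != nil {
		panic(err)
	}
	defer rows.Close()

	// SHOW COLUMNS yields: Field, Type, Null, Key, Default, Extra.
	for rows.Next() {
		var field, ftype, null, key, extra string
		var def sql.NullString // Default may be NULL
		if err := rows.Scan(&field, &ftype, &null, &key, &def, &extra); err != nil {
			panic(err)
		}
		nullable := null == "YES"
		unsigned := strings.Contains(strings.ToLower(ftype), "unsigned")
		fmt.Printf("%s: nullable=%v unsigned=%v\n", field, nullable, unsigned)
	}
	if err := rows.Err(); err != nil {
		panic(err)
	}
}
```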

DCjanus commented Dec 9, 2022

To prevent users from misreading interrupted parquet files, we need to do some extra work. Two potential methods are:

  1. Allocate a large buffer in memory, and only write complete parquet files to external storage.
  2. Write to an external storage stream under a temporary name like xxx.parquet.tmp, then rename it to xxx.parquet once it is complete (sketched below). For LocalFS this is relatively cheap and reduces memory usage, but for storage like S3 the rename has additional overhead (the object must be read back, written to the new key, and the old object deleted), which may not be acceptable for large files or memory-constrained scenarios.

Do you have any suggestions?
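For illustration, a minimal sketch of option 2 on a local filesystem. writeParquetAtomic is a hypothetical helper, not existing Dumpling code:

```go
package main

import (
	"io"
	"os"
)

// writeParquetAtomic is a hypothetical helper: the write callback streams a
// complete parquet file to w; readers never observe a partial xxx.parquet
// because the rename happens only after a successful close.
func writeParquetAtomic(path string, write func(w io.Writer) error) error {
	tmp := path + ".tmp"
	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	if err := write(f); err != nil {
		f.Close()
		os.Remove(tmp)
		return err
	}
	if err := f.Close(); err != nil {
		os.Remove(tmp)
		return err
	}
	// Rename is atomic on POSIX local filesystems; on S3-like storage this
	// step becomes a copy plus delete, which is the overhead noted above.
	return os.Rename(tmp, path)
}

func main() {
	err := writeParquetAtomic("out.parquet", func(w io.Writer) error {
		_, err := w.Write([]byte("stand-in for a full parquet stream"))
		return err
	})
	if err != nil {
		panic(err)
	}
}
```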

lichunzhu commented

@DCjanus, sorry for replying so late. I was on leave for the last few days.

> Allocate a large buffer in memory, and only write complete parquet files to external storage.

This is definitely not okay: the cached parquet file might be too large, which could cause Dumpling to OOM.

I did a simple test of this situation: if the parquet file writer is not closed correctly, the resulting file cannot be read successfully. I think this is enough.
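For context on why that works: a valid parquet file ends with a footer trailer (metadata, a 4-byte footer length, then the magic bytes PAR1), and an interrupted writer never emits that trailer, so readers reject the file. A minimal probe sketch; hasParquetFooter is a hypothetical helper:

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// hasParquetFooter reports whether the file ends with the parquet magic
// bytes "PAR1", which are only written when the writer is closed properly.
func hasParquetFooter(path string) (bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()
	st, err := f.Stat()
	if err != nil {
		return false, err
	}
	if st.Size() < 8 { // too small to hold the magic header plus trailer
		return false, nil
	}
	magic := make([]byte, 4)
	if _, err := f.ReadAt(magic, st.Size()-4); err != nil {
		return false, err
	}
	return bytes.Equal(magic, []byte("PAR1")), nil
}

func main() {
	ok, err := hasParquetFooter("out.parquet")
	fmt.Println(ok, err)
}
```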
