@tustvold
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

Adds a CLI tool for efficiently concatenating parquet files. This could definitely be made more sophisticated, but it serves as a demo of how to use the new API added in #4269 while also providing some utility to users.
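
To illustrate why this kind of concatenation can be cheap, here is a minimal, std-only sketch of the general idea: already-encoded column chunk bytes are copied verbatim into the output, and only their recorded offsets are rebased. This is a toy model, not the code in this PR and not the parquet crate's API. `ChunkMeta`, `append_chunks`, and `concat` are hypothetical names invented for the example, and a real parquet file carries far more metadata (schema, row groups, footer) than is modelled here.

```rust
use std::io::{self, Read, Seek, SeekFrom, Write};

/// Hypothetical stand-in for column chunk metadata: where the chunk's
/// encoded bytes start in a file and how long they are.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ChunkMeta {
    offset: u64,
    len: u64,
}

/// Copy each chunk's encoded bytes from `input` to `output` without
/// decoding them, returning metadata rebased to the output's offsets.
/// `written` tracks how many bytes the output already holds.
fn append_chunks<R: Read + Seek, W: Write>(
    input: &mut R,
    chunks: &[ChunkMeta],
    output: &mut W,
    written: &mut u64,
) -> io::Result<Vec<ChunkMeta>> {
    let mut rebased = Vec::with_capacity(chunks.len());
    for chunk in chunks {
        input.seek(SeekFrom::Start(chunk.offset))?;
        // Limit the reader to exactly this chunk's byte range.
        let copied = io::copy(&mut input.by_ref().take(chunk.len), output)?;
        if copied != chunk.len {
            return Err(io::Error::new(
                io::ErrorKind::UnexpectedEof,
                "chunk extends past end of input",
            ));
        }
        rebased.push(ChunkMeta { offset: *written, len: chunk.len });
        *written += copied;
    }
    Ok(rebased)
}

/// Concatenate the chunks of several inputs, in order, into one output.
fn concat<R: Read + Seek, W: Write>(
    inputs: &mut [(R, Vec<ChunkMeta>)],
    output: &mut W,
) -> io::Result<Vec<ChunkMeta>> {
    let mut written = 0;
    let mut all = Vec::new();
    for (reader, chunks) in inputs.iter_mut() {
        all.extend(append_chunks(reader, chunks, output, &mut written)?);
    }
    Ok(all)
}
```

Because the payload bytes are never decoded or re-encoded, the cost in this model is dominated by I/O, and the only bookkeeping is the offset rebasing.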

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label May 24, 2023
Contributor

@alamb alamb left a comment

I recommend adding a doc comment from append_column added in #4269 to this binary as an example of how to use it

I tested this out with some local parquet data:

parquet-concat combined.parquet 1.parquet 2.parquet
(arrow_dev) alamb@MacBook-Pro-8:~/Downloads$ du -s -h 1.parquet 2.parquet combined.parquet
 69M	1.parquet
 40K	2.parquet
 69M	combined.parquet
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 9332578         |
+-----------------+
1 row in set. Query took 0.007 seconds.
❯ select count(*) from '2.parquet';
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 637             |
+-----------------+
❯ select count(*) from 'combined.parquet';
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 9333215         |
+-----------------+
❯ select avg(value) from (select value from '1.parquet' UNION ALL select value from '2.parquet');
+----------------------+
| AVG(1.parquet.value) |
+----------------------+
| 82.11578603943015    |
+----------------------+
1 row in set. Query took 0.035 seconds.
❯ select avg(value) from 'combined.parquet';
+-----------------------------+
| AVG(combined.parquet.value) |
+-----------------------------+
| 82.11578603943015           |
+-----------------------------+
1 row in set. Query took 0.015 seconds.

Works great for me 🚀

@tustvold
Contributor Author

I recommend adding a doc comment from append_column added in #4269 to this binary as an example of how to use it

I'm not actually sure how to do this; I will link ArrowWriter following #3871, as I think that will be more discoverable.
