Skip to content

Conversation

CalderWhite
Copy link

First off: I love pqrs!! Thank you so much for creating it and maintaining it :)

The code in this PR is super bare bones and does not implement everything. It was just really annoying to have all my data blow up whenever I would merge my chunks together. At the same time, I did not want to rewrite a compressed merge script for every use case.

The solution I landed on was to use a small config file to specify the most impactful options like compression, compression_level, set_dictionary_enabled and also column level encodings.

Putting this PR up in case you are interested in using it in the main branch. It would be cool to not have to tell people to install my fork haha.

Thanks

Copy link

@trisha trisha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

happy open source friday!

@CalderWhite
Copy link
Author

happy open source friday!

LOLLL

Not intended for the main branch, just CalderWhite/pqrs.
Copy link
Owner

@manojkarthick manojkarthick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey thanks for the PR! I left a few comments.

Comment on lines +31 to +39
#[derive(Debug, Clone, Deserialize)]
pub struct MergeConfig {
pub set_dictionary_enabled: Option<bool>,
/// The encodings for this are the just text values of the enum parquet::basic::Encoding
pub column_encodings: Option<HashMap<String, String>>,
pub column_dictionary_enabled: Option<HashMap<String, bool>>,
pub compression: Option<String>,
pub compression_level: Option<u32>,
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we want to use a config file instead of command line options for these?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When using CLI opts your command blows up to be huge when you have a large number of columns. Using a JSON file is super convenient and easy to read :)

Comment on lines +70 to +77
if !encoding_mappings.contains_key(encoding_str.as_str()) {
return Err(PQRSError::IllegalEncodingType());
}

let encoding = *encoding_mappings
.get(encoding_str.clone().as_str())
.unwrap();
props = props.set_column_encoding(ColumnPath::from(column_name), encoding)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be made idiomatic use pattern matching?


if let Some(column_de) = merge_config.column_dictionary_enabled {
for (column_name, de) in column_de {
println!("{column_name}");
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably don't want println!()s in the release

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops! Sorry.

}

if let Some(compression_algo) = merge_config.compression {
if compression_algo.to_lowercase() == "brotli" {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same for compression_algo, I think we can use pattern matching here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good callout!

Comment on lines +9 to +13
// use jemalloc for release builds
extern crate jemallocator;
#[global_allocator]
static ALLOC: jemallocator::Jemalloc = jemallocator::Jemalloc;

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm unfamiliar with jemallocator. can you give a tl;dr on these performance optimizations? what do they do?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to only use my first commit in the PR? The second commit is performance optimized stuff that only belongs on my fork.

TL;DR this can be removed for the main branch

@CalderWhite
Copy link
Author

Responded to all of your comments! Some of them can be completely removed from the PR, others require some modification and finally the config file is a matter of your opinion for the main repo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants