-
Notifications
You must be signed in to change notification settings - Fork 35
Added config file to merge command #48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
happy open source friday!
LOLLL |
Not intended for the main branch, just CalderWhite/pqrs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey thanks for the PR! I left a few comments.
#[derive(Debug, Clone, Deserialize)] | ||
pub struct MergeConfig { | ||
pub set_dictionary_enabled: Option<bool>, | ||
/// The encodings for this are the just text values of the enum parquet::basic::Encoding | ||
pub column_encodings: Option<HashMap<String, String>>, | ||
pub column_dictionary_enabled: Option<HashMap<String, bool>>, | ||
pub compression: Option<String>, | ||
pub compression_level: Option<u32>, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we want to use a config file instead of command line options for these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When using CLI opts your command blows up to be huge when you have a large number of columns. Using a JSON file is super convenient and easy to read :)
if !encoding_mappings.contains_key(encoding_str.as_str()) { | ||
return Err(PQRSError::IllegalEncodingType()); | ||
} | ||
|
||
let encoding = *encoding_mappings | ||
.get(encoding_str.clone().as_str()) | ||
.unwrap(); | ||
props = props.set_column_encoding(ColumnPath::from(column_name), encoding) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be made idiomatic use pattern matching?
|
||
if let Some(column_de) = merge_config.column_dictionary_enabled { | ||
for (column_name, de) in column_de { | ||
println!("{column_name}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we probably don't want println!()
s in the release
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops! Sorry.
} | ||
|
||
if let Some(compression_algo) = merge_config.compression { | ||
if compression_algo.to_lowercase() == "brotli" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same for compression_algo
, I think we can use pattern matching here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good callout!
// use jemalloc for release builds | ||
extern crate jemallocator; | ||
#[global_allocator] | ||
static ALLOC: jemallocator::Jemalloc = jemallocator::Jemalloc; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm unfamiliar with jemallocator
. can you give a tl;dr on these performance optimizations? what do they do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to only use my first commit in the PR? The second commit is performance optimized stuff that only belongs on my fork.
TL;DR this can be removed for the main branch
Responded to all of your comments! Some of them can be completely removed from the PR, others require some modification and finally the config file is a matter of your opinion for the main repo. |
First off: I love pqrs!! Thank you so much for creating it and maintaining it :)
The code in this PR is super bare bones and does not implement everything. It was just really annoying to have all my data blow up whenever I would merge my chunks together. At the same time, I did not want to rewrite a compressed merge script for every use case.
The solution I landed on was to use a small config file to specify the most impactful options like compression, compression_level, set_dictionary_enabled and also column level encodings.
Putting this PR up in case you are interested in using it in the main branch. It would be cool to not have to tell people to install my fork haha.
Thanks