Added config file to merge command #48

CalderWhite · 2023-11-17T20:45:26Z

First off: I love pqrs!! Thank you so much for creating it and maintaining it :)

The code in this PR is super bare bones and does not implement everything. It was just really annoying to have all my data blow up in size whenever I merged my chunks together. At the same time, I did not want to rewrite a compressed merge script for every use case.

The solution I landed on was to use a small config file to specify the most impactful options like compression, compression_level, set_dictionary_enabled and also column level encodings.

Putting this PR up in case you are interested in using it in the main branch. It would be cool to not have to tell people to install my fork haha.

Thanks

trisha

happy open source friday!

CalderWhite · 2023-11-18T03:48:44Z

happy open source friday!

LOLLL

Not intended for the main branch, just CalderWhite/pqrs.

manojkarthick

Hey thanks for the PR! I left a few comments.

manojkarthick · 2023-11-27T23:04:30Z

src/commands/merge.rs

+#[derive(Debug, Clone, Deserialize)]
+pub struct MergeConfig {
+    pub set_dictionary_enabled: Option<bool>,
+    /// The encodings for this are just the text values of the enum parquet::basic::Encoding
+    pub column_encodings: Option<HashMap<String, String>>,
+    pub column_dictionary_enabled: Option<HashMap<String, bool>>,
+    pub compression: Option<String>,
+    pub compression_level: Option<u32>,
+}


why do we want to use a config file instead of command line options for these?

When using CLI opts your command blows up to be huge when you have a large number of columns. Using a JSON file is super convenient and easy to read :)

manojkarthick · 2023-11-27T23:05:20Z

src/commands/merge.rs

+            if !encoding_mappings.contains_key(encoding_str.as_str()) {
+                return Err(PQRSError::IllegalEncodingType());
+            }
+
+            let encoding = *encoding_mappings
+                .get(encoding_str.clone().as_str())
+                .unwrap();
+            props = props.set_column_encoding(ColumnPath::from(column_name), encoding)


can this be made more idiomatic using pattern matching?
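A minimal sketch of what the pattern-matching version could look like, assuming the lookup table is a `HashMap<&str, Encoding>`. The `Encoding` and `PQRSError` enums below are small stand-ins so the snippet compiles on its own; they are not the real `parquet::basic::Encoding` or the PR's error type, and `lookup_encoding` is a hypothetical helper name.

```rust
use std::collections::HashMap;

// Stand-in for parquet::basic::Encoding (a few variants only, for illustration).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Encoding {
    Plain,
    Rle,
    DeltaBinaryPacked,
}

// Stand-in for the PR's PQRSError::IllegalEncodingType().
#[derive(Debug, PartialEq)]
pub enum PQRSError {
    IllegalEncodingType,
}

/// Look the name up once and match on the result, instead of
/// calling `contains_key` followed by `get(..).unwrap()`.
pub fn lookup_encoding(
    mappings: &HashMap<&str, Encoding>,
    encoding_str: &str,
) -> Result<Encoding, PQRSError> {
    match mappings.get(encoding_str) {
        Some(encoding) => Ok(*encoding),
        None => Err(PQRSError::IllegalEncodingType),
    }
}
```

The same lookup could also be written as `mappings.get(encoding_str).copied().ok_or(PQRSError::IllegalEncodingType)`, which pairs well with `?` at the call site.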

manojkarthick · 2023-11-27T23:05:57Z

src/commands/merge.rs

+
+    if let Some(column_de) = merge_config.column_dictionary_enabled {
+        for (column_name, de) in column_de {
+            println!("{column_name}");


we probably don't want println!()s in the release

Oops! Sorry.

manojkarthick · 2023-11-27T23:06:28Z

src/commands/merge.rs

+    }
+
+    if let Some(compression_algo) = merge_config.compression {
+        if compression_algo.to_lowercase() == "brotli" {


same for compression_algo, I think we can use pattern matching here?

Good callout!
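One way the compression check might be expressed with `match`, as a sketch only. `CompressionKind` and `parse_compression` are hypothetical names introduced for illustration; the real code would presumably map to the `parquet` crate's compression type and to the PR's own error type, and would also need to handle `compression_level`. Only Brotli and ZSTD are listed because those are the two algorithms mentioned in this PR.

```rust
// Hypothetical stand-in for the compression type used by the writer.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum CompressionKind {
    Brotli,
    Zstd,
}

/// Case-insensitive parse of the `compression` config value using a single
/// `match` instead of a chain of `if x.to_lowercase() == "..."` checks.
pub fn parse_compression(name: &str) -> Result<CompressionKind, String> {
    match name.to_lowercase().as_str() {
        "brotli" => Ok(CompressionKind::Brotli),
        "zstd" => Ok(CompressionKind::Zstd),
        // Any other value is reported back to the caller.
        other => Err(format!("unsupported compression: {other}")),
    }
}
```

Adding another algorithm then means adding one arm, and unknown values fall through to a single error path.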

manojkarthick · 2023-11-27T23:06:58Z

src/main.rs

+// use jemalloc for release builds
+extern crate jemallocator;
+#[global_allocator]
+static ALLOC: jemallocator::Jemalloc = jemallocator::Jemalloc;
+


i'm unfamiliar with jemallocator. can you give a tl;dr on these performance optimizations? what do they do?

Is there a way to only use my first commit in the PR? The second commit is performance optimized stuff that only belongs on my fork.

TL;DR this can be removed for the main branch

CalderWhite · 2023-11-29T11:57:04Z

Responded to all of your comments! Some of the changes can be removed from the PR completely, others need some modification, and whether the config file belongs in the main repo is up to you.

Added config file to merge command.

4b97a54

trisha reviewed Nov 17, 2023


CalderWhite added 2 commits November 20, 2023 17:21

Add ZSTD

d19db7b

Add perf opts.

6ba1fac

Not intended for the main branch, just CalderWhite/pqrs.

manojkarthick reviewed Nov 27, 2023

