Add size statistics to ParquetMetaData introduced in PARQUET-2261 #5486

Closed
54 commits
a7e41c3
regen thrift with size statistics added
etseidl Feb 7, 2024
788eef3
first cut at adding page size statistics
etseidl Feb 9, 2024
6296ada
add new stats to chunk metadata test
etseidl Feb 16, 2024
84f3d7a
Merge branch 'apache:master' into size_stats
etseidl Mar 8, 2024
0da05a8
fix escapes
etseidl Mar 12, 2024
7301aeb
Merge remote-tracking branch 'origin/master' into size_stats
etseidl Mar 12, 2024
6e5fece
format
etseidl Mar 12, 2024
457eb4a
formatting
etseidl Mar 12, 2024
18a5732
add escapes
etseidl Mar 12, 2024
658512e
Merge remote-tracking branch 'origin/master' into size_stats
etseidl Mar 12, 2024
81c2b2e
Merge remote-tracking branch 'origin/master' into size_stats
etseidl Apr 29, 2024
29dde50
Merge branch 'size_stats' of github.com:etseidl/arrow-rs into size_stats
etseidl Jun 27, 2024
84f8512
Merge remote-tracking branch 'origin/master' into size_stats
etseidl Jun 27, 2024
9635e5e
add test of SizeStatistics.unencoded_byte_array_data_bytes
etseidl Jun 27, 2024
c5c07b6
test def histogram as well, rename test
etseidl Jun 27, 2024
6dd160f
add an assert
etseidl Jun 27, 2024
917b412
refactor and add test of def histogram with nulls
etseidl Jun 27, 2024
f8961a3
add test of repetition level histogram
etseidl Jun 28, 2024
73fa099
revert changes to test_roundtrip
etseidl Jun 28, 2024
00ca596
suggestion from review
etseidl Jul 1, 2024
6acc500
add to documentation as suggested in review
etseidl Jul 1, 2024
787e3e8
make histograms optional
etseidl Jul 2, 2024
46851f4
add histograms to PageIndex
etseidl Jul 2, 2024
4f8487b
use Vec::push()
etseidl Jul 2, 2024
903b06b
formatting
etseidl Jul 2, 2024
fa89836
check size stats in read metadata
etseidl Jul 2, 2024
2800cc7
check unencoded_byte_array_data_bytes is not set for int cols
etseidl Jul 2, 2024
95a0535
rewrite test_byte_array_size_statistics() to not use test_roundtrip()
etseidl Jul 2, 2024
fc66a59
add unencoded_byte_array_data_bytes support in page index
etseidl Jul 2, 2024
542570f
Merge remote-tracking branch 'origin/master' into size_stats
etseidl Jul 2, 2024
7be97e5
update expected sizes to account for new stats
etseidl Jul 2, 2024
f5ab47b
only write SizeStatistics in ColumnMetaData if statistics are enabled
etseidl Jul 3, 2024
a008e9e
add a little documentation
etseidl Jul 5, 2024
87ccec2
add ParquetOffsetIndex to avoid double read of OffsetIndex
etseidl Jul 5, 2024
3eead30
cleanup
etseidl Jul 5, 2024
ddf40c3
use less verbose update of variable_length_bytes
etseidl Jul 5, 2024
0ebb72f
add some documentation
etseidl Jul 6, 2024
393aea1
update to latest thrift (as of 11 Jul 2024) from parquet-format
etseidl Jul 11, 2024
1c12fb8
pass None for optional size statistics
etseidl Jul 11, 2024
53cd5fa
escape HTML tags
etseidl Jul 11, 2024
45f25a8
Merge remote-tracking branch 'origin/master' into size_stats
etseidl Jul 11, 2024
98025cc
don't need to escape brackets in arrays
etseidl Jul 11, 2024
7b59246
Merge remote-tracking branch 'github/update_parquet_thrift' into size…
etseidl Jul 11, 2024
65096dd
use consistent naming
etseidl Jul 11, 2024
08065ad
suggested doc changes
etseidl Jul 11, 2024
1cbd4b7
more suggested doc changes
etseidl Jul 11, 2024
dce3513
use more asserts in tests
etseidl Jul 11, 2024
f661839
move histogram logic into PageMetrics and ColumnMetrics
etseidl Jul 12, 2024
818a614
refactor some to reduce code duplication, finish docs
etseidl Jul 12, 2024
c391dec
account for new size statistics in heap size calculations
etseidl Jul 12, 2024
4816a95
add histogram examples to docs
etseidl Jul 12, 2024
e2faf2d
Merge remote-tracking branch 'origin/master' into size_stats
etseidl Jul 12, 2024
d92ae20
add some fixmes
etseidl Jul 14, 2024
69dd652
leave not to self
etseidl Jul 15, 2024
2 changes: 1 addition & 1 deletion parquet/regen.sh
@@ -17,7 +17,7 @@
# specific language governing permissions and limitations
# under the License.

-REVISION=46cc3a0647d301bb9579ca8dd2cc356caf2a72d2
+REVISION=5b564f3c47679526cf72e54f207013f28f53acc4

SOURCE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]:-$0}")" && pwd)"

17 changes: 16 additions & 1 deletion parquet/src/arrow/arrow_writer/byte_array.rs
@@ -96,6 +96,7 @@ macro_rules! downcast_op {
struct FallbackEncoder {
encoder: FallbackEncoderImpl,
num_values: usize,
variable_length_bytes: i64,
}

/// The fallback encoder in use
@@ -152,6 +153,7 @@ impl FallbackEncoder {
Ok(Self {
encoder,
num_values: 0,
variable_length_bytes: 0,
})
}

@@ -168,7 +170,8 @@ impl FallbackEncoder {
let value = values.value(*idx);
let value = value.as_ref();
buffer.extend_from_slice((value.len() as u32).as_bytes());
-buffer.extend_from_slice(value)
+buffer.extend_from_slice(value);
+self.variable_length_bytes += value.len() as i64;
}
}
FallbackEncoderImpl::DeltaLength { buffer, lengths } => {
@@ -177,6 +180,7 @@ impl FallbackEncoder {
let value = value.as_ref();
lengths.put(&[value.len() as i32]).unwrap();
buffer.extend_from_slice(value);
self.variable_length_bytes += value.len() as i64;
}
}
FallbackEncoderImpl::Delta {
@@ -205,6 +209,7 @@ impl FallbackEncoder {
buffer.extend_from_slice(&value[prefix_length..]);
prefix_lengths.put(&[prefix_length as i32]).unwrap();
suffix_lengths.put(&[suffix_length as i32]).unwrap();
self.variable_length_bytes += value.len() as i64;
}
}
}
@@ -269,12 +274,16 @@ impl FallbackEncoder {
}
};

let variable_length_bytes = Some(self.variable_length_bytes);
self.variable_length_bytes = 0;

Ok(DataPageValues {
buf: buf.into(),
num_values: std::mem::take(&mut self.num_values),
encoding,
min_value,
max_value,
variable_length_bytes,
})
}
}
@@ -321,6 +330,7 @@ impl Storage for ByteArrayStorage {
struct DictEncoder {
interner: Interner<ByteArrayStorage>,
indices: Vec<u64>,
variable_length_bytes: i64,
}

impl DictEncoder {
@@ -336,6 +346,7 @@ impl DictEncoder {
let value = values.value(*idx);
let interned = self.interner.intern(value.as_ref());
self.indices.push(interned);
self.variable_length_bytes += value.as_ref().len() as i64;
}
}

@@ -384,12 +395,16 @@ impl DictEncoder {

self.indices.clear();

let variable_length_bytes = Some(self.variable_length_bytes);
self.variable_length_bytes = 0;

DataPageValues {
buf: encoder.consume().into(),
num_values,
encoding: Encoding::RLE_DICTIONARY,
min_value,
max_value,
variable_length_bytes,
}
}
}
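
Both encoders in this file follow the same pattern: add each value's raw length to a running counter while encoding, then report and reset that counter when a data page is flushed. A minimal standalone sketch of that pattern follows. `PageByteCounter`, `encode`, and `flush_page` are hypothetical names used only for illustration; they are not arrow-rs types or methods.

```rust
/// Hypothetical stand-in for the encoders in this diff (not an arrow-rs type).
struct PageByteCounter {
    /// Unencoded bytes seen since the last page flush.
    variable_length_bytes: i64,
}

impl PageByteCounter {
    fn new() -> Self {
        Self { variable_length_bytes: 0 }
    }

    /// Count the raw length of every value, regardless of how the
    /// value would later be encoded (plain, delta, dictionary, ...).
    fn encode(&mut self, values: &[&[u8]]) {
        for value in values {
            self.variable_length_bytes += value.len() as i64;
        }
    }

    /// Report the total for the page being flushed and reset the
    /// counter to zero so the next page starts fresh.
    fn flush_page(&mut self) -> Option<i64> {
        Some(std::mem::take(&mut self.variable_length_bytes))
    }
}

fn main() {
    let mut counter = PageByteCounter::new();
    counter.encode(&[&b"foo"[..], &b"quux"[..]]);
    // 3 + 4 bytes were seen for the first page.
    assert_eq!(counter.flush_page(), Some(7));
    // Nothing was written since the flush, so the second page reports zero.
    assert_eq!(counter.flush_page(), Some(0));
}
```

The sketch uses `std::mem::take` for the report-and-reset step; the diff instead copies the value into `Some(...)` and assigns `0` explicitly, which has the same effect.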
1 change: 1 addition & 0 deletions parquet/src/arrow/async_reader/mod.rs
@@ -1538,6 +1538,7 @@ mod tests {
vec![row_group_meta],
None,
Some(vec![offset_index.clone()]),
None,
);

let metadata = Arc::new(metadata);
8 changes: 8 additions & 0 deletions parquet/src/column/writer/encoder.rs
@@ -63,6 +63,7 @@ pub struct DataPageValues<T> {
pub encoding: Encoding,
pub min_value: Option<T>,
pub max_value: Option<T>,
pub variable_length_bytes: Option<i64>,
}

/// A generic encoder of [`ColumnValues`] to data and dictionary pages used by
@@ -131,6 +132,7 @@ pub struct ColumnValueEncoderImpl<T: DataType> {
min_value: Option<T::T>,
max_value: Option<T::T>,
bloom_filter: Option<Sbbf>,
variable_length_bytes: Option<i64>,
}

impl<T: DataType> ColumnValueEncoderImpl<T> {
@@ -150,6 +152,10 @@ impl<T: DataType> ColumnValueEncoderImpl<T> {
update_min(&self.descr, &min, &mut self.min_value);
update_max(&self.descr, &max, &mut self.max_value);
}

if let Some(var_bytes) = T::T::variable_length_bytes(slice) {
*self.variable_length_bytes.get_or_insert(0) += var_bytes;
}
}

// encode the values into bloom filter if enabled
@@ -203,6 +209,7 @@ impl<T: DataType> ColumnValueEncoder for ColumnValueEncoderImpl<T> {
bloom_filter,
min_value: None,
max_value: None,
variable_length_bytes: None,
})
}

@@ -296,6 +303,7 @@ impl<T: DataType> ColumnValueEncoder for ColumnValueEncoderImpl<T> {
num_values: std::mem::take(&mut self.num_values),
min_value: self.min_value.take(),
max_value: self.max_value.take(),
variable_length_bytes: self.variable_length_bytes.take(),
})
}
}
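
The generic encoder keeps the counter as an `Option<i64>`. It stays `None` for fixed-width types, is lazily initialised with `get_or_insert(0)` when a type reports variable-length bytes, and is handed off with `take()` at flush time. The sketch below reproduces that shape in isolation. The `VariableLengthBytes` trait, its impls, and `Encoder` are assumptions made for illustration; the diff only shows the call `T::T::variable_length_bytes(slice)` and not how it is defined.

```rust
/// Hypothetical trait mirroring the shape of the call seen in the diff.
trait VariableLengthBytes: Sized {
    /// Total unencoded bytes in `values`, or `None` for fixed-width types.
    fn variable_length_bytes(_values: &[Self]) -> Option<i64> {
        None
    }
}

/// Fixed-width type: keeps the default and reports nothing.
impl VariableLengthBytes for i32 {}

/// Variable-width type: sums the length of every value.
impl VariableLengthBytes for Vec<u8> {
    fn variable_length_bytes(values: &[Self]) -> Option<i64> {
        Some(values.iter().map(|v| v.len() as i64).sum::<i64>())
    }
}

/// Hypothetical encoder holding only the per-page counter.
struct Encoder {
    variable_length_bytes: Option<i64>,
}

impl Encoder {
    fn write<T: VariableLengthBytes>(&mut self, slice: &[T]) {
        if let Some(var_bytes) = T::variable_length_bytes(slice) {
            // Initialise to 0 on first use, then accumulate.
            *self.variable_length_bytes.get_or_insert(0) += var_bytes;
        }
    }

    /// Hand the page total to the caller and leave `None` behind.
    fn flush(&mut self) -> Option<i64> {
        self.variable_length_bytes.take()
    }
}

fn main() {
    // Integer columns never set the counter.
    let mut ints = Encoder { variable_length_bytes: None };
    ints.write(&[1i32, 2, 3]);
    assert_eq!(ints.flush(), None);

    // Byte-array columns accumulate across writes within one page.
    let mut bytes = Encoder { variable_length_bytes: None };
    bytes.write(&[b"ab".to_vec(), b"cde".to_vec()]);
    bytes.write(&[b"f".to_vec()]);
    assert_eq!(bytes.flush(), Some(6));
    assert_eq!(bytes.flush(), None);
}
```

Using `Option` rather than a plain `0` lets a reader of the page metadata distinguish "not applicable to this type" from "zero bytes written", which matches the commit above that checks the field is not set for int columns.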