Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parquet writer does not encode null count = 0 correctly #6256

Open
alamb opened this issue Aug 15, 2024 · 14 comments · May be fixed by #6257
Open

parquet writer does not encode null count = 0 correctly #6256

alamb opened this issue Aug 15, 2024 · 14 comments · May be fixed by #6257
Assignees
Labels

Comments

@alamb
Copy link
Contributor

alamb commented Aug 15, 2024

Describe the bug
TLDR is that the Rust parquet writer does not write out null_counts correctly in row group statistics. However, the reader has the same mistake so systems that both write and read with Rust are unaffected

While reviewing #6216 from @Michael-J-Ward (🙏 ) I am pretty sure the arrow-rs parquet writer does not do the correct thing with respect to null statistics

Specifically, when there are no nulls in the data, the writer does not emit a value for null_count in the thrift metadata (it writes the equivalent of None) -- it should instead write the equivalent of Some(0)

This will not cause issues for people using parquet-rs to read and write data as the reader also (incorrectly) reports Some(0) when the thrift metadata has None

To Reproduce
TBD

Expected behavior

  • When writing statistics for data without nulls, the parquet-rs writer should write Some(0)
  • When reading statistics, the parquet-rs reader should read None

Additional context

@mapleFU
Copy link
Member

mapleFU commented Aug 15, 2024

When reading statistics, the parquet-rs reader should read None

I'm a bit curious here, since like in page-index ( https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1088 ) and page header v2 ( https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L633 ), the null-count is better to have. Should it also Some(0) to represent "nullable type has null count, and it's 0" ?

@alamb alamb changed the title Encode parquet null count statistics correctly Encode parquet null count = 0 correctly Aug 15, 2024
@alamb
Copy link
Contributor Author

alamb commented Aug 15, 2024

When reading statistics, the parquet-rs reader should read None

I'm a bit curious here, since like in page-index ( https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1088 ) and page header v2 ( https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L633 ), the null-count is better to have. Should it also Some(0) to represent "nullable type has null count, and it's 0" ?

I think the parquet-rs writer always has the null count when writing statistics so it is always possible to write Some(...) . At the moment, it writes None (aka missing)

I think we should change the writer to write Some(0)

The difference is when reading a parquet file from some other writer that may not have written a value for the null_count field.

The current parquet-rs code will assume a missing null count field means "0" which might not be correct

Previously to #6216 0 or "missing" were not distinguished in the rust parquet reader API so there was no way to communicate this difference. Now that the API supports the difference we can.

I will try and document this as clearly as I can

@mapleFU
Copy link
Member

mapleFU commented Aug 15, 2024

I think the parquet-rs writer always has the null count when writing statistics so it will always write Some(...)

I've checked parquet-java and parquet-c++, all of these seems to write null-count with 0, but I don't know the case of legacy data. Maybe @wgtmac knows?

@alamb
Copy link
Contributor Author

alamb commented Aug 15, 2024

I agree it is pretty unlikely to find one of these files in practice

@mapleFU
Copy link
Member

mapleFU commented Aug 15, 2024

seems parquet-java distinguishes None and 0. Maybe I can draft a pr in parquet-format to enforcing this or get more info about this...🤔

But otherplace it's checked: https://github.com/apache/parquet-java/blob/d4384d3f2e7703fab6363fd9cd80a001e9561db2/parquet-hadoop/src/main/java/org/apache/parquet/filter2/statisticslevel/StatisticsFilter.java#L128

@alamb
Copy link
Contributor Author

alamb commented Aug 15, 2024

seems parquet-java distinguishes None and 0. Maybe I can draft a pr in parquet-format to enforcing this or get more info about this...🤔

I am not sure how important it is to enforce that this statistics field is set

Note that all the parquet files written by the rust writer so far will have had their null counts set to None...

@mapleFU
Copy link
Member

mapleFU commented Aug 15, 2024

From my aspects, 0 and None is different, and parquet-c++ / parquet-java always writes the null_count if has any value. Perhaps there're some old file write it to 0, and make filter pruning becomes buggy, lol

Anyway, we all agree that

When writing statistics for data without nulls, the parquet-rs writer should write Some(0)

I'll draft a discuss in mailling list about this tomorrow, a bit late in utc-8 and I have to sleep :-)

@alamb
Copy link
Contributor Author

alamb commented Aug 15, 2024

I'll draft a discuss in mailling list about this tomorrow, a bit late in utc-8 and I have to sleep :-)

Thank you . Pleasant dreams 💤

@alamb
Copy link
Contributor Author

alamb commented Aug 15, 2024

From my aspects, 0 and None is different,

Yes, this makes sense to me

and parquet-c++ / parquet-java always writes the null_count if has any value.

Yes, this is what I think the rust writer should do too (I am making a PR to make it so)

@alamb
Copy link
Contributor Author

alamb commented Aug 15, 2024

I made a PR with a proposed fix: #6257 which turned out to be larger than I expected (though it is mostly tests)

@Michael-J-Ward
Copy link
Contributor

page header v2 ( https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L633 ), the null-count is better to have

Is there a semantic difference between DataPageHeaderV2.num_nulls and DataPageHeaderV2.statistics.null_count?

parquet-rs currently uses the same value for both, and truncates num_nulls.

DataPageHeaderV2.num_nulls

let data_page = Page::DataPageV2 {
buf: buffer.into(),
num_values: self.page_metrics.num_buffered_values,
encoding: values_data.encoding,
num_nulls: self.page_metrics.num_page_nulls as u32,
num_rows: self.page_metrics.num_buffered_rows,
def_levels_byte_len: def_levels_byte_len as u32,
rep_levels_byte_len: rep_levels_byte_len as u32,
is_compressed: self.compressor.is_some(),
statistics: page_statistics,
};

DataPageHeaderV2.statistics.null_count
https://github.com/apache/arrow-rs/blob/0f7116bf81a4cce10a6e3ae4ed9b6ee01a20c462/parquet/src/column/writer/mod.rs#L889C1-L906C11


This stuck out to me because #6257 adds the checked conversion for u64 -> i64.

@alamb
Copy link
Contributor Author

alamb commented Aug 15, 2024

I am not sure to be honest 🤔

@alamb alamb changed the title Encode parquet null count = 0 correctly parquet writer does not write null count = 0 correctly Aug 15, 2024
@alamb alamb changed the title parquet writer does not write null count = 0 correctly parquet writer does not encode null count = 0 correctly Aug 15, 2024
@mapleFU
Copy link
Member

mapleFU commented Aug 16, 2024

Is there a semantic difference between DataPageHeaderV2.num_nulls and DataPageHeaderV2.statistics.null_count?

@Michael-J-Ward I think no 🤔

Besides the DataPage V2 seems not used by most community. I can also mention this is maillist.

@mapleFU
Copy link
Member

mapleFU commented Aug 16, 2024

I've update it here: https://lists.apache.org/thread/oqd9lt7p0jtf217hg0667xhk0zvwbgvt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants