Skip to content

Conversation

rahulketch
Copy link
Contributor

@rahulketch rahulketch commented Jun 12, 2025

Rationale for this change

Parquet-java does not emit or read stats for int96 timestamp columns. Since int96 is used as the default timestamp in spark, this limits a lot of optimization opportunities. Engines like Photon populate the statistics for the int96 timestamps correctly. So parquet-java can also emit the statistics, and also allow reading these statistics from known good writers.

What changes are included in this PR?

  1. Defining column ordering for int96
  2. Implementation of comparator for int96
  3. A flag parquet.read.int96stats.enabled to control if the stats are read. It is defaulted to true
  4. ValidInt96Stats: Reads stats from known good writers. Currently including:
    a. parquet-mr 1.15.0+
    b. photon
  5. Fixing tests which fail due to this change

Are these changes tested?

  1. Added more unit tests for the new behaviour

Are there any user-facing changes?

  1. toParquetStatistics and fromParquetStatistics are no longer static functions. It doesn't look like there is a good reason for these functions to be static.
  2. Signatures of some of the constructors of ParquetMetadataConverter is now different due to an added parameter.

Closes #3242

@emkornfield
Copy link
Contributor

It looks like Alkis started a discussion on the ML, but we probably want to come to a consensus and update https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1079 first before merging this.

@rahulketch
Copy link
Contributor Author

It looks like Alkis started a discussion on the ML, but we probably want to come to a consensus and update https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1079 first before merging this.

@emkornfield : I created the PR: apache/parquet-format#503

@rahulketch
Copy link
Contributor Author

After this change, toParquetStatistics and fromParquetStatistics in ParquetMetadataConverter are no longer static functions. This breaks a test on compatibility (see here). @emkornfield @rdblue: Do you have advice on how to move forward with those changes?

@rahulketch
Copy link
Contributor Author

@emkornfield @rdblue Given the latest message in the mailing thread, can we merge this? If so, could you help me with addressing this issue

@emkornfield
Copy link
Contributor

@emkornfield @rdblue Given the latest message in the mailing thread, can we merge this? If so, could you help me with addressing this issue

Sorry I'm not sure on the API change. Lets get (lazy) consensus on the message for this direction. I think @rdblue had the strongest opinion on SortOrder

@rahulketch
Copy link
Contributor Author

After this change, toParquetStatistics and fromParquetStatistics in ParquetMetadataConverter are no longer static functions. This breaks a test on compatibility (see here). @emkornfield @rdblue: Do you have advice on how to move forward with those changes?

@wgtmac : could you help with this?

@rahulketch
Copy link
Contributor Author

After this change, toParquetStatistics and fromParquetStatistics in ParquetMetadataConverter are no longer static functions. This breaks a test on compatibility (see here). @emkornfield @rdblue: Do you have advice on how to move forward with those changes?

@wgtmac : could you help with this?

@wgtmac : Could you help with this, or suggest who could help out? Thanks!

@wgtmac
Copy link
Member

wgtmac commented Sep 5, 2025

@rahulketch Sorry I'm too busy these days. Perhaps I can find some time next week.

static final PrimitiveComparator<Binary> BINARY_AS_INT96_TIMESTAMP_COMPARATOR = new BinaryComparator() {
@Override
int compareBinary(Binary b1, Binary b2) {
ByteBuffer bb1 = b1.toByteBuffer().slice();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check if their lengths are exactly 12 before anything? I recall that BINARY_AS_FLOAT16_COMPARATOR did this.


try {
ParsedVersion version = VersionParser.parse(createdBy);
if ("parquet-mr".equals(version.application)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recall that @rdblue has an opinion in maintaining an allow-list here.


public static Statistics toParquetStatistics(org.apache.parquet.column.statistics.Statistics stats) {
return toParquetStatistics(stats, ParquetProperties.DEFAULT_STATISTICS_TRUNCATE_LENGTH);
public Statistics toParquetStatistics(String createdBy, org.apache.parquet.column.statistics.Statistics stats) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it a breaking change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure of a better way to achieve this. We need to:

  1. Know created-by in order to have an allow-list for valid readers
  2. Make the function non-static so we can respect the flag which disables reading the stats.

Any ideas for this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just follow the existing pattern of toParquetStatistics to introduce yet another overload?

  public static Statistics toParquetStatistics(
      org.apache.parquet.column.statistics.Statistics stats, int truncateLength, bool convertInt96Stats)

@gszadovszky WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @wgtmac.

Or, we can deprecate the static methods (marking them for removal in 2.0) and create non-static ones to be used from now on.

rahulketch and others added 2 commits September 9, 2025 14:51
…veComparator.java

Co-authored-by: Gang Wu <ustcwg@gmail.com>
…s.java

Co-authored-by: Gang Wu <ustcwg@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Emit and Read min/max statistics for int96 timestamp columns
6 participants