-
Notifications
You must be signed in to change notification settings - Fork 1.5k
GH-3242: Emit and Read min/max statistics for int96 timestamp columns #3243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
It looks like Alkis started a discussion on the ML, but we probably want to come to a consensus and update https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1079 first before merging this. |
parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveComparator.java
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveComparator.java
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveComparator.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/ValidInt96Stats.java
Outdated
Show resolved
Hide resolved
@emkornfield : I created the PR: apache/parquet-format#503 |
After this change, |
@emkornfield @rdblue Given the latest message in the mailing thread, can we merge this? If so, could you help me with addressing this issue |
Sorry I'm not sure on the API change. Lets get (lazy) consensus on the message for this direction. I think @rdblue had the strongest opinion on SortOrder |
@wgtmac : could you help with this? |
@wgtmac : Could you help with this, or suggest who could help out? Thanks! |
@rahulketch Sorry I'm too busy these days. Perhaps I can find some time next week. |
parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveComparator.java
Outdated
Show resolved
Hide resolved
static final PrimitiveComparator<Binary> BINARY_AS_INT96_TIMESTAMP_COMPARATOR = new BinaryComparator() { | ||
@Override | ||
int compareBinary(Binary b1, Binary b2) { | ||
ByteBuffer bb1 = b1.toByteBuffer().slice(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we check if their lengths are exactly 12 before anything? I recall that BINARY_AS_FLOAT16_COMPARATOR
did this.
parquet-column/src/main/java/org/apache/parquet/ValidInt96Stats.java
Outdated
Show resolved
Hide resolved
|
||
try { | ||
ParsedVersion version = VersionParser.parse(createdBy); | ||
if ("parquet-mr".equals(version.application)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recall that @rdblue has an opinion in maintaining an allow-list here.
|
||
public static Statistics toParquetStatistics(org.apache.parquet.column.statistics.Statistics stats) { | ||
return toParquetStatistics(stats, ParquetProperties.DEFAULT_STATISTICS_TRUNCATE_LENGTH); | ||
public Statistics toParquetStatistics(String createdBy, org.apache.parquet.column.statistics.Statistics stats) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it a breaking change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure of a better way to achieve this. We need to:
- Know created-by in order to have an allow-list for valid readers
- Make the function non-static so we can respect the flag which disables reading the stats.
Any ideas for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just follow the existing pattern of toParquetStatistics
to introduce yet another overload?
public static Statistics toParquetStatistics(
org.apache.parquet.column.statistics.Statistics stats, int truncateLength, bool convertInt96Stats)
@gszadovszky WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @wgtmac.
Or, we can deprecate the static methods (marking them for removal in 2.0) and create non-static ones to be used from now on.
…veComparator.java Co-authored-by: Gang Wu <ustcwg@gmail.com>
…s.java Co-authored-by: Gang Wu <ustcwg@gmail.com>
b107ae8
to
90656f1
Compare
Rationale for this change
Parquet-java does not emit or read stats for int96 timestamp columns. Since int96 is used as the default timestamp in spark, this limits a lot of optimization opportunities. Engines like Photon populate the statistics for the int96 timestamps correctly. So parquet-java can also emit the statistics, and also allow reading these statistics from known good writers.
What changes are included in this PR?
parquet.read.int96stats.enabled
to control if the stats are read. It is defaulted to trueValidInt96Stats
: Reads stats from known good writers. Currently including:a.
parquet-mr 1.15.0+
b.
photon
Are these changes tested?
Are there any user-facing changes?
toParquetStatistics
andfromParquetStatistics
are no longer static functions. It doesn't look like there is a good reason for these functions to be static.ParquetMetadataConverter
is now different due to an added parameter.Closes #3242