Skip to content

GH-3406: fix Float16 statistics handling for NaN and zero values#3407

Open
HuaHuaY wants to merge 1 commit intoapache:masterfrom
HuaHuaY:fix_issue_3406
Open

GH-3406: fix Float16 statistics handling for NaN and zero values#3407
HuaHuaY wants to merge 1 commit intoapache:masterfrom
HuaHuaY:fix_issue_3406

Conversation

@HuaHuaY
Copy link
Contributor

@HuaHuaY HuaHuaY commented Feb 27, 2026

Rationale for this change

  1. When creating column index and NaN is the min or max value, column index will not be generated.
  2. When creating statistic and the min value is +0, the min value will be changed to -0. When the max value is -0, it will be changed to +0.

What changes are included in this PR?

  • In Statistics<>.Float16Builder and BinaryColumnIndexBuilder, this PR will adjust value when the min or max value is +0 or -0.
  • In BinaryColumnIndexBuilder, this PR will mark the column index is invalid when the min or max value is NaN.
  • In Statistics<>.Float16Builder, statistic will be changed to (+0.0, +0.0) which is the same behavior as Double.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes. This PR corrects the behavior about Float16.

Closes #3406

Copilot AI review requested due to automatic review settings February 27, 2026 09:12
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes Float16 min/max edge cases so statistics and column indexes behave correctly when encountering NaN and signed-zero values, aligning behavior with Float/Double semantics and preventing invalid column indexes from being produced.

Changes:

  • Add shared Float16 Binary constants for +0/-0 and reuse them in statistics/index logic.
  • Normalize Float16 min/max for signed zero and invalidate Float16 column indexes when NaN is present.
  • Update/extend Hadoop round-trip tests to cover signed-zero normalization and NaN column-index invalidation.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
parquet-hadoop/src/test/java/org/apache/parquet/statistics/TestFloat16Statistics.java Updates expected min/max placeholders for NaN stats handling.
parquet-hadoop/src/test/java/org/apache/parquet/statistics/TestFloat16ReadWriteRoundTrip.java Adjusts zero-related expectations and adds a NaN column-index test expecting no index.
parquet-column/src/main/java/org/apache/parquet/schema/Float16.java Introduces shared +0/-0 Float16 Binary constants.
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/BinaryColumnIndexBuilder.java Adds Float16-specific normalization (+0/-0) and NaN invalidation for column index creation.
parquet-column/src/main/java/org/apache/parquet/column/statistics/Statistics.java Reuses Float16 constants and aligns NaN placeholder behavior with float/double builders.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@HuaHuaY HuaHuaY force-pushed the fix_issue_3406 branch 2 times, most recently from 3809385 to b42fcfb Compare February 27, 2026 09:33
@HuaHuaY
Copy link
Contributor Author

HuaHuaY commented Feb 27, 2026

Please take a look again @wgtmac

@HuaHuaY HuaHuaY requested a review from wgtmac February 27, 2026 10:14
@wgtmac
Copy link
Member

wgtmac commented Feb 27, 2026

@gszadovszky Could you help review this as well? I think this fix makes a lot of sense to be consistent with what currently float and double types do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Float16 treats Nan, +0 and -0 differently than Double.

4 participants