ORC: Fix importing ORC files with float and double columns #3320
LGTM! Might want to modify fromOrcMax for the same? Also I wonder if a similar change is required for importing Parquet files as well...
// imported files will not have metrics that were tracked by Iceberg, so fall back to the file's metrics.
min = ((DoubleColumnStatistics) columnStats).getMinimum();
if (type.typeId() == Type.TypeID.FLOAT) {
  min = ((Double) min).floatValue();
Wondering if we want to reassign min to negative infinity if the value tracked by ORC is NaN.
Good point. I forgot to update max.
I have been testing this, but it looks like the same thing needs to be done for the max (this seems to be just the min).
iceberg/orc/src/main/java/org/apache/iceberg/orc/OrcMetrics.java
Lines 236 to 242 in 7a58028
} else if (columnStats instanceof DoubleColumnStatistics) {
  // since Orc includes NaN for upper/lower bounds of floating point columns, and we don't want this behavior,
  // we have tracked metrics for such columns ourselves and thus do not need to rely on Orc's column statistics.
  Preconditions.checkNotNull(fieldMetrics,
      "[BUG] Float or double type columns should have metrics being tracked by Iceberg Orc writers");
  max = fieldMetrics.upperBound();
} else if (columnStats instanceof StringColumnStatistics) {
I'm applying the same changes to max in the branch I'm testing in. So far, the
I added tests for this here: #3332. I first verified that I hit the error, then added the min fix from this PR, got the max error, and then added that fix as well.
Closing in favor of #3332.
The OrcMetrics code assumed that Iceberg metrics would be available, but that isn't the case when importing existing ORC files.
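To make the fallback concrete, here is a minimal, self-contained sketch of the idea discussed above: prefer the bound tracked by the writer when it exists, otherwise fall back to the file's own statistic and narrow it to a float for FLOAT columns. The class name, the `ColumnType` enum, and the method signatures are hypothetical stand-ins for illustration; they are not the actual OrcMetrics code or the Iceberg/ORC APIs. The sketch does not address the NaN question raised in review.

```java
public class FloatBoundsSketch {
  /** Hypothetical stand-in for the column type; not Iceberg's Type.TypeID. */
  public enum ColumnType { FLOAT, DOUBLE }

  /**
   * Returns the lower bound for a floating point column.
   *
   * @param trackedLowerBound bound tracked at write time, or null for imported files
   * @param fileMin minimum reported by the file's own double statistics
   * @param type logical column type
   */
  public static Object lowerBound(Object trackedLowerBound, double fileMin, ColumnType type) {
    return pick(trackedLowerBound, fileMin, type);
  }

  /** Same logic for the upper bound, so min and max stay symmetric. */
  public static Object upperBound(Object trackedUpperBound, double fileMax, ColumnType type) {
    return pick(trackedUpperBound, fileMax, type);
  }

  private static Object pick(Object tracked, double fromFile, ColumnType type) {
    if (tracked != null) {
      // Metrics tracked at write time take precedence when they exist.
      return tracked;
    }
    if (type == ColumnType.FLOAT) {
      // File statistics are doubles; narrow so the bound matches the column type.
      return Float.valueOf((float) fromFile);
    }
    return Double.valueOf(fromFile);
  }

  public static void main(String[] args) {
    // Imported file: no tracked metrics, so the file's statistics are used.
    System.out.println(lowerBound(null, 1.5d, ColumnType.FLOAT));
    System.out.println(upperBound(null, 9.25d, ColumnType.DOUBLE));
    // File written with tracked metrics: the tracked value is returned as-is.
    System.out.println(upperBound(Float.valueOf(3.0f), 9.25d, ColumnType.FLOAT));
  }
}
```

Routing both bounds through one helper is a design choice in this sketch; it avoids the situation reported in the thread, where the min path was fixed but the max path was not.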