ORC: Fix importing ORC files with float and double columns #3320

Closed
rdblue wants to merge 1 commit

@rdblue rdblue commented Oct 19, 2021

The OrcMetrics code assumed that Iceberg metrics would be available, but that isn't the case when importing existing ORC files.
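
A minimal sketch of the failure mode described above and of the fallback, not the actual `OrcMetrics` code. `TrackedMetrics` and `FileStats` are hypothetical stand-ins for the Iceberg-tracked field metrics and ORC's column statistics; only the idea of the null check comes from this PR.

```java
import java.util.Objects;

final class MinFallbackSketch {
  // Hypothetical stand-in for metrics tracked by Iceberg's own ORC writer.
  static final class TrackedMetrics {
    final double lowerBound;
    TrackedMetrics(double lowerBound) { this.lowerBound = lowerBound; }
  }

  // Hypothetical stand-in for the statistics stored in the ORC file itself.
  static final class FileStats {
    final double minimum;
    FileStats(double minimum) { this.minimum = minimum; }
  }

  // Assumed old behavior: throws for imported files, which have no tracked metrics.
  static double strictMin(TrackedMetrics tracked) {
    Objects.requireNonNull(tracked, "metrics should be tracked by Iceberg ORC writers");
    return tracked.lowerBound;
  }

  // Assumed new behavior: prefer tracked metrics, otherwise use the file's statistics.
  static double minWithFallback(TrackedMetrics tracked, FileStats stats) {
    return tracked != null ? tracked.lowerBound : stats.minimum;
  }
}
```

For an imported file, `tracked` is `null`, so `strictMin` throws while `minWithFallback` returns the minimum recorded in the file.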

@github-actions github-actions bot added the ORC label Oct 19, 2021
@rdblue rdblue requested a review from yyanyy October 19, 2021 23:16

@yyanyy yyanyy left a comment

LGTM! Might want to modify fromOrcMax for the same? Also I wonder if a similar change is required for importing parquet files as well...

// imported files will not have metrics that were tracked by Iceberg, so fall back to the file's metrics.
min = ((DoubleColumnStatistics) columnStats).getMinimum();
if (type.typeId() == Type.TypeID.FLOAT) {
  min = ((Double) min).floatValue();

Wondering if we want to reassign min to negative infinity if the value tracked by ORC is NaN.
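
One way to read this suggestion, as a sketch only: comparisons against NaN are always false in Java, so a NaN lower bound cannot be used to decide whether a file might match a predicate. Replacing it with negative infinity keeps the bound valid, if loose. `sanitizeMin` is a hypothetical helper name, and the thread does not say whether the final fix does this.

```java
final class NanMinSketch {
  // Hypothetical helper: swap a NaN minimum for negative infinity so the
  // result is still a correct (though uninformative) lower bound.
  static double sanitizeMin(double orcMin) {
    return Double.isNaN(orcMin) ? Double.NEGATIVE_INFINITY : orcMin;
  }
}
```

The symmetric treatment for a maximum would presumably use `Double.POSITIVE_INFINITY`.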

rdblue commented Oct 20, 2021

Good point. I forgot to update max.

@kbendick kbendick left a comment

I have been testing this, but it looks like the same thing needs to be done for the max (this seems to be just the min).

} else if (columnStats instanceof DoubleColumnStatistics) {
  // since Orc includes NaN for upper/lower bounds of floating point columns, and we don't want this behavior,
  // we have tracked metrics for such columns ourselves and thus do not need to rely on Orc's column statistics.
  Preconditions.checkNotNull(fieldMetrics,
      "[BUG] Float or double type columns should have metrics being tracked by Iceberg Orc writers");
  max = fieldMetrics.upperBound();
} else if (columnStats instanceof StringColumnStatistics) {
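
A sketch of what the max side could look like once the precondition above is relaxed, mirroring the min change quoted earlier in this thread. The method name and parameters are hypothetical; in particular, passing the tracked bound as a nullable `Object` is an assumption made to keep the example self-contained.

```java
final class MaxFallbackSketch {
  // trackedUpperBound is assumed to be null for imported files.
  static Object maxBound(Object trackedUpperBound, double fileMaximum, boolean isFloatColumn) {
    if (trackedUpperBound != null) {
      return trackedUpperBound;
    }
    // Fall back to the file's statistics. The file reports a double even for
    // float columns, so narrow the value to match the column type.
    if (isFloatColumn) {
      return (float) fileMaximum; // boxed as Float
    }
    return fileMaximum; // boxed as Double
  }
}
```

The narrowing uses `if`/`else` rather than a conditional expression because `cond ? (float) x : x` would promote both operands back to `double`.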

@kbendick

> Good point. I forgot to update max.

I'm applying the same changes to max in the branch I'm testing in. So far, the min fixes seem to work.

@kbendick

I added tests for this here: #3332

I first verified that I hit the error, then added in the min fix here, got the max error, and then added that fix as well.

rdblue commented Oct 20, 2021

Closing in favor of #3332.

@rdblue rdblue closed this Oct 20, 2021