-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Orc: Closes #5777 - Obtain ORC stripe offsets from writer #5778
Conversation
45ef127
to
b7350b6
Compare
b7350b6
to
7da48bc
Compare
7da48bc
to
e2cfd13
Compare
…pening a written file
e2cfd13
to
441b327
Compare
@kbendick Hope you are still looking at ORC support on Iceberg. Can you please take a look at this PR? |
return Collections.unmodifiableList(Lists.transform(stripes, StripeInformation::getOffset)); | ||
} catch (IOException e) { | ||
throw new RuntimeIOException(e, "Can't close ORC reader %s", file.location()); | ||
throw new RuntimeIOException( | ||
e, "Cannot receive stripe information from writer for %s", file.location()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Cannot receive" seems like a strange way to state that getStripes
failed. What about Failed to get stripe information from writer for: %s
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, will change.
|
||
@SuppressWarnings("checkstyle:AvoidEscapedUnicodeCharacters") | ||
@Test | ||
public void testInvalidUpperBoundString() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this related to this PR? What is the purpose of this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that there was no unit test on ORC write in the project. I ported the current parquet test and introduced the check on offsets.
If you would prefer that I raise a separate PR on other unit test cases then I can split it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing the additional unit tests, will raise a separate PR for them
try (Reader reader = ORC.newFileReader(file.toInputFile(), conf)) { | ||
List<StripeInformation> stripes = reader.getStripes(); | ||
try { | ||
List<StripeInformation> stripes = writer.getStripes(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change looks great to me. Thanks for updating this, @pavibhai!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good other than an error message and some tests for unrelated cases.
Assert.assertEquals("Written records should match", overflowRecords, writtenRecords); | ||
|
||
// Expect bounds on LONG column | ||
Assert.assertTrue("Should have a valid lower bound", dataFile.lowerBounds().containsKey(1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assert.assertTrue("Should have a valid lower bound", dataFile.lowerBounds().containsKey(1)); | |
Assertions.assertThat(dataFile.lowerBounds()).containsKey(1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will show you the content of the map if the assertion ever fails, rather than just showing true != false.
would be great to apply this to the other places as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @nastra as suggested earlier I am removing the additional tests and will raise a separate PR. Will incorporate this feedback into that PR.
- Updated error message - Removed tests related to bound checks
@rdblue I have addressed the review comments can you please take another look at this? |
Thanks, @pavibhai! Looks great so I merged it. |
What?
Fixes ORCFileAppender such that it can retrieve the stripe offsets from the writer instead of having to reopen the file.
Why?
From ORC 1.7 we have a public API for getting stripe information from the writer. This avoids an additional IO to get stripe offsets
Tests
Added unit test to verify that the offsets are matching the offsets read from the file