Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Orc: Closes #5777 - Obtain ORC stripe offsets from writer #5778

Merged
merged 2 commits into from
Sep 28, 2022

Conversation

pavibhai
Copy link
Contributor

@pavibhai pavibhai commented Sep 16, 2022

What?

Fixes ORCFileAppender such that it can retrieve the stripe offsets from the writer instead of having to reopen the file.

Why?

From ORC 1.7 we have a public API for getting stripe information from the writer. This avoids an additional IO to get stripe offsets

Tests

Added unit test to verify that the offsets are matching the offsets read from the file

@pavibhai pavibhai force-pushed the 5777_writer_stripe_offsets branch from 45ef127 to b7350b6 Compare September 16, 2022 18:59
@pavibhai pavibhai marked this pull request as draft September 16, 2022 19:20
@pavibhai pavibhai force-pushed the 5777_writer_stripe_offsets branch from b7350b6 to 7da48bc Compare September 16, 2022 20:32
@pavibhai pavibhai changed the title Core: Closes #5777 - Obtain ORC stripe offsets from writer Orc: Closes #5777 - Obtain ORC stripe offsets from writer Sep 16, 2022
@pavibhai pavibhai marked this pull request as ready for review September 16, 2022 21:26
@pavibhai pavibhai force-pushed the 5777_writer_stripe_offsets branch from 7da48bc to e2cfd13 Compare September 19, 2022 22:30
@github-actions github-actions bot added the build label Sep 19, 2022
@pavibhai pavibhai force-pushed the 5777_writer_stripe_offsets branch from e2cfd13 to 441b327 Compare September 19, 2022 22:34
@pavibhai
Copy link
Contributor Author

@kbendick Hope you are still looking at ORC support on Iceberg. Can you please take a look at this PR?

return Collections.unmodifiableList(Lists.transform(stripes, StripeInformation::getOffset));
} catch (IOException e) {
throw new RuntimeIOException(e, "Can't close ORC reader %s", file.location());
throw new RuntimeIOException(
e, "Cannot receive stripe information from writer for %s", file.location());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Cannot receive" seems like a strange way to state that getStripes failed. What about Failed to get stripe information from writer for: %s?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, will change.


@SuppressWarnings("checkstyle:AvoidEscapedUnicodeCharacters")
@Test
public void testInvalidUpperBoundString() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this related to this PR? What is the purpose of this test?

Copy link
Contributor Author

@pavibhai pavibhai Sep 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that there was no unit test on ORC write in the project. I ported the current parquet test and introduced the check on offsets.

If you would prefer that I raise a separate PR on other unit test cases then I can split it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the additional unit tests, will raise a separate PR for them

try (Reader reader = ORC.newFileReader(file.toInputFile(), conf)) {
List<StripeInformation> stripes = reader.getStripes();
try {
List<StripeInformation> stripes = writer.getStripes();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks great to me. Thanks for updating this, @pavibhai!

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good other than an error message and some tests for unrelated cases.

Assert.assertEquals("Written records should match", overflowRecords, writtenRecords);

// Expect bounds on LONG column
Assert.assertTrue("Should have a valid lower bound", dataFile.lowerBounds().containsKey(1));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Assert.assertTrue("Should have a valid lower bound", dataFile.lowerBounds().containsKey(1));
Assertions.assertThat(dataFile.lowerBounds()).containsKey(1));

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will show you the content of the map if the assertion ever fails, rather than just showing true != false.
would be great to apply this to the other places as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nastra as suggested earlier I am removing the additional tests and will raise a separate PR. Will incorporate this feedback into that PR.

- Updated error message
- Removed tests related to bound checks
@pavibhai
Copy link
Contributor Author

Looks good other than an error message and some tests for unrelated cases.

@rdblue I have addressed the review comments can you please take another look at this?

@rdblue rdblue merged commit 75707d0 into apache:master Sep 28, 2022
@rdblue
Copy link
Contributor

rdblue commented Sep 28, 2022

Thanks, @pavibhai! Looks great so I merged it.

sunchao pushed a commit to sunchao/iceberg that referenced this pull request May 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants