Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Orc: Support row group bloom filters #5313

Merged
merged 5 commits into from
Oct 20, 2022
Merged

Orc: Support row group bloom filters #5313

merged 5 commits into from
Oct 20, 2022

Conversation

a49a
Copy link
Contributor

@a49a a49a commented Jul 20, 2022

Issue link: #5218

Add write.orc.bloom.filter.columns and write.orc.bloom.filter.fpp options.
Enable these options in ORC class.

@a49a a49a marked this pull request as draft July 20, 2022 09:02
@a49a a49a changed the title Orc: Support row group bloom filters [WIP] Orc: Support row group bloom filters Jul 20, 2022
@a49a a49a changed the title [WIP] Orc: Support row group bloom filters Orc: Support row group bloom filters Aug 3, 2022
@a49a a49a marked this pull request as ready for review August 3, 2022 06:57
@a49a
Copy link
Contributor Author

a49a commented Aug 5, 2022

cc @rdblue @RussellSpitzer @huaxingao @kbendick Please review it, thanks a lot.

@huaxingao
Copy link
Contributor

For the test, shall we also check the read path to make sure bloom filters are there?

@a49a
Copy link
Contributor Author

a49a commented Aug 22, 2022

For the test, shall we also check the read path to make sure bloom filters are there?

I find the ORC SDK will auto recognize the bloom filter in the following code. Is it necessary to test the logic of ORC itself? IMO, we don't have to test it. There have been some tests in the ORC project. @huaxingao

https://github.com/apache/orc/blob/a49f87d492aa62ec2be2d4bce2fcfe1f53ca05d9/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L890-L912

@a49a a49a requested review from kbendick and huaxingao and removed request for kbendick and huaxingao August 23, 2022 01:56
@huaxingao
Copy link
Contributor

LGTM. Thanks a lot @deadwind4 for working on this!
cc @kbendick @rdblue

Copy link
Contributor

@kbendick kbendick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM, outside of the two nits on the config values.

Thank you for working on this @deadwind4!

@a49a
Copy link
Contributor Author

a49a commented Aug 26, 2022

Thank a lot @huaxingao @kbendick for reviewing this.

@a49a a49a requested a review from kbendick August 29, 2022 01:52
@a49a
Copy link
Contributor Author

a49a commented Sep 26, 2022

@kbendick Could we merge it?

@pvary
Copy link
Contributor

pvary commented Sep 28, 2022

@deadwind4: I understand, that the ORC supports these filters and tests the functions, but I would still like to see some tests validating, that the filters are there. I could envision a situation where ORC changes the way to support bloom filters, and our tests fails to recognize this.
If, and only if, there is an easy way to check it, then it would be good to have a test here as well to validate that the filters are created.

Thanks, Peter

@github-actions github-actions bot added the build label Oct 19, 2022
@a49a
Copy link
Contributor Author

a49a commented Oct 19, 2022

@pvary Thank you for your feedback. I have added a test case to validate bloom filters in ORC files and push a new commit.
@kbendick @pvary Please review this. Thanks a lot.

@pvary
Copy link
Contributor

pvary commented Oct 20, 2022

@deadwind4: Could you please check the failed tests.

Thanks,
Peter

@a49a
Copy link
Contributor Author

a49a commented Oct 20, 2022

@pvary All checks have passed after I rebased the code and reran the CI.

@pvary pvary merged commit e456506 into apache:master Oct 20, 2022
@pvary
Copy link
Contributor

pvary commented Oct 20, 2022

Thanks @deadwind4 for the PR and @kbendick for the review!

@chenwyi2
Copy link

ORC_BLOOM_FILTER_COLUMNS this property will work on spark? when i set write.orc.bloom.filter.columns=xx and used spark to write data, i found that bloomfilter had no effect on querying.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants