-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-37002: [C++][Parquet] Add api to get RecordReader from RowGroupReader #37003
Conversation
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
or
In the case of PARQUET issues on JIRA the title also supports:
See also: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm generally ok with this patch. Would you mind add GH-37002
as a prefix?
cpp/src/parquet/file_reader.h
Outdated
@@ -58,6 +59,11 @@ class PARQUET_EXPORT RowGroupReader { | |||
// column. Ownership is shared with the RowGroupReader. | |||
std::shared_ptr<ColumnReader> Column(int i); | |||
|
|||
// Construct a RecordReader for the indicated row group-relative column i. | |||
// Ownership is shared with the RowGroupReader. | |||
std::shared_ptr<internal::RecordReader> RecordReader( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Emm I'm ok with the logic here, but is it ok to expose internal::
logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I doubt tests checking exposure of internal namespace will complain about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably should remove the RecordReader from the internal namespace at some point. I am not sure how that is decided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with moving it out of the internal namespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. I think it is better to remove the internal namespace in a separate pull request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to check this in first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, I'd prefer to move RecordReader out of internal namespace first. Because we can know the consequence of namespace change before exposing it.
|
cpp/src/parquet/file_reader.h
Outdated
@@ -58,6 +59,11 @@ class PARQUET_EXPORT RowGroupReader { | |||
// column. Ownership is shared with the RowGroupReader. | |||
std::shared_ptr<ColumnReader> Column(int i); | |||
|
|||
// Construct a RecordReader for the indicated row group-relative column i. | |||
// Ownership is shared with the RowGroupReader. | |||
std::shared_ptr<internal::RecordReader> RecordReader( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, I'd prefer to move RecordReader out of internal namespace first. Because we can know the consequence of namespace change before exposing it.
I've check the codebase and find somewhere export the internal in |
ping @fatemehp , would we move forward? |
I am sorry I got distracted for a while. I will make the changes requested and update here. @mapleFU are you suggesting that it is not necessary to remove the internal namespace? |
the underlying record reader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally LGTM. We're at national holiday so might reply later.
Also cc @pitrou for review
@wgtmac I tried to address the comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! LGTM.
The CI failures should be unrelated. I have retriggered failed ones.
Let's wait a while for opinions from @pitrou @emkornfield |
@wgtmac I'll let you handle this if you don't mind. |
No problem! @pitrou |
I think if we are going to be exposing this publicly we should remove the internal namespace from it. This can likely be done as a follow-up. Otherwise seems good to me. |
Thanks @emkornfield! Then I will merge it as is. Please remove the internal namespace in a follow-up change if you have time. Thanks! @fatemehp |
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit d3a576d. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…oupReader (apache#37003) Currently we only can get a ColumnReader for a column from RowGroupReader. We need an API to return a RecordReader for the column. Moved ComputeLevelInfo from column_writer to level_conversion so that it can be shared between column_writer and file_reader. * Closes: apache#37002 Lead-authored-by: Fatemah Panahi <panahi@google.com> Co-authored-by: Fatemah Panahi <fatemehp@users.noreply.github.com> Signed-off-by: Gang Wu <ustcwg@gmail.com>
…oupReader (apache#37003) Currently we only can get a ColumnReader for a column from RowGroupReader. We need an API to return a RecordReader for the column. Moved ComputeLevelInfo from column_writer to level_conversion so that it can be shared between column_writer and file_reader. * Closes: apache#37002 Lead-authored-by: Fatemah Panahi <panahi@google.com> Co-authored-by: Fatemah Panahi <fatemehp@users.noreply.github.com> Signed-off-by: Gang Wu <ustcwg@gmail.com>
…oupReader (apache#37003) Currently we only can get a ColumnReader for a column from RowGroupReader. We need an API to return a RecordReader for the column. Moved ComputeLevelInfo from column_writer to level_conversion so that it can be shared between column_writer and file_reader. * Closes: apache#37002 Lead-authored-by: Fatemah Panahi <panahi@google.com> Co-authored-by: Fatemah Panahi <fatemehp@users.noreply.github.com> Signed-off-by: Gang Wu <ustcwg@gmail.com>
Currently we only can get a ColumnReader for a column from RowGroupReader. We need an API to return a RecordReader for the column.
Moved ComputeLevelInfo from column_writer to level_conversion so that it can be shared between column_writer and file_reader.