Skip to content

Conversation

@Dandandan
Copy link
Contributor

@Dandandan Dandandan commented Feb 19, 2022

Which issue does this PR close?

Closes #1338

Rationale for this change

Projection can avoid and loading it into arrays (this PR), and also could avoid reading it in the first place (not yet implemented).

What changes are included in this PR?

  • Changing the signature to try_new(reader: R, projection: Option<Vec<usize>>)
  • Change read_record_batch to avoid creating arrays for columns in the projection.

We do not yet skip reading the data in the first place.

Are there any user-facing changes?

Yes - adding a second parameter to FileReader::new and StreamReader::new

@github-actions github-actions bot added the arrow Changes to the arrow crate label Feb 19, 2022
@github-actions github-actions bot added the arrow-flight Changes to the arrow-flight crate label Feb 19, 2022
@Dandandan Dandandan marked this pull request as draft February 19, 2022 14:56
@Dandandan Dandandan added the api-change Changes to the arrow API label Feb 19, 2022
@codecov-commenter
Copy link

codecov-commenter commented Feb 19, 2022

Codecov Report

Merging #1339 (0341ad8) into master (f4c7102) will decrease coverage by 0.03%.
The diff coverage is 87.65%.

❗ Current head 0341ad8 differs from pull request most recent head b94ef98. Consider uploading reports for the commit b94ef98 to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1339      +/-   ##
==========================================
- Coverage   83.03%   82.99%   -0.04%     
==========================================
  Files         181      181              
  Lines       52949    52985      +36     
==========================================
+ Hits        43965    43975      +10     
- Misses       8984     9010      +26     
Impacted Files Coverage Δ
arrow-flight/src/utils.rs 0.00% <0.00%> (ø)
arrow/src/array/data.rs 83.15% <ø> (-0.15%) ⬇️
arrow/src/array/mod.rs 100.00% <ø> (ø)
arrow/src/array/transform/mod.rs 84.65% <ø> (+0.26%) ⬆️
arrow/src/csv/writer.rs 71.32% <0.00%> (-0.82%) ⬇️
...ntegration-testing/src/bin/arrow-file-to-stream.rs 0.00% <0.00%> (ø)
...ion-testing/src/bin/arrow-json-integration-test.rs 0.00% <0.00%> (ø)
...ntegration-testing/src/bin/arrow-stream-to-file.rs 0.00% <0.00%> (ø)
...ng/src/flight_server_scenarios/integration_test.rs 0.00% <0.00%> (ø)
parquet/src/arrow/array_reader.rs 78.27% <0.00%> (ø)
... and 36 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f4c7102...b94ef98. Read the comment docs.

@Dandandan Dandandan marked this pull request as ready for review February 19, 2022 17:39
@Dandandan Dandandan requested review from alamb and nevi-me February 19, 2022 19:18
@Dandandan
Copy link
Contributor Author

Any thoughts @alamb @nevi-me ?

let projection = projection.map(|projection| {
let fields = projection
.iter()
.map(|x| schema.fields[*x].clone())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it fine if this panics if x > fields.len()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I think here needs some extra check on the projection values to avoid panics.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could reuse Schema::project: https://docs.rs/arrow/9.1.0/arrow/datatypes/struct.Schema.html#method.project

(which also handles metadata correctly)

let projection = projection.map(|projection| {
let fields = projection
.iter()
.map(|x| schema.fields[*x].clone())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could reuse Schema::project: https://docs.rs/arrow/9.1.0/arrow/datatypes/struct.Schema.html#method.project

(which also handles metadata correctly)

/// This value is set to `true` the first time the reader's `next()` returns `None`.
finished: bool,

/// Optional projection
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Optional projection
/// Optional projection and projected schema

// Create an array of optional dictionary value arrays, one per field.
let dictionaries_by_field = vec![None; schema.fields().len()];

let projection = projection.map(|projection| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here -- Schema::projection might make this code easier to read

self.reader.read_exact(&mut buf)?;

read_record_batch(&buf, batch, self.schema(), &self.dictionaries_by_field).map(Some)
read_record_batch(&buf, batch, self.schema(), &self.dictionaries_by_field, self.projection.as_ref().map(|x| x.0.as_ref())).map(Some)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it is worth adding something like

impl<R: Read> StreamReader<R> {
...
  /// get projected schema, if any
  pub fn projected_schema(&self) -> Option<&Schema> {
    ...
  }

@alamb
Copy link
Contributor

alamb commented Feb 28, 2022

FWIW I think better error handling could be added as a follow on PR too

Nice work @Dandandan

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@alamb
Copy link
Contributor

alamb commented Mar 6, 2022

I am sorry this one missed the cutoff for arrow 10.0.0; Perhaps we should merge it in and do the cleanups as a follow on PR

Copy link
Member

@jackwener jackwener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job !
use project() and handle error is ok()

@Dandandan Dandandan merged commit 4bcc7a6 into apache:master Mar 9, 2022
@alamb alamb changed the title Implement projection for arrow file / streams Implement projection for arrow IPC Reader file / streams Mar 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-change Changes to the arrow API arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Arrow IPC projection support

5 participants