Skip to content

Conversation

@benibus
Copy link
Contributor

@benibus benibus commented Oct 9, 2022

See: ARROW-17932.

Adds a json::StreamingReader class (modeled after csv::StreamingReader) with an async-reentrant interface and support for parallel block decoding.

Some parts of the existing TableReader implementation have been refactored to utilize the new facilities.

@github-actions
Copy link

github-actions bot commented Oct 9, 2022

@benibus
Copy link
Contributor Author

benibus commented Oct 10, 2022

A couple notes on this:

  • I decided to hold off on documenting threading/reentrancy characteristics until I could get a second opinion, as this was my first exposure to the project's async facilities. As it's written, the stated limitations of the CSV equivalent should still apply (no async-reentrancy, no multi-threading). However, here, it seems possible to launch the parsing/decoding tasks in parallel once a chunked block has been serially generated. There may be complexities I'm not considering though.
  • Test coverage handles the common cases, but likely isn't comprehensive enough - although some of it should be handled by the existing TableReader tests since it uses common code.

@pitrou
Copy link
Member

pitrou commented Oct 12, 2022

Some answers to the points above:

  • It seems like this may be async-reentrant (and theoretically could be, as you point out)

  • Tests should indeed be complemented with: 1) tests for errors 2) tests with multiple input blocks 3) stress tests, especially async-reentrant calls to ReadNextAsync (MakeReadaheadGenerator may help with that)

@benibus benibus marked this pull request as draft October 27, 2022 18:59
@benibus benibus force-pushed the ARROW-17932-json-rb-reader branch 2 times, most recently from c579b03 to cba7028 Compare November 15, 2022 01:30
@benibus benibus marked this pull request as ready for review November 16, 2022 00:45
Copy link
Member

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still learning the async utilities in Arrow, and this PR was a very helpful read. 😄 Had two questions on it.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was an interesting read! The general structure looks sound, here are some comments.

@pitrou
Copy link
Member

pitrou commented Nov 30, 2022

By the way, can you also:

  1. rebase/merge from master
  2. add documentation for the new class in https://github.com/apache/arrow/blob/master/docs/source/cpp/json.rst and https://github.com/apache/arrow/blob/master/docs/source/cpp/api/formats.rst#line-separated-json
    ?

@benibus benibus force-pushed the ARROW-17932-json-rb-reader branch 2 times, most recently from 418677b to 1c60039 Compare December 1, 2022 17:19
@benibus
Copy link
Contributor Author

benibus commented Dec 1, 2022

@pitrou I still have to do the docs, but feel free to bring up any further points on the code in the meantime.

@benibus
Copy link
Contributor Author

benibus commented Dec 5, 2022

I just opened a small PR that addresses the chunker issue: #14843. If that gets merged, I have a test I can add for the newlines_in_values = true case (which would crash otherwise). Then, this will be ready for re-review.

Copy link
Member

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding docs!

@benibus benibus force-pushed the ARROW-17932-json-rb-reader branch from 8ba2260 to 4c7a5a7 Compare December 5, 2022 20:23
@benibus benibus requested review from pitrou and removed request for bkietz December 5, 2022 21:33
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update!

@benibus
Copy link
Contributor Author

benibus commented Dec 7, 2022

@pitrou Just pushed a fix.

Also, not sure if my rationale on the std::remove table validation stuff from yesterday is sound - but if not, I can change that too.

@benibus benibus force-pushed the ARROW-17932-json-rb-reader branch from 18e5d3a to 71db481 Compare December 12, 2022 17:14
@benibus benibus requested a review from pitrou December 12, 2022 17:17
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks really good now. Great work @benibus !

@pitrou pitrou merged commit fe5c9fe into apache:master Dec 13, 2022
@ursabot
Copy link

ursabot commented Dec 14, 2022

Benchmark runs are scheduled for baseline = 375372b and contender = fe5c9fe. fe5c9fe is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.13% ⬆️0.23%] test-mac-arm
[Finished ⬇️1.36% ⬆️4.62%] ursa-i9-9960x
[Finished ⬇️14.03% ⬆️9.62%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] fe5c9fe3 ec2-t3-xlarge-us-east-2
[Finished] fe5c9fe3 test-mac-arm
[Finished] fe5c9fe3 ursa-i9-9960x
[Finished] fe5c9fe3 ursa-thinkcentre-m75q
[Finished] 375372b7 ec2-t3-xlarge-us-east-2
[Finished] 375372b7 test-mac-arm
[Finished] 375372b7 ursa-i9-9960x
[Finished] 375372b7 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented Dec 14, 2022

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants