Skip to content

Conversation

@akshaysu12
Copy link

@akshaysu12 akshaysu12 commented Jan 18, 2023

What changes are included in this PR?

This PR adds a new python function open_json() that allows for opening a streaming reader to a json file. Arguments for open_json() are the same as for read_json().

open_json() is the python binding for the streaming json reader implemented in this PR: #14355

@github-actions
Copy link

@github-actions
Copy link

⚠️ GitHub issue #14932 has been automatically assigned in GitHub to PR creator.

@pitrou
Copy link
Member

pitrou commented Jan 18, 2023

@AlenkaF @jorisvandenbossche would one of you like to help @akshaysu12 on this?

@AlenkaF
Copy link
Member

AlenkaF commented Jan 18, 2023

@AlenkaF @jorisvandenbossche would one of you like to help @akshaysu12 on this?

Sure, I would love to help! Will look at the issue and the current state of the PR tomorrow and hope I get through it.

@AlenkaF
Copy link
Member

AlenkaF commented Jan 19, 2023

Great work so far @akshaysu12 !
Looking at the code from the binding for CSVStreamingReader and the work done on the C++ side for the JSON stream reader I would say this PR is already in a good shape.

As you mentioned, you should add the documentation to the https://arrow.apache.org/docs/dev/python/api/formats.html (docs/source/python/api/formats.rst)and https://arrow.apache.org/docs/dev/python/json.html (docs/source/python/json.rst).

To build the documentation locally see: https://arrow.apache.org/docs/dev/developers/documentation.html

You should also correct the linter error:

--- original//arrow/python/pyarrow/tests/test_json.py
+++ fixed//arrow/python/pyarrow/tests/test_json.py
@@ -106,6 +106,7 @@
     check_options_class_pickling(cls, explicit_schema=schema,
                                  newlines_in_values=False,
                                  unexpected_field_behavior="ignore")
+
 
 class BaseTestJSON(abc.ABC):
     @abc.abstractmethod
@@ -296,11 +297,12 @@
                         # Better error output
                         assert table.to_pydict() == expected.to_pydict()
 
+
 class BaseTestJSONRead(BaseTestJSON):
 
     def read_bytes(self, b, **kwargs):
         return self.read_json(pa.py_buffer(b), **kwargs)
-    
+
     def test_file_object(self):
         data = b'{"a": 1, "b": 2}\n'
         expected_data = {'a': [1], 'b': [2]}
@@ -311,7 +313,7 @@
         sio = io.StringIO(data.decode())
         with pytest.raises(TypeError):
             self.read_json(sio)
-    
+
     def test_reconcile_accross_blocks(self):
         # ARROW-12065: reconciling inferred types across blocks
         first_row = b'{                               }\n'

To run the linter locally see: https://arrow.apache.org/docs/dev/developers/python.html#coding-style

@kou kou changed the title GH-14932: Add python bindings for JSON streaming reader GH-14932: [Python] Add python bindings for JSON streaming reader Jan 22, 2023
@akshaysu12
Copy link
Author

@AlenkaF thanks for the feedback! Sorry I've been sitting on this a bit because I'm not sure what to do about the read_options.use_threads. It looks like the one major difference from the CSV streaming reader.

#14355 indicates multiple threads can be used but there's no explicit documentation at docs/source/cpp/json.rst. The tests I've written in test_json.py pass if use_threads is False or True. It looks like cpp/src/arrow/json/reader_test.cc has more specific tests for the AsyncStreamingReader.

  1. Do I need to write additional tests more specific to read_options.use_threads = True?
  2. Do I need to include documentation on read_options.use_threads?

@AlenkaF
Copy link
Member

AlenkaF commented Jan 26, 2023

#14355 indicates multiple threads can be used but there's no explicit documentation at docs/source/cpp/json.rst.

I think the reference docs have all the info:
https://arrow.apache.org/docs/dev/cpp/api/formats.html#_CPPv4N5arrow4json11ReadOptionsE
https://arrow.apache.org/docs/dev/cpp/api/formats.html#_CPPv4N5arrow4json15StreamingReaderE

It makes sense for the CSV stream reader to have notes about only single-threaded reader and that there are no additional notes on use_threads in Reading JSON files as ReadOptions are linked from there.

The tests I've written in test_json.py pass if use_threads is False or True. It looks like cpp/src/arrow/json/reader_test.cc has more specific tests for the AsyncStreamingReader.

Oh yes, That is true.

  1. Do I need to write additional tests more specific to read_options.use_threads = True?

I think you can add something similar to TestSerialJSONRead and TestParallelJSONRead. @jorisvandenbossche what do you think?

  1. Do I need to include documentation on read_options.use_threads?

I think it is well documented in pa.json.ReadOptions. But you can definitely mention the possibility of using multiple threads in JSON incremental reading in the general docs.

@pan-x-c
Copy link
Contributor

pan-x-c commented Dec 18, 2024

Hi, I'm currently encountering issues that could be resolved by the changes proposed in this PR.
The current PR has not progressed for a long time. Is there a plan to continue moving forward? I can help if needed.

pitrou pushed a commit that referenced this pull request Feb 6, 2025
)

### Rationale for this change

The C++ arrow has a JSON streaming reader which is not exposed on the Python interface.

### What changes are included in this PR?

This PR is based on #33761. It adds the `open_json` method to open a streaming reader for a JSON file.

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes. A new `open_json` method has been added to the Python interface, located at `pyarrow.json.open_json`, and its parameters are the same as the `pyarrow.json.read_json`

* GitHub Issue: #14932

Lead-authored-by: pxc <panxuchen.pxc@alibaba-inc.com>
Co-authored-by: Akshay Subramanian <asubramanian@grailbio.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@AlenkaF
Copy link
Member

AlenkaF commented Feb 21, 2025

I will be closing this PR as the issue was already closed by the following: #45084.

@AlenkaF AlenkaF closed this Feb 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Python] Expose streaming JSON reader

5 participants