GH-14932: [Python] Add python bindings for JSON streaming reader #33761
@AlenkaF @jorisvandenbossche would one of you like to help @akshaysu12 on this? |
Sure, I would love to help! Will look at the issue and the current state of the PR tomorrow and hope I get through it. |
|
Great work so far @akshaysu12! As you mentioned, you should add the documentation to https://arrow.apache.org/docs/dev/python/api/formats.html (docs/source/python/api/formats.rst) and https://arrow.apache.org/docs/dev/python/json.html (docs/source/python/json.rst). To build the documentation locally, see: https://arrow.apache.org/docs/dev/developers/documentation.html
You should also correct the linter error:
--- original//arrow/python/pyarrow/tests/test_json.py
+++ fixed//arrow/python/pyarrow/tests/test_json.py
@@ -106,6 +106,7 @@
check_options_class_pickling(cls, explicit_schema=schema,
newlines_in_values=False,
unexpected_field_behavior="ignore")
+
class BaseTestJSON(abc.ABC):
@abc.abstractmethod
@@ -296,11 +297,12 @@
# Better error output
assert table.to_pydict() == expected.to_pydict()
+
class BaseTestJSONRead(BaseTestJSON):
def read_bytes(self, b, **kwargs):
return self.read_json(pa.py_buffer(b), **kwargs)
-
+
def test_file_object(self):
data = b'{"a": 1, "b": 2}\n'
expected_data = {'a': [1], 'b': [2]}
@@ -311,7 +313,7 @@
sio = io.StringIO(data.decode())
with pytest.raises(TypeError):
self.read_json(sio)
-
+
def test_reconcile_accross_blocks(self):
# ARROW-12065: reconciling inferred types across blocks
first_row = b'{ }\n'
To run the linter locally, see: https://arrow.apache.org/docs/dev/developers/python.html#coding-style |
|
@AlenkaF thanks for the feedback! Sorry I've been sitting on this a bit because I'm not sure what to do about the read_options.use_threads. It looks like the one major difference from the CSV streaming reader. #14355 indicates multiple threads can be used but there's no explicit documentation at
|
I think the reference docs have all the info: It makes sense for the CSV stream reader to have notes about only single-threaded reader and that there are no additional notes on
Oh yes, that is true.
I think you can add something similar to
I think it is well documented in pa.json.ReadOptions. But you can definitely mention the possibility of using multiple threads in JSON incremental reading in the general docs. |
|
Hi, I'm currently encountering issues that could be resolved by the changes proposed in this PR. |
### Rationale for this change
The C++ arrow has a JSON streaming reader which is not exposed on the Python interface.
### What changes are included in this PR?
This PR is based on #33761. It adds the `open_json` method to open a streaming reader for a JSON file.
### Are these changes tested?
Yes
### Are there any user-facing changes?
Yes. A new `open_json` method has been added to the Python interface, located at `pyarrow.json.open_json`, and its parameters are the same as the `pyarrow.json.read_json`
* GitHub Issue: #14932
Lead-authored-by: pxc <panxuchen.pxc@alibaba-inc.com>
Co-authored-by: Akshay Subramanian <asubramanian@grailbio.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
|
I will be closing this PR as the issue was already closed by the following: #45084. |
What changes are included in this PR?
This PR adds a new Python function, open_json(), that opens a streaming reader for a JSON file. The arguments for open_json() are the same as for read_json().
open_json() is the Python binding for the streaming JSON reader implemented in #14355.