Fix ArrowInvalid for large variables #7825
base: main
Conversation
sounds good! let me run the CI so we can see if it holds
src/datasets/arrow_writer.py
Outdated
allow_64bit = False
for field_type in self._schema.types:
    if pa.types.is_large_list(field_type):
        allow_64bit = True
        break
self.pa_writer = pa.RecordBatchStreamWriter(
    self.stream, self._schema, options=pa.ipc.IpcWriteOptions(allow_64bit=allow_64bit)
)
maybe we can always set allow_64bit=True?
done
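For clarity, after that change the writer construction presumably reduces to something like the following (a sketch, not the exact merged diff):

self.pa_writer = pa.RecordBatchStreamWriter(
    self.stream, self._schema, options=pa.ipc.IpcWriteOptions(allow_64bit=True)
)

This drops the schema scan entirely; allow_64bit only lifts the 32-bit field-length check in the IPC writer, so enabling it unconditionally should also be safe for small data.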
…asets into fix-value-too-large-7821
@lhoestq: from what I see in the CI logs, tests/test_arrow_dataset.py::test_map_int32_overflow is too memory-hungry for the CI runners. Shall we keep the test, remove it, or skip it? Or is there a specific pytest flag you have for slow/memory-consuming tests that I can add, so that it gets skipped automatically on the CI but can still be run locally?
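(For reference, one generic way to gate such a test is an environment-guarded skip; this is only a sketch, not necessarily the mechanism this repo actually uses:

import os
import pytest

# Hypothetical gate: run only when RUN_SLOW=1 is set in the environment,
# so CI skips the memory-hungry test but it can still be run locally.
@pytest.mark.skipif(
    os.environ.get("RUN_SLOW") != "1",
    reason="memory-consuming; set RUN_SLOW=1 to run locally",
)
def test_map_int32_overflow():
    ...
)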
Thanks for the review, updated so that we allow 64bit all the time
Fixes #7821
In addition to the solution proposed in the issue, I found that we also need to support 64-bit types when writing with the RecordBatchStreamWriter. Not sure if we want to create such large objects in the CI, but this is the only way to test that the issue is fixed, so I added a unit test.
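As a rough illustration of the writer-side part of the fix (a minimal sketch using only public pyarrow APIs; this is not the PR's actual test):

import pyarrow as pa

# A large_list column carries 64-bit offsets. Without allow_64bit=True,
# the IPC writer raises ArrowInvalid once field lengths exceed 32-bit limits.
schema = pa.schema([pa.field("tokens", pa.large_list(pa.int32()))])
arr = pa.array([[1, 2, 3], [4, 5]], type=pa.large_list(pa.int32()))
batch = pa.record_batch([arr], schema=schema)

sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(
    sink, schema, options=pa.ipc.IpcWriteOptions(allow_64bit=True)
)
writer.write_batch(batch)
writer.close()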