Conversation

@CloseChoice (Contributor)

Fixes #7821

In addition to the solution proposed in the issue, I found that we also need to support 64-bit types when writing with the RecordBatchStreamWriter.

Not sure if we want to create such large objects in the CI, but this is the only way to test that the issue is fixed, so I added the unit test anyway.
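For illustration, here is a rough sketch of the kind of repro such a test needs, assuming the datasets API; the row count, chunk size, dtype, and assertion are illustrative choices, not the PR's actual test code. The idea is a list column whose flattened length exceeds the 2**31 - 1 limit of 32-bit Arrow offsets:

import numpy as np
from datasets import Dataset

def test_map_int32_overflow():
    # 17 * 2**27 = 2,281,701,376 total list elements, just past the
    # 2**31 - 1 limit of 32-bit Arrow offsets (~2.1 GB of int8 data)
    n_rows, chunk = 17, 2**27
    ds = Dataset.from_dict({"id": list(range(n_rows))})
    ds = ds.map(lambda _: {"array": np.zeros(chunk, dtype=np.int8)})
    assert len(ds) == n_rows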

@lhoestq (Member) left a comment

Sounds good! Let me run the CI so we can see if it holds.

Comment on lines 527 to 534
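# Enable 64-bit offsets only if the schema contains a large_list field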
allow_64bit = False
for field_type in self._schema.types:
if pa.types.is_large_list(field_type):
allow_64bit = True
break
self.pa_writer = pa.RecordBatchStreamWriter(
self.stream, self._schema, options=pa.ipc.IpcWriteOptions(allow_64bit=allow_64bit)
)
@lhoestq (Member)

maybe we can always set allow_64bit=True?

@CloseChoice (Contributor Author)

done
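For reference, a minimal sketch of the simplified call after this change (self.stream and self._schema come from the quoted diff; pa.ipc.IpcWriteOptions is pyarrow's IPC options class):

self.pa_writer = pa.RecordBatchStreamWriter(
    self.stream, self._schema, options=pa.ipc.IpcWriteOptions(allow_64bit=True)
)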

@CloseChoice (Contributor Author) commented Oct 27, 2025

@lhoestq: from what I see in the CI logs:

tests/test_arrow_dataset.py::test_map_int32_overflow
  /opt/hostedtoolcache/Python/3.9.23/x64/lib/python3.9/site-packages/multiprocess/resource_tracker.py:219: UserWarning: resource_tracker: There appear to be 24 leaked shared_memory objects to clean up at shutdown
    warnings.warn('resource_tracker: There appear to be %d ')

the test probably crashes because it consumes too much memory; it works locally for me, though.

Shall we keep the test, remove it, or skip it? Or is there a specific pytest flag for slow/memory-consuming tests that I can add, so that it gets skipped automatically on the CI but can still be run locally?
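One generic pattern for this, as a sketch and not this repo's actual configuration, is a skipif guard on the CI environment variable (GitHub Actions sets CI=true):

import os
import pytest

# Skipped on CI but still runnable locally; the memory figure is illustrative.
@pytest.mark.skipif(os.environ.get("CI") == "true", reason="allocates >2 GB of memory")
def test_map_int32_overflow():
    ...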

@CloseChoice (Contributor Author) left a comment

Thanks for the review, updated so that we allow 64-bit all the time.



Development

Successfully merging this pull request may close #7821: Building a dataset with large variable size arrays results in error ArrowInvalid: Value X too large to fit in C integer type
