a10y commented Aug 7, 2025

Fixes: #4160

Extend PyVortex write API to accept PyArrow Tables and RecordBatchReaders
directly, enabling streaming writes without loading entire datasets into memory.

Key changes:

  • Support PyArrow Table and RecordBatchReader objects via ArrowArrayStreamReader
  • Stream Arrow RecordBatches directly to Vortex ArrayIterator

claude bot and others added 2 commits August 7, 2025 20:46
Extend PyVortex write API to accept PyArrow Tables and RecordBatchReaders
directly, enabling streaming writes without loading entire datasets into memory.

Key changes:
- Add Arrow FFI stream conversion in PyIntoArrayIterator::extract_bound()
- Support PyArrow Table and RecordBatchReader objects via ArrowArrayStreamReader
- Stream Arrow RecordBatches directly to Vortex ArrayIterator
- Update documentation with streaming examples
- Add test script demonstrating the new functionality

This resolves the issue where users needed entire datasets in memory
to write to Vortex files, enabling efficient ETL pipelines for 10B+ records.

Fixes: #4160

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Nicholas Gates <gatesn@users.noreply.github.com>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>

coveralls commented Aug 7, 2025

Coverage Status

coverage: 84.935% (+0.06%) from 84.877%
when pulling 04e9b0a on claude/issue-4160-20250807-2038
into ac773cb on develop.

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
a10y added the feature label (release label indicating a new feature or request) Aug 8, 2025
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
a10y requested a review from gatesn August 8, 2025 14:17
a10y enabled auto-merge (squash) August 8, 2025 14:17
a10y changed the title from "feat(python): Add Arrow FFI streaming support to write API" to "feat(python): Write PyArrow types directly to Vortex" Aug 8, 2025
a10y merged commit f4d03a1 into develop Aug 8, 2025
37 checks passed
a10y deleted the claude/issue-4160-20250807-2038 branch August 8, 2025 14:30

kylebarron commented Aug 8, 2025

👋 Just saw this and have a small suggestion: the current implementation imports pyarrow explicitly and thus will only work if pyarrow itself is installed in the user's environment. But since Arrow is a stable format, you really don't need pyarrow; this is why the Arrow PyCapsule Interface was created. For any Python object that exposes an __arrow_c_stream__ method, you can call that method to get a PyCapsule with an Arrow C Stream pointer.

There are a bunch of libraries that already support this, so if you implemented support for the PyCapsule interface then you could take in data from Polars/DuckDB/DataFusion etc even if pyarrow isn't installed in the environment.

You can use pyo3-arrow (disclaimer: my library) which has some benefits over the arrow-rs Python integration. Or, I think the arrow-rs integration should use it automatically, and you shouldn't need to import pyarrow explicitly.

(I could make an issue to discuss this if you prefer)
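
To illustrate the duck-typing idea in the comment above, here is a rough sketch of how a caller might detect the Arrow PyCapsule Interface without importing pyarrow. The helper names `supports_arrow_stream` and `export_arrow_stream` are hypothetical and are not part of PyVortex, pyo3-arrow, or arrow-rs; only the `__arrow_c_stream__` method name comes from the interface itself.

```python
from typing import Any


def supports_arrow_stream(obj: Any) -> bool:
    """Return True if obj advertises the Arrow PyCapsule stream protocol."""
    return callable(getattr(obj, "__arrow_c_stream__", None))


def export_arrow_stream(obj: Any) -> Any:
    """Call __arrow_c_stream__ on obj, raising TypeError if it is missing.

    For a real producer the result is a PyCapsule wrapping an Arrow C Stream
    pointer, which a native extension can then import and consume.
    """
    if not supports_arrow_stream(obj):
        raise TypeError(
            f"{type(obj).__name__} does not implement __arrow_c_stream__"
        )
    return obj.__arrow_c_stream__()
```

Because the check relies only on the dunder method, the same code path would accept any producer that implements the interface, regardless of which library created the object.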

Labels

feature (release label indicating a new feature or request)

Development

Successfully merging this pull request may close these issues.

How do I append to a vortex file from python using io.write?

5 participants