Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Add a method to directly create an arrow::RecordBatch #132

Closed
alamb opened this issue Feb 5, 2024 · 7 comments
Closed

Proposal: Add a method to directly create an arrow::RecordBatch #132

alamb opened this issue Feb 5, 2024 · 7 comments

Comments

@alamb
Copy link
Contributor

alamb commented Feb 5, 2024

Again, thank you for this library.

arrow makes heavy use of RecordBatch (while arrow2 removed the equivalent concept and deals in Vecs of Arrays).

It is possible to make RecordBatch from the APIs in this crate, as I demonstrate in #131, I think it would a nicer API if this crate offered a convenience method for doing so directly

Here is what you have to do today (copied from #131)

// Determine Arrow schema
let fields =
  SerdeArrowSchema::from_type::<Record>(TracingOptions::default())?
  .to_arrow_fields()

// Convert to Arrow arrays
let arrays = serde_arrow::to_arrow(&fields, &records)?;

// Form a RecordBatch
let schema = Schema::new(&fields);
let batch = RecordBatch::try_new(schema.into(), arrays)?;

Proposal

I propose adding a to_batch method that coverts directly to arrow RecordBatch:

pub fn to_batch<T: Serialize + ?Sized>(fields: &[Field], items: &T) -> Result<RecordBatch> {
...
}

Example use:

// Determine Arrow schema
let fields =
  SerdeArrowSchema::from_type::<Record>(TracingOptions::default())?
  .to_arrow_fields()

// Convert to RecordBatch
let batch = serde_arrow::to_batch(&fields, &records)?;
@alamb
Copy link
Contributor Author

alamb commented Feb 5, 2024

If this is acceptable to you @chmp I would be happy to create a PR

@chmp
Copy link
Owner

chmp commented Feb 5, 2024

Hey @alamb.

Thanks for the offer, in principle I would be open to an PR, but only reluctantly.

To explain my reservations: due to the constant API churn I consider a larger exposure to the arrow API a risk. Initially serde_arrow was built around the arrow API, but after multiple updates breaking my code (within a six week timeframe), I took the decision to minimize the arrow API used. That's the reason why all builders are implemented from scratch. And that's the reason why I removed the RecordBatch based API in serde_arrow some time ago.

The regular major version release is also the reason why this whole non-additive feature issue exists (at least I didn't think of a solution yet that does not involve releasing a breaking change every 6 weeks myself).

@alamb
Copy link
Contributor Author

alamb commented Feb 5, 2024

No worries -- your explantion makes sense. I think having an example (e.g. #131) would provide 90% of the value. Maybe over time depending on how peple use the crate it would make sense to offer additional APIs, but I can see the value in keeping things simple

@chmp
Copy link
Owner

chmp commented Feb 5, 2024

Btw. the to_arrow function already includes the corresponding docs. But better to include a more prominent example.

@chmp
Copy link
Owner

chmp commented Feb 29, 2024

See also #137

@chmp
Copy link
Owner

chmp commented Mar 28, 2024

Added a to_record_batch on main

@chmp chmp closed this as completed Mar 28, 2024
@chmp
Copy link
Owner

chmp commented Apr 3, 2024

Released as serde_arrow=0.11.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants