-
Notifications
You must be signed in to change notification settings - Fork 796
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pyarrow: Preserve RecordBatch's schema metadata #5355
Conversation
PyArrow's RecordBatch gets imported by using a StructArray, then transforming it into a RecordBatch. This was losing the Schema's metadata in the process (because a StructArray does not hold metadata). This commit changes the test to show the issue, and fixes it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. This seems plausible to me, but I'm not very familiar with this interface
arrow/src/pyarrow.rs
Outdated
let schema = Arc::new(Schema::try_from(schema_ptr).map_err(to_py_err)?); | ||
return RecordBatch::from(array) | ||
.with_schema(schema) | ||
.map_err(to_py_err); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of first inferring a schema from the record batch and then overwriting it, maybe it would be clearer to access the columns and then use RecordBatch::try_new
?
let schema = Arc::new(Schema::try_from(schema_ptr).map_err(to_py_err)?);
let (_fields, columns, nulls) = array.into_parts();
// Assert null count is 0
return RecordBatch::try_new(schema, columns).map_err(to_py_err);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent point, I should have looked at the conversion code from StructArray
. Fixed.
I have more changes pending that use |
Thanks everyone |
FTR, the C++ side was fixed now. |
Great, will create a quick pr tomorrow with the better/fixed tests. We may as well use a proper equality test. |
PyArrow's RecordBatch gets imported by using a StructArray, then transforming it into a RecordBatch. This was losing the Schema's metadata in the process (because a StructArray does not hold metadata).
This commit changes the test to show the issue, and fixes it.
Which issue does this PR close?
Closes #5354.
What changes are included in this PR?
Changes pyarrow tests to ensure that schema equality also checks for metadata equality.
Fixes
impl FromPyArrow for RecordBatch
to ensure that the resulting RecordBatch has the right schemaAre there any user-facing changes?
No