Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pyarrow: Preserve RecordBatch's schema metadata #5355

Merged
merged 2 commits into from
Feb 4, 2024

Conversation

atwam
Copy link
Contributor

@atwam atwam commented Jan 31, 2024

PyArrow's RecordBatch gets imported by using a StructArray, then transforming it into a RecordBatch. This was losing the Schema's metadata in the process (because a StructArray does not hold metadata).

This commit changes the test to show the issue, and fixes it.

Which issue does this PR close?

Closes #5354.

What changes are included in this PR?

Changes pyarrow tests to ensure that schema equality also checks for metadata equality.
Fixes impl FromPyArrow for RecordBatch to ensure that the resulting RecordBatch has the right schema

Are there any user-facing changes?

No

PyArrow's RecordBatch gets imported by using a StructArray, then
transforming it into a RecordBatch. This was losing the Schema's
metadata in the process (because a StructArray does not hold
metadata).

This commit changes the test to show the issue, and fixes it.
@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 31, 2024
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. This seems plausible to me, but I'm not very familiar with this interface

Comment on lines 360 to 363
let schema = Arc::new(Schema::try_from(schema_ptr).map_err(to_py_err)?);
return RecordBatch::from(array)
.with_schema(schema)
.map_err(to_py_err);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of first inferring a schema from the record batch and then overwriting it, maybe it would be clearer to access the columns and then use RecordBatch::try_new?

let schema = Arc::new(Schema::try_from(schema_ptr).map_err(to_py_err)?);
let (_fields, columns, nulls) = array.into_parts();
// Assert null count is 0
return RecordBatch::try_new(schema, columns).map_err(to_py_err);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent point, I should have looked at the conversion code from StructArray. Fixed.

@atwam
Copy link
Contributor Author

atwam commented Jan 31, 2024

I have more changes pending that use schema.equals(other, check_metadata=True) for testing, but these fail because of apache/arrow#39865. Will wait for the cpp code to be fixed (see discussion in #5355 )

@alamb alamb merged commit f303c9e into apache:master Feb 4, 2024
22 checks passed
@alamb
Copy link
Contributor

alamb commented Feb 4, 2024

Thanks everyone

@pitrou
Copy link
Member

pitrou commented Feb 5, 2024

I have more changes pending that use schema.equals(other, check_metadata=True) for testing, but these fail because of apache/arrow#39865. Will wait for the cpp code to be fixed

FTR, the C++ side was fixed now.

@atwam
Copy link
Contributor Author

atwam commented Feb 5, 2024

I have more changes pending that use schema.equals(other, check_metadata=True) for testing, but these fail because of apache/arrow#39865. Will wait for the cpp code to be fixed

FTR, the C++ side was fixed now.

Great, will create a quick pr tomorrow with the better/fixed tests. We may as well use a proper equality test.
Thanks a lot for your help here!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RecordBatch conversion from pyarrow loses Schema's metadata
5 participants