On write operation, cast data to Iceberg Table's pyarrow schema#523
Fokko merged 8 commits into apache:main
24e7da0 to 05e7444
Fokko left a comment
This looks good to me, I left one comment but we can also defer that to a later PR.
pyiceberg/table/__init__.py
Outdated
- _check_schema(self.schema(), other_schema=df.schema)
+ _check_schema_compatible(self.schema(), other_schema=df.schema)
+ # the two schemas are compatible so safe to cast
+ df = df.cast(self.schema().as_arrow())
Should _check_schema_compatible return a bool to indicate if the cast is needed? I'm not sure how costly the cast is. If we go from string to large_string then we might rewrite the Arrow buffers.
yea, I like that idea.
_check_schema_compatible returns a boolean should_cast.
- If schema is exactly the same, return False and skip cast
- If schema is "compatible", return True and cast
- If schema is not "compatible", throws an error
It was too complicated when _check_schema_compatible both returned a boolean and threw an error.
I ended up doing an extra comparison of the Arrow schemas outside the function and casting only if necessary.
Fokko left a comment
Thanks for working on this @kevinjqliu
- _check_schema(self.schema(), other_schema=df.schema)
+ _check_schema_compatible(self.schema(), other_schema=df.schema)
+ # cast if the two schemas are compatible but not equal
+ if self.schema().as_arrow() != df.schema:
nit: It would be good to call as_arrow() just once in case we need to cast.
Backport to 0.6.1
This PR resolves #520 by casting the incoming pyarrow Table to the same Schema as the Iceberg Table. This is safe since we first check for schema compatibility.
- _check_schema -> _check_schema_compatible
- Schema.as_arrow()