fix: add metadata_properties to _construct_parameters when update hive table #2013
fa27421
to
df294d1
Compare
@kadai0308 thanks for working on this. Would it be possible to add a test? We have a Hive container running that we use for tests. This way we don't break it in the future.
Maybe we can add a test somewhere here:
def test_table_properties(catalog: Catalog) -> None:
The session_catalog_hive is a HiveCatalog.
Sure.
df294d1
to
ff149e8
Compare
@Fokko Can you help me review the PR? Thank you.
Generally LGTM! I added a few comments. Thanks for working on this!
properties = {PROP_EXTERNAL: "TRUE", PROP_TABLE_TYPE: "ICEBERG", PROP_METADATA_LOCATION: metadata_location}
if previous_metadata_location:
    properties[PROP_PREVIOUS_METADATA_LOCATION] = previous_metadata_location

if metadata_properties:
    for key, value in metadata_properties.items():
        if key not in properties:
👍 this is fine, it helps with not re-setting PROP_EXTERNAL, PROP_TABLE_TYPE, PROP_METADATA_LOCATION, and PROP_PREVIOUS_METADATA_LOCATION
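The merge rule discussed in this thread can be sketched as a small standalone function. This is only an illustration, not the actual pyiceberg code: the function name construct_parameters, the string values of the PROP_* constants, and the body of the innermost `if` (which the diff excerpt does not show) are assumptions made for the example. The part taken from the diff is the rule that reserved keys are set first and are never overwritten by user-supplied metadata properties.

```python
from typing import Dict, Optional

# Assumed values for this sketch; the real constants are defined in pyiceberg.
PROP_EXTERNAL = "EXTERNAL"
PROP_TABLE_TYPE = "table_type"
PROP_METADATA_LOCATION = "metadata_location"
PROP_PREVIOUS_METADATA_LOCATION = "previous_metadata_location"


def construct_parameters(
    metadata_location: str,
    previous_metadata_location: Optional[str] = None,
    metadata_properties: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Build the HMS table parameters, keeping reserved keys authoritative."""
    properties = {
        PROP_EXTERNAL: "TRUE",
        PROP_TABLE_TYPE: "ICEBERG",
        PROP_METADATA_LOCATION: metadata_location,
    }
    if previous_metadata_location:
        properties[PROP_PREVIOUS_METADATA_LOCATION] = previous_metadata_location

    if metadata_properties:
        for key, value in metadata_properties.items():
            # Skip keys that are already set so reserved entries are not re-set.
            if key not in properties:
                # Assumed body: copy the user property as a string.
                properties[key] = str(value)
    return properties


# A user property that collides with a reserved key is ignored.
params = construct_parameters(
    "s3://bucket/v2.metadata.json",
    "s3://bucket/v1.metadata.json",
    {"p1": "123", "table_type": "HIVE"},
)
```

In this sketch, "p1" is carried over into the parameters, while the colliding "table_type" entry keeps its reserved value.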
tests/integration/test_reads.py
Outdated
@@ -111,6 +112,23 @@ def test_table_properties(catalog: Catalog) -> None:
        table.transaction().set_properties(property_name=None).commit_transaction()
    assert "None type is not a supported value in properties: property_name" in str(exc_info.value)

    if isinstance(catalog, HiveCatalog):
This is a great test! Could you move it into its own test function, with just the Hive catalog?
@pytest.mark.integration
@pytest.mark.parametrize("catalog", [pytest.lazy_fixture("session_catalog_hive")])
tests/integration/test_reads.py
Outdated
@@ -111,6 +112,23 @@ def test_table_properties(catalog: Catalog) -> None:
        table.transaction().set_properties(property_name=None).commit_transaction()
I was wondering why the rest of these tests pass, since we're not setting the properties in the HMS. It turns out the table properties are saved in the table metadata, in its properties field.
This is not what the table metadata's properties field should be used for:
properties A string to string map of table properties. This is used to control settings that affect reading and writing and is not intended to be used for arbitrary metadata. For example, commit.retry.num-retries is used to control the number of commit retries.
This is a side effect of
iceberg-python/pyiceberg/table/__init__.py
Lines 1131 to 1134 in a67c559
    @property
    def properties(self) -> Dict[str, str]:
        """Properties of the table."""
        return self.metadata.properties
iceberg-python/pyiceberg/catalog/hive.py
Lines 338 to 344 in a67c559
    return Table(
        identifier=(table.dbName, table.tableName),
        metadata=metadata,
        metadata_location=metadata_location,
        io=self._load_file_io(metadata.properties, metadata_location),
        catalog=self,
    )
We should fix this behavior and read/write properties using the HMS's table parameters. We can fix this separately from the current issue.
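To make the observation above concrete, here is a toy model of the two places a property can live. The classes FakeMetadata, FakeHmsTable, and FakeTable are hypothetical stand-ins invented for this sketch and are not pyiceberg or Hive APIs. The only behavior borrowed from the snippets above is that the properties accessor returns metadata.properties.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FakeMetadata:
    """Stand-in for the Iceberg table metadata file."""
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class FakeHmsTable:
    """Stand-in for the table object stored in the Hive Metastore."""
    parameters: Dict[str, str] = field(default_factory=dict)


@dataclass
class FakeTable:
    """Stand-in for the loaded table; mirrors the accessor quoted above."""
    metadata: FakeMetadata

    @property
    def properties(self) -> Dict[str, str]:
        return self.metadata.properties


# The property was written to the metadata file but never to the HMS.
metadata = FakeMetadata(properties={"p1": "123"})
hms_table = FakeHmsTable(parameters={"table_type": "ICEBERG"})
table = FakeTable(metadata=metadata)

reads_from_metadata = table.properties.get("p1")
reads_from_hms = hms_table.parameters.get("p1")
```

Under these assumptions, an assertion against table.properties succeeds even though the HMS parameters never received the key, which is why a test has to inspect the HMS table directly to catch the bug.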
I opened #2064 to track this
This PR will save the properties in both the HMS's table parameters and the table metadata's properties field.
Thanks for your review and suggestion.
Yes, this also confused me while I was developing this PR. Then I found:
https://github.com/kadai0308/iceberg-python/blob/ff149e8e9d8e0b8dd9e74158b1fb89724833b5b4/pyiceberg/catalog/hive.py#L342-L344
So, to make sure it actually wrote to the HMS properties, I need to create a new hive_client to get hive_table.parameters:
hive_client: _HiveClient = _HiveClient(catalog.properties["uri"])
with hive_client as open_client:
    hive_table = open_client.get_table(*TABLE_NAME)
    assert hive_table.parameters.get("abc") == "def"
    assert hive_table.parameters.get("p1") == "123"
instead of just testing like:
table = create_table(catalog)
assert table.properties == dict(p1="123", **DEFAULT_PROPERTIES)
I think I can also help with #2064.
This looks good to me @kadai0308, thanks for adding the test 👍
@@ -112,6 +113,27 @@ def test_table_properties(catalog: Catalog) -> None:
    assert "None type is not a supported value in properties: property_name" in str(exc_info.value)


@pytest.mark.integration
@pytest.mark.parametrize("catalog", [pytest.lazy_fixture("session_catalog_hive")])
Nit: we don't need to use parametrize with a single argument
@kadai0308, there is an issue with the code formatting, can you run
Closes: #2010
Rationale for this change
This change adds metadata_properties to the _construct_parameters function to ensure metadata properties are included in the parameters.
I'm not entirely confident about the changes, so please let me know if my understanding is correct—if so, I’ll proceed to add tests.
Thank you!
Are these changes tested?
Not yet.
Are there any user-facing changes?
Not sure.