fix: add metadata_properties to _construct_parameters when update hive table #2013

Open: wants to merge 2 commits into main

@kadai0308 kadai0308 commented May 18, 2025

Closes: #2010

Rationale for this change

This change adds metadata_properties to the _construct_parameters function to ensure metadata properties are included in the parameters.
I'm not entirely confident about the changes, so please let me know if my understanding is correct—if so, I’ll proceed to add tests.
Thank you!

Are these changes tested?

Not yet.

Are there any user-facing changes?

Not sure.

@kadai0308 kadai0308 force-pushed the fix/hive-client-does-not-update-table-properties branch from fa27421 to df294d1 Compare May 18, 2025 13:50
Contributor

@Fokko Fokko left a comment

@kadai0308 thanks for working on this. Would it be possible to add a test? We have a Hive container running that we use for tests. This way we don't break it in the future.

Maybe we can add a test somewhere here:

def test_table_properties(catalog: Catalog) -> None:

The session_catalog_hive is a HiveCatalog.

@kadai0308
Author

> @kadai0308 thanks for working on this. Would it be possible to add a test? We have a Hive container running that we use for tests. This way we don't break it in the future.
>
> Maybe we can add a test somewhere here:
>
> def test_table_properties(catalog: Catalog) -> None:
>
> The session_catalog_hive is a HiveCatalog.

sure

@kadai0308 kadai0308 force-pushed the fix/hive-client-does-not-update-table-properties branch from df294d1 to ff149e8 Compare May 25, 2025 13:17
@kadai0308
Author

@Fokko Can you help me review the PR? Thank you.

Contributor

@kevinjqliu kevinjqliu left a comment

Generally LGTM! I added a few comments. Thanks for working on this!

properties = {PROP_EXTERNAL: "TRUE", PROP_TABLE_TYPE: "ICEBERG", PROP_METADATA_LOCATION: metadata_location}
if previous_metadata_location:
    properties[PROP_PREVIOUS_METADATA_LOCATION] = previous_metadata_location

if metadata_properties:
    for key, value in metadata_properties.items():
        if key not in properties:
Contributor

👍 this is fine, it helps with not re-setting PROP_EXTERNAL, PROP_TABLE_TYPE, PROP_METADATA_LOCATION, and PROP_PREVIOUS_METADATA_LOCATION
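
To make the effect of the `if key not in properties` guard concrete, here is a minimal standalone sketch. It is not the pyiceberg implementation: `construct_parameters_sketch` is a hypothetical stand-in, the constant values are placeholders, and the loop body is an assumption because the diff excerpt above stops at the `if` line.

```python
from typing import Dict, Optional

# Placeholder values for this sketch only; the real constants are defined
# in pyiceberg/catalog/hive.py and may differ.
PROP_EXTERNAL = "EXTERNAL"
PROP_TABLE_TYPE = "table_type"
PROP_METADATA_LOCATION = "metadata_location"
PROP_PREVIOUS_METADATA_LOCATION = "previous_metadata_location"


def construct_parameters_sketch(
    metadata_location: str,
    previous_metadata_location: Optional[str] = None,
    metadata_properties: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Hypothetical stand-in for _construct_parameters, for illustration only."""
    properties = {
        PROP_EXTERNAL: "TRUE",
        PROP_TABLE_TYPE: "ICEBERG",
        PROP_METADATA_LOCATION: metadata_location,
    }
    if previous_metadata_location:
        properties[PROP_PREVIOUS_METADATA_LOCATION] = previous_metadata_location

    if metadata_properties:
        for key, value in metadata_properties.items():
            # Reserved keys set above win; user properties only fill the gaps.
            if key not in properties:
                # Assumed loop body: the diff excerpt ends before this line.
                properties[key] = str(value)
    return properties


params = construct_parameters_sketch(
    "s3://bucket/v2.metadata.json",
    previous_metadata_location="s3://bucket/v1.metadata.json",
    metadata_properties={"p1": "123", PROP_TABLE_TYPE: "not-iceberg"},
)
```

In this sketch the user-supplied `p1` is carried into the parameters, while the attempt to override the table type is ignored because that key is already set.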

@@ -111,6 +112,23 @@ def test_table_properties(catalog: Catalog) -> None:
        table.transaction().set_properties(property_name=None).commit_transaction()
    assert "None type is not a supported value in properties: property_name" in str(exc_info.value)

    if isinstance(catalog, HiveCatalog):
Contributor

This is a great test! Could you move it into its own test function, parametrized with just the Hive catalog?

@pytest.mark.integration
@pytest.mark.parametrize("catalog", [pytest.lazy_fixture("session_catalog_hive")])

@@ -111,6 +112,23 @@ def test_table_properties(catalog: Catalog) -> None:
        table.transaction().set_properties(property_name=None).commit_transaction()
Contributor

I was wondering why the rest of these tests pass since we're not setting the properties in the HMS. Turns out the table properties are saved in the table metadata using its properties field.

This is not what the table metadata's properties field should be used for:

properties: A string to string map of table properties. This is used to control settings that affect reading and writing and is not intended to be used for arbitrary metadata. For example, commit.retry.num-retries is used to control the number of commit retries.

This is a side effect of

@property
def properties(self) -> Dict[str, str]:
    """Properties of the table."""
    return self.metadata.properties

and

return Table(
    identifier=(table.dbName, table.tableName),
    metadata=metadata,
    metadata_location=metadata_location,
    io=self._load_file_io(metadata.properties, metadata_location),
    catalog=self,
)

We should fix this behavior and read/write properties using the HMS's table parameters. We can fix this separately from the current issue.
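
A toy model may help show why the existing assertions passed. Everything in it is hypothetical (`FakeMetadata`, `FakeHmsTable`, `FakeTable`, `commit_metadata_only`); it only mimics the two snippets above and is a simplification, not pyiceberg or Hive Metastore code.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FakeMetadata:
    """Stands in for the Iceberg table metadata."""
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class FakeHmsTable:
    """Stands in for the Hive Metastore table and its parameters map."""
    parameters: Dict[str, str] = field(default_factory=dict)


@dataclass
class FakeTable:
    metadata: FakeMetadata

    @property
    def properties(self) -> Dict[str, str]:
        # Like the property quoted above: reads the metadata, never the HMS.
        return self.metadata.properties


def commit_metadata_only(metadata: FakeMetadata, updates: Dict[str, str]) -> None:
    """Simplified commit that updates the metadata but leaves the HMS untouched."""
    metadata.properties.update(updates)


metadata = FakeMetadata()
hms_table = FakeHmsTable()
table = FakeTable(metadata)

commit_metadata_only(metadata, {"abc": "def"})

# A check against table.properties succeeds...
seen_via_table = table.properties.get("abc")
# ...while the HMS parameters were never written.
seen_via_hms = hms_table.parameters.get("abc")
```

Under these assumptions `seen_via_table` is `"def"` and `seen_via_hms` is `None`, which is why a test has to read the HMS parameters directly to notice the missing update.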

Contributor

I opened #2064 to track this

Contributor

@kevinjqliu kevinjqliu Jun 3, 2025

This PR will save the properties in both the HMS's table parameters and the table metadata's properties field.

Author

Thanks for your review and suggestion.

Yes, this also confused me when I developed this PR. Then I found:
https://github.com/kadai0308/iceberg-python/blob/ff149e8e9d8e0b8dd9e74158b1fb89724833b5b4/pyiceberg/catalog/hive.py#L342-L344

So to make sure it did write to the HMS properties, I need to create a new hive_client to get the hive_table.parameters:

hive_client: _HiveClient = _HiveClient(catalog.properties["uri"])

with hive_client as open_client:
    hive_table = open_client.get_table(*TABLE_NAME)
    assert hive_table.parameters.get("abc") == "def"
    assert hive_table.parameters.get("p1") == "123"

instead of just testing like this:

table = create_table(catalog)
assert table.properties == dict(p1="123", **DEFAULT_PROPERTIES)

I think I can also help with #2064.

Contributor

@Fokko Fokko left a comment

This looks good to me @kadai0308, thanks for adding the test 👍

@@ -112,6 +113,27 @@ def test_table_properties(catalog: Catalog) -> None:
    assert "None type is not a supported value in properties: property_name" in str(exc_info.value)


@pytest.mark.integration
@pytest.mark.parametrize("catalog", [pytest.lazy_fixture("session_catalog_hive")])
Contributor

Nit: we don't need to use parametrize with a single argument.

@Fokko
Contributor

Fokko commented Jun 8, 2025

@kadai0308, there is an issue with the code formatting. Can you run make lint? Thanks!

Development

Successfully merging this pull request may close these issues.

[bug] hive client does not update table properties
4 participants