Spark: Show metadata.json path in DESC TABLE EXTENDED #5006

Closed
wants to merge 1 commit

Conversation

singhpk234
Contributor

About the Change

This change shows the metadata location in DESC TABLE EXTENDED by adding metadata-location to the reported table properties.

Why this change is Required

We can use this to find the metadata pointer the table currently points to and then register the table via the recently introduced register table stored procedure. This is convenient for users who only want to use Spark SQL; a sketch of the flow follows below.
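
For illustration, a rough sketch of the intended pure-SQL flow (the catalog, database, and path names below are placeholders, and the register_table call is only an assumption about how the exposed location would be consumed, not part of this change):

-- with this change, the metadata.json location shows up among the table properties
DESCRIBE TABLE EXTENDED source_catalog.db.store_sales;

-- the reported metadata.json path can then be passed to register_table in another catalog
CALL target_catalog.system.register_table(
  table => 'db.store_sales',
  metadata_file => '<metadata.json path taken from the describe output>');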

Testing

Added a UT.
Added one more UT for path validation.

@github-actions github-actions bot added the spark label Jun 9, 2022
@rdblue
Contributor

rdblue commented Jun 12, 2022

@singhpk234, is this something we want to expose?

@singhpk234
Contributor Author

singhpk234 commented Jun 13, 2022

@rdblue, considering that the register table stored procedure is now in, I was wondering whether there is a pure SQL way to find this property. I had a migration use case (i.e. migrating Hive tables to Glue): I had to go to the Glue UI to find this path and then register it via SQL, and vice versa, go to the Hive table params to find it and register it in Glue.

MariaDB [hive]> SELECT * FROM TBLS WHERE TBL_NAME = 'store_sales';
MariaDB [hive]> SELECT * FROM TABLE_PARAMS WHERE TBL_ID = xx;
-- the metadata.json path is stored under the metadata_location PARAM_KEY

Apologies if there is some other SQL way to find it; I thought that having this exposed in DESCRIBE TABLE could have helped me. I agree this is not a property of the table but rather state of the table (we also expose current_snapshot_id). That was my rationale for putting out a PR for this change. Would love to know your thoughts.

There is an alternative way to find this as well, so it is not as if it cannot be worked around:

// load the Iceberg table, then read the current metadata file location from its table operations
Table table = Spark3Util.loadIcebergTable(spark, tableName);
String metadataJson = ((HiveTableOperations) (((HasTableOperations) table).operations())).currentMetadataLocation();

@rdblue
Contributor

rdblue commented Jun 29, 2022

Right now, you can use input_file_name() on some metadata tables to get this, but that's mostly a hack. I don't think that we want to expose this detail to users, but I could be convinced otherwise. I'm skeptical about the register table use case. Wouldn't you want to export from the current metastore so you don't have a duplicate table? That use case seems like a dangerous way to migrate.
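
For reference, the hack mentioned above would look something like the following (a sketch only; db.tbl is a placeholder, and which metadata table actually reports the metadata.json path through input_file_name() is not spelled out here):

-- input_file_name() reports the file Spark is reading for each row of the metadata table
SELECT input_file_name() FROM db.tbl.snapshots LIMIT 1;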

@singhpk234
Contributor Author

singhpk234 commented Jul 1, 2022

I don't think that we want to expose this detail to users, but I could be convinced otherwise

Agreed, hence I opened #5063 based on @jackye1995's suggestion. Apologies, I forgot to close this PR.

Wouldn't you want to export from the current metastore so you don't have a duplicate table? That use case seems like a dangerous way to migrate

In the case above I wanted duplicate tables (one per catalog) because I was benchmarking TPC-DS performance across catalogs (Hive / Glue). This is a very niche use case (and, apologies, not really a migration use case), and a dangerous one for production use.

@singhpk234
Contributor Author

Superseded by #5063.

@singhpk234 singhpk234 closed this Jul 1, 2022