-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AWS: copy Iceberg schema to Glue #3048
Conversation
.collect(Collectors.toList()); | ||
} | ||
|
||
private static String toTypeString(Type type) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't fully thought about the type conversion yet, just want to throw this here for discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really similar to the conversion in HiveSchemaUtil
. Is it possible to share that?
@@ -132,4 +138,61 @@ static void validateTableIdentifier(TableIdentifier tableIdentifier) { | |||
validateNamespace(tableIdentifier.namespace()); | |||
validateTableName(tableIdentifier.name()); | |||
} | |||
|
|||
static List<Column> toColumns(Schema schema) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally, we'd put conversion methods like this in a class called GlueSchemaUtil
instead of this converter class that is meant to be internal. It would not be unreasonable to open up these methods later and if we do we want some consistency in the public API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I agree, I am aware of the HiveSchemaUtil
class, but given these are just for informational purpose, because people who have access to their Glue table might not have permission in S3, but they should still see information like schema, etc.
I don't want people to actually take a dependency or use it as some reference implementations. The actual source of truth is always the underlying Iceberg table metadata, that's why I am keeping them in this protected class. If in the future there is actually a way to perform some integration with services like Glue ETL, we will publish a more defined converter of schema that can be public, but so far there is no plan for it.
I have added some explanations in javadoc, please let me know if you are fine with it, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good.
@rdblue thanks for the quick review! After experimenting the whole night for a few different ways, I think it is likely better to
Here is an example result stored in Glue: {
"Table": {
"Name": "test",
"DatabaseName": "jack",
"CreateTime": "2021-08-31T22:56:30-07:00",
"UpdateTime": "2021-08-31T22:56:30-07:00",
"Retention": 0,
"StorageDescriptor": {
"Columns": [
{
"Name": "i",
"Type": "int",
"Comment": "Iceberg column: { 1: i: required int }"
},
{
"Name": "l",
"Type": "bigint",
"Comment": "Iceberg column: { 2: l: required long }"
},
{
"Name": "d",
"Type": "date",
"Comment": "Iceberg column: { 3: d: required date }"
},
{
"Name": "t",
"Type": "string",
"Comment": "Iceberg column: { 4: t: required time }"
},
{
"Name": "ts",
"Type": "timestamp",
"Comment": "Iceberg column: { 5: ts: required timestamp }"
},
{
"Name": "tstz",
"Type": "timestamp",
"Comment": "Iceberg column: { 6: tstz: required timestamptz }"
},
{
"Name": "dec",
"Type": "decimal(9,2)",
"Comment": "Iceberg column: { 7: dec: required decimal(9, 2) }"
},
{
"Name": "s",
"Type": "string",
"Comment": "Iceberg column: { 8: s: required string }"
},
{
"Name": "u",
"Type": "string",
"Comment": "Iceberg column: { 9: u: required uuid }"
},
{
"Name": "f",
"Type": "binary",
"Comment": "Iceberg column: { 10: f: required fixed[3] }"
},
{
"Name": "b",
"Type": "binary",
"Comment": "Iceberg column: { 11: b: required binary }"
},
{
"Name": "struct",
"Type": "struct<i2:int,l2:bigint,d2:date>",
"Comment": "Iceberg column: { 12: struct: required struct<15: i2: required int, 16: l2: required long, 17: d2: required date> }"
},
{
"Name": "list",
"Type": "array<struct<i3:int,l3:bigint,d3:date>>",
"Comment": "Iceberg column: { 13: list: required list<struct<19: i3: required int, 20: l3: required long, 21: d3: required date>> }"
},
{
"Name": "map",
"Type": "map<string,struct<i4:int,l5:bigint,d6:date>>",
"Comment": "Iceberg column: { 14: map: required map<string, struct<24: i4: required int, 25: l5: required long, 26: d6: required date>> }"
}
],
"Location": "s3://bucket/path",
"Compressed": false,
"NumberOfBuckets": 0,
"SortColumns": [],
"StoredAsSubDirectories": false
},
"PartitionKeys": [
{
"Name": "s_bucket",
"Type": "int",
"Comment": "Iceberg partition field: {1000: s_bucket: bucket[16](8)}"
},
{
"Name": "map.i4_trunc",
"Type": "int",
"Comment": "Iceberg partition field: {1001: map.i4_trunc: truncate[2](24)}"
},
{
"Name": "ts_day",
"Type": "date",
"Comment": "Iceberg partition field: {1002: ts_day: day(5)}"
},
{
"Name": "i",
"Type": "int",
"Comment": "Iceberg partition field: {1003: i: identity(1)}"
},
{
"Name": "struct.i2",
"Type": "int",
"Comment": "Iceberg partition field: {1004: struct.i2: identity(15)}"
}
],
"TableType": "EXTERNAL_TABLE",
"Parameters": {
"metadata_location": "s3://bucket/path",
"table_type": "ICEBERG"
}
}
} |
As a part of Glue UI improvement, add a copy of Iceberg schema in Glue.
The converted schema is only informational, it does not have any functional usage.
@yyanyy