AWS: copy Iceberg schema to Glue #3048

jackye1995 · 2021-08-29T18:51:29Z

As a part of Glue UI improvement, add a copy of Iceberg schema in Glue.
The converted schema is only informational, it does not have any functional usage.

@yyanyy

jackye1995 · 2021-08-29T18:51:44Z

aws/src/main/java/org/apache/iceberg/aws/glue/IcebergToGlueConverter.java

+        .collect(Collectors.toList());
+  }
+
+  private static String toTypeString(Type type) {


I haven't fully thought about the type conversion yet, just want to throw this here for discussion.

This looks really similar to the conversion in HiveSchemaUtil. Is it possible to share that?

rdblue · 2021-08-29T19:43:13Z

aws/src/main/java/org/apache/iceberg/aws/glue/IcebergToGlueConverter.java

@@ -132,4 +138,61 @@ static void validateTableIdentifier(TableIdentifier tableIdentifier) {
    validateNamespace(tableIdentifier.namespace());
    validateTableName(tableIdentifier.name());
  }
+
+  static List<Column> toColumns(Schema schema) {


Normally, we'd put conversion methods like this in a class called GlueSchemaUtil instead of this converter class that is meant to be internal. It would not be unreasonable to open up these methods later and if we do we want some consistency in the public API.

Yes I agree, I am aware of the HiveSchemaUtil class, but given these are just for informational purpose, because people who have access to their Glue table might not have permission in S3, but they should still see information like schema, etc.

I don't want people to actually take a dependency or use it as some reference implementations. The actual source of truth is always the underlying Iceberg table metadata, that's why I am keeping them in this protected class. If in the future there is actually a way to perform some integration with services like Glue ETL, we will publish a more defined converter of schema that can be public, but so far there is no plan for it.

I have added some explanations in javadoc, please let me know if you are fine with it, thanks!

rdblue

Overall looks good.

jackye1995 · 2021-09-01T06:47:15Z

@rdblue thanks for the quick review!

After experimenting the whole night for a few different ways, I think it is likely better to

directly map Iceberg partition fields to Glue partition keys, in that sense the iceberg partition fields are basically "virtual columns" in Glue
add the full Iceberg field string as the comment of each column, so that users can see additional information like optional/required, nested field column ID, etc.

Here is an example result stored in Glue:

{
    "Table": {
        "Name": "test",
        "DatabaseName": "jack",
        "CreateTime": "2021-08-31T22:56:30-07:00",
        "UpdateTime": "2021-08-31T22:56:30-07:00",
        "Retention": 0,
        "StorageDescriptor": {
            "Columns": [
                {
                    "Name": "i",
                    "Type": "int",
                    "Comment": "Iceberg column: { 1: i: required int }"
                },
                {
                    "Name": "l",
                    "Type": "bigint",
                    "Comment": "Iceberg column: { 2: l: required long }"
                },
                {
                    "Name": "d",
                    "Type": "date",
                    "Comment": "Iceberg column: { 3: d: required date }"
                },
                {
                    "Name": "t",
                    "Type": "string",
                    "Comment": "Iceberg column: { 4: t: required time }"
                },
                {
                    "Name": "ts",
                    "Type": "timestamp",
                    "Comment": "Iceberg column: { 5: ts: required timestamp }"
                },
                {
                    "Name": "tstz",
                    "Type": "timestamp",
                    "Comment": "Iceberg column: { 6: tstz: required timestamptz }"
                },
                {
                    "Name": "dec",
                    "Type": "decimal(9,2)",
                    "Comment": "Iceberg column: { 7: dec: required decimal(9, 2) }"
                },
                {
                    "Name": "s",
                    "Type": "string",
                    "Comment": "Iceberg column: { 8: s: required string }"
                },
                {
                    "Name": "u",
                    "Type": "string",
                    "Comment": "Iceberg column: { 9: u: required uuid }"
                },
                {
                    "Name": "f",
                    "Type": "binary",
                    "Comment": "Iceberg column: { 10: f: required fixed[3] }"
                },
                {
                    "Name": "b",
                    "Type": "binary",
                    "Comment": "Iceberg column: { 11: b: required binary }"
                },
                {
                    "Name": "struct",
                    "Type": "struct<i2:int,l2:bigint,d2:date>",
                    "Comment": "Iceberg column: { 12: struct: required struct<15: i2: required int, 16: l2: required long, 17: d2: required date> }"
                },
                {
                    "Name": "list",
                    "Type": "array<struct<i3:int,l3:bigint,d3:date>>",
                    "Comment": "Iceberg column: { 13: list: required list<struct<19: i3: required int, 20: l3: required long, 21: d3: required date>> }"
                },
                {
                    "Name": "map",
                    "Type": "map<string,struct<i4:int,l5:bigint,d6:date>>",
                    "Comment": "Iceberg column: { 14: map: required map<string, struct<24: i4: required int, 25: l5: required long, 26: d6: required date>> }"
                }
            ],
            "Location": "s3://bucket/path",
            "Compressed": false,
            "NumberOfBuckets": 0,
            "SortColumns": [],
            "StoredAsSubDirectories": false
        },
        "PartitionKeys": [
            {
                "Name": "s_bucket",
                "Type": "int",
                "Comment": "Iceberg partition field: {1000: s_bucket: bucket[16](8)}"
            },
            {
                "Name": "map.i4_trunc",
                "Type": "int",
                "Comment": "Iceberg partition field: {1001: map.i4_trunc: truncate[2](24)}"
            },
            {
                "Name": "ts_day",
                "Type": "date",
                "Comment": "Iceberg partition field: {1002: ts_day: day(5)}"
            },
            {
                "Name": "i",
                "Type": "int",
                "Comment": "Iceberg partition field: {1003: i: identity(1)}"
            },
            {
                "Name": "struct.i2",
                "Type": "int",
                "Comment": "Iceberg partition field: {1004: struct.i2: identity(15)}"
            }
        ],
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {
            "metadata_location": "s3://bucket/path",
            "table_type": "ICEBERG"
        }
    }
}

AWS: copy Iceberg schema to Glue

4bdfdc9

github-actions bot added the AWS label Aug 29, 2021

jackye1995 commented Aug 29, 2021

View reviewed changes

rdblue reviewed Aug 29, 2021

View reviewed changes

rdblue approved these changes Aug 29, 2021

View reviewed changes

reorganize comments and partition key fields

beea698

yyanyy approved these changes Sep 2, 2021

View reviewed changes

yyanyy merged commit 12ffb23 into apache:master Sep 2, 2021

jackye1995 mentioned this pull request Oct 20, 2021

Glue catalog table schema improvement #3325

Closed

ekorchison mentioned this pull request Feb 6, 2023

Iceberg tables created with Glue catalog cannot be queried by Athena trinodb/trino#15934

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AWS: copy Iceberg schema to Glue #3048

AWS: copy Iceberg schema to Glue #3048

jackye1995 commented Aug 29, 2021

jackye1995 Aug 29, 2021

rdblue Aug 29, 2021

rdblue Aug 29, 2021

jackye1995 Sep 1, 2021

rdblue left a comment

jackye1995 commented Sep 1, 2021

AWS: copy Iceberg schema to Glue #3048

AWS: copy Iceberg schema to Glue #3048

Conversation

jackye1995 commented Aug 29, 2021

jackye1995 Aug 29, 2021

Choose a reason for hiding this comment

rdblue Aug 29, 2021

Choose a reason for hiding this comment

rdblue Aug 29, 2021

Choose a reason for hiding this comment

jackye1995 Sep 1, 2021

Choose a reason for hiding this comment

rdblue left a comment

Choose a reason for hiding this comment

jackye1995 commented Sep 1, 2021