Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AWS: copy Iceberg schema to Glue #3048

Merged
merged 2 commits into from
Sep 2, 2021
Merged

Conversation

jackye1995
Copy link
Contributor

As a part of Glue UI improvement, add a copy of Iceberg schema in Glue.
The converted schema is only informational, it does not have any functional usage.

@yyanyy

@github-actions github-actions bot added the AWS label Aug 29, 2021
.collect(Collectors.toList());
}

private static String toTypeString(Type type) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't fully thought about the type conversion yet, just want to throw this here for discussion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really similar to the conversion in HiveSchemaUtil. Is it possible to share that?

@@ -132,4 +138,61 @@ static void validateTableIdentifier(TableIdentifier tableIdentifier) {
validateNamespace(tableIdentifier.namespace());
validateTableName(tableIdentifier.name());
}

static List<Column> toColumns(Schema schema) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally, we'd put conversion methods like this in a class called GlueSchemaUtil instead of this converter class that is meant to be internal. It would not be unreasonable to open up these methods later and if we do we want some consistency in the public API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I agree, I am aware of the HiveSchemaUtil class, but given these are just for informational purpose, because people who have access to their Glue table might not have permission in S3, but they should still see information like schema, etc.

I don't want people to actually take a dependency or use it as some reference implementations. The actual source of truth is always the underlying Iceberg table metadata, that's why I am keeping them in this protected class. If in the future there is actually a way to perform some integration with services like Glue ETL, we will publish a more defined converter of schema that can be public, but so far there is no plan for it.

I have added some explanations in javadoc, please let me know if you are fine with it, thanks!

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good.

@jackye1995
Copy link
Contributor Author

@rdblue thanks for the quick review!

After experimenting the whole night for a few different ways, I think it is likely better to

  1. directly map Iceberg partition fields to Glue partition keys, in that sense the iceberg partition fields are basically "virtual columns" in Glue
  2. add the full Iceberg field string as the comment of each column, so that users can see additional information like optional/required, nested field column ID, etc.

Here is an example result stored in Glue:

{
    "Table": {
        "Name": "test",
        "DatabaseName": "jack",
        "CreateTime": "2021-08-31T22:56:30-07:00",
        "UpdateTime": "2021-08-31T22:56:30-07:00",
        "Retention": 0,
        "StorageDescriptor": {
            "Columns": [
                {
                    "Name": "i",
                    "Type": "int",
                    "Comment": "Iceberg column: { 1: i: required int }"
                },
                {
                    "Name": "l",
                    "Type": "bigint",
                    "Comment": "Iceberg column: { 2: l: required long }"
                },
                {
                    "Name": "d",
                    "Type": "date",
                    "Comment": "Iceberg column: { 3: d: required date }"
                },
                {
                    "Name": "t",
                    "Type": "string",
                    "Comment": "Iceberg column: { 4: t: required time }"
                },
                {
                    "Name": "ts",
                    "Type": "timestamp",
                    "Comment": "Iceberg column: { 5: ts: required timestamp }"
                },
                {
                    "Name": "tstz",
                    "Type": "timestamp",
                    "Comment": "Iceberg column: { 6: tstz: required timestamptz }"
                },
                {
                    "Name": "dec",
                    "Type": "decimal(9,2)",
                    "Comment": "Iceberg column: { 7: dec: required decimal(9, 2) }"
                },
                {
                    "Name": "s",
                    "Type": "string",
                    "Comment": "Iceberg column: { 8: s: required string }"
                },
                {
                    "Name": "u",
                    "Type": "string",
                    "Comment": "Iceberg column: { 9: u: required uuid }"
                },
                {
                    "Name": "f",
                    "Type": "binary",
                    "Comment": "Iceberg column: { 10: f: required fixed[3] }"
                },
                {
                    "Name": "b",
                    "Type": "binary",
                    "Comment": "Iceberg column: { 11: b: required binary }"
                },
                {
                    "Name": "struct",
                    "Type": "struct<i2:int,l2:bigint,d2:date>",
                    "Comment": "Iceberg column: { 12: struct: required struct<15: i2: required int, 16: l2: required long, 17: d2: required date> }"
                },
                {
                    "Name": "list",
                    "Type": "array<struct<i3:int,l3:bigint,d3:date>>",
                    "Comment": "Iceberg column: { 13: list: required list<struct<19: i3: required int, 20: l3: required long, 21: d3: required date>> }"
                },
                {
                    "Name": "map",
                    "Type": "map<string,struct<i4:int,l5:bigint,d6:date>>",
                    "Comment": "Iceberg column: { 14: map: required map<string, struct<24: i4: required int, 25: l5: required long, 26: d6: required date>> }"
                }
            ],
            "Location": "s3://bucket/path",
            "Compressed": false,
            "NumberOfBuckets": 0,
            "SortColumns": [],
            "StoredAsSubDirectories": false
        },
        "PartitionKeys": [
            {
                "Name": "s_bucket",
                "Type": "int",
                "Comment": "Iceberg partition field: {1000: s_bucket: bucket[16](8)}"
            },
            {
                "Name": "map.i4_trunc",
                "Type": "int",
                "Comment": "Iceberg partition field: {1001: map.i4_trunc: truncate[2](24)}"
            },
            {
                "Name": "ts_day",
                "Type": "date",
                "Comment": "Iceberg partition field: {1002: ts_day: day(5)}"
            },
            {
                "Name": "i",
                "Type": "int",
                "Comment": "Iceberg partition field: {1003: i: identity(1)}"
            },
            {
                "Name": "struct.i2",
                "Type": "int",
                "Comment": "Iceberg partition field: {1004: struct.i2: identity(15)}"
            }
        ],
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {
            "metadata_location": "s3://bucket/path",
            "table_type": "ICEBERG"
        }
    }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants