Core: Fix non-setting row-lineage from table properties on initial table creation #12307

Merged
merged 1 commit into from
Feb 18, 2025

Conversation

tomtongue
Contributor

@tomtongue tomtongue commented Feb 18, 2025

Overview

Fix the row-lineage table property reflection on enableRowLineage.

Issue

Currently, enabling the Row Lineage feature through the Iceberg table properties requires two operations:

  1. Create an Iceberg table
  2. Update table properties

In the first step, "Create an Iceberg table", even if you set row-lineage to true in the table properties, the property isn't reflected as a top-level field in the Iceberg table's metadata.json. To enable the feature, you therefore have to run an additional table properties update after creating the table.

Details

Tested two cases: Spark and the Java API.

Spark case

Create an Iceberg table using Spark with the following query:

spark.sql("""
CREATE TABLE db.rowlin (id int, name string, year int) USING iceberg
TBLPROPERTIES ('format-version'='3', 'row-lineage'='true')
LOCATION 's3://bucket/iceberg-v3/row-lineage'
""")

The relevant metadata.json is stored in the specified bucket and path as below:

aws s3 ls s3://bucket/iceberg-v3/row-lineage/ --recursive
2025-02-18 16:56:28       1194 iceberg-v3/row-lineage/metadata/00000-1eb8c96e-f503-4ff9-b4e0-53cb3ede0116.metadata.json

At this point, the (partial) metadata content is shown below. It has no top-level row-lineage field, even though the parameter is present in the properties section.

{
  "format-version" : 3,
  "table-uuid" : "eaf5dec9-7866-49a5-81c6-11af8f344e1f",
  "location" : "s3://bucket/iceberg-v3/row-lineage",
  "last-sequence-number" : 0,
  "last-updated-ms" : 1739865386995,
  "last-column-id" : 3,
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ { ... } ]
  } ],
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ ]
  } ],
  "last-partition-id" : 999,
  "default-sort-order-id" : 0,
  "sort-orders" : [ {
    "order-id" : 0,
    "fields" : [ ]
  } ],
  "properties" : {
    "owner" : "hadoop",
    "write.update.mode" : "merge-on-read",
    "write.parquet.compression-codec" : "zstd",
    "row-lineage" : "true"
  },
  "current-snapshot-id" : null,
...
}

Next, set the same table property again, for example with ALTER TABLE db.rowlin SET TBLPROPERTIES('row-lineage'= 'true').

After the query completes, the content of the new metadata.json is shown below. The top-level row-lineage and next-row-id fields are added.

{
  "format-version" : 3,
  "table-uuid" : "eaf5dec9-7866-49a5-81c6-11af8f344e1f",
  "location" : "s3://bucket/iceberg-v3/row-lineage",
  "last-sequence-number" : 0,
  "last-updated-ms" : 1739865514775,
  "last-column-id" : 3,
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ { ... } ]
  } ],
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ ]
  } ],
  "last-partition-id" : 999,
  "default-sort-order-id" : 0,
  "sort-orders" : [ {
    "order-id" : 0,
    "fields" : [ ]
  } ],
  "properties" : {
    "owner" : "hadoop",
    "write.update.mode" : "merge-on-read",
    "write.parquet.compression-codec" : "zstd",
    "row-lineage" : "true"
  },
  "current-snapshot-id" : null,
  "row-lineage" : true,  // <= ADDED
  "next-row-id" : 0, // <= ADDED
  "refs" : { },
  "snapshots" : [ ],
  "statistics" : [ ],
  "partition-statistics" : [ ],
  "snapshot-log" : [ ],
  "metadata-log" : [ {
    "timestamp-ms" : 1739865386995,
    "metadata-file" : "s3://bucket/iceberg-v3/row-lineage/metadata/00000-1eb8c96e-f503-4ff9-b4e0-53cb3ede0116.metadata.json"
  } ]
}

Here's the diff between two metadata files:

$ diff 00000-1eb8c96e-f503-4ff9-b4e0-53cb3ede0116.metadata.json 00001-ebf641c8-9603-45d5-92c6-dafac315375e.metadata.json
6c6
<   "last-updated-ms" : 1739865386995,
---
>   "last-updated-ms" : 1739865514775,
46a47,48
>   "row-lineage" : true,
>   "next-row-id" : 0,
52c54,57
<   "metadata-log" : [ ]
---
>   "metadata-log" : [ {
>     "timestamp-ms" : 1739865386995,
>     "metadata-file" : "s3://bucket/iceberg-v3/row-lineage/metadata/00000-1eb8c96e-f503-4ff9-b4e0-53cb3ede0116.metadata.json"
>   } ]

Java API case

Example script:

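import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.CatalogProperties;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableProperties;
import org.apache.iceberg.aws.s3.S3FileIO;
import org.apache.iceberg.catalog.Namespace;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.rest.RESTCatalog;
import org.apache.iceberg.types.Types;

public class Main {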
    private static final RESTCatalog catalog = new RESTCatalog();
    private static final Schema schema = new Schema(
            Types.NestedField.required(1, "id", Types.IntegerType.get()),
            Types.NestedField.required(2, "name", Types.StringType.get()),
            Types.NestedField.required(3, "year", Types.IntegerType.get()),
            Types.NestedField.required(4, "category", Types.StringType.get())
    );

    public static void main(String[] args) throws IOException {
        Map<String, String> properties = new HashMap<>();
        properties.put(CatalogProperties.CATALOG_IMPL, "org.apache.iceberg.rest.RESTCatalog");
        properties.put(CatalogProperties.URI, "http://localhost:8181");
        properties.put(CatalogProperties.FILE_IO_IMPL, S3FileIO.class.getName());

        catalog.setConf(new Configuration());
        catalog.initialize("rest", properties);

        // Params
        String db = "db";
        Namespace ns = Namespace.of(db);
        String tableName = "rowlin_java";
        String location = String.format("s3://bucket/iceberg-java/%s/%s", db, tableName);
        Map<String, String> tblprops = new HashMap<>();
        tblprops.put("format-version", "3");
        tblprops.put(TableProperties.ROW_LINEAGE, "true");
        TableIdentifier tbl = TableIdentifier.of(ns, tableName);

        // Create a table
        createTableIfNotExists(db, tableName, schema, location, tblprops);

        // Load the table
        Table table = catalog.loadTable(tbl);

        // Update table property
        table.updateProperties().set(TableProperties.ROW_LINEAGE, "true").commit();
    }

    private static void createTableIfNotExists(String db, String table, Schema schema, String location, Map<String, String> properties) {
        TableIdentifier tbl = TableIdentifier.of(db, table);
        Namespace ns = Namespace.of(db);
        if (!catalog.listTables(ns).contains(tbl)) {
            catalog.createTable(tbl, schema, null, location, properties);
        }
    }
}

After running the script, two versions of the metadata.json file are created in the specified S3 bucket:

$ aws s3 ls s3://bucket/iceberg-java/db/rowlin_java --recursive
2025-02-18 19:54:11       1224 iceberg-java/db/rowlin_java/metadata/00000-6e7ccb2c-4eba-45e5-8cda-500e487729a3.metadata.json
2025-02-18 19:54:15       1441 iceberg-java/db/rowlin_java/metadata/00001-74955cfb-05d3-4b95-9f74-651c16389b3e.metadata.json

The content of each metadata file is below:

// 00000-6e7ccb2c-4eba-45e5-8cda-500e487729a3.metadata.json
$ aws s3 cp s3://bucket/iceberg-java/db/rowlin_java/metadata/00000-6e7ccb2c-4eba-45e5-8cda-500e487729a3.metadata.json - | grep row -n3   # no top-level row-lineage field

1-{
2-  "format-version" : 3,
3-  "table-uuid" : "27529028-4f30-4100-a3e7-c23984ab2ccb",
4:  "location" : "s3://bucket/iceberg-java/db/rowlin_java",
5-  "last-sequence-number" : 0,
6-  "last-updated-ms" : 1739876049987,
7-  "last-column-id" : 4,
--
44-  } ],
45-  "properties" : {
46-    "write.parquet.compression-codec" : "zstd",
47:    "row-lineage" : "true"
48-  },
49-  "current-snapshot-id" : null,
50-  "refs" : { },

// 00001-74955cfb-05d3-4b95-9f74-651c16389b3e.metadata.json
$ aws s3 cp s3://bucket/iceberg-java/db/rowlin_java/metadata/00001-74955cfb-05d3-4b95-9f74-651c16389b3e.metadata.json - | grep row -n3   # top-level row-lineage is added

1-{
2-  "format-version" : 3,
3-  "table-uuid" : "27529028-4f30-4100-a3e7-c23984ab2ccb",
4:  "location" : "s3://bucket/iceberg-java/db/rowlin_java",
5-  "last-sequence-number" : 0,
6-  "last-updated-ms" : 1739876054823,
7-  "last-column-id" : 4,
--
44-  } ],
45-  "properties" : {
46-    "write.parquet.compression-codec" : "zstd",
47:    "row-lineage" : "true"
48-  },
49-  "current-snapshot-id" : null,
50:  "row-lineage" : true, // <= ADDED
51:  "next-row-id" : 0, // <= ADDED
52-  "refs" : { },
53-  "snapshots" : [ ],
54-  "statistics" : [ ],
--
56-  "snapshot-log" : [ ],
57-  "metadata-log" : [ {
58-    "timestamp-ms" : 1739876049987,
59:    "metadata-file" : "s3://bucket/iceberg-java/db/rowlin_java/metadata/00000-6e7ccb2c-4eba-45e5-8cda-500e487729a3.metadata.json"
60-

@github-actions github-actions bot added the core label Feb 18, 2025
@tomtongue tomtongue changed the title Core: Fix setting row-lineage from table properties when initially creating an Iceberg table Core: Fix non-setting row-lineage from table properties when initially creating an Iceberg table Feb 18, 2025
@tomtongue
Contributor Author

@RussellSpitzer I believe you're working on the row-lineage feature. When you have a chance, could you check this issue and review the change?

@tomtongue tomtongue changed the title Core: Fix non-setting row-lineage from table properties when initially creating an Iceberg table Core: Fix non-setting row-lineage from table properties on initial table creation Feb 18, 2025
@RussellSpitzer
Member

This was intentional. We are avoiding writing the property so that Table Metadata Versions that do not include row-lineage at all will not populate the fields.

@RussellSpitzer
Member

Ah wait I understand this more in your code than in this description. You are just saying that you can't enable row-lineage during a Create statement. That sounds like it's fine to fix.

@@ -146,6 +151,7 @@ static TableMetadata newTableMetadata(
         .setDefaultSortOrder(freshSortOrder)
         .setLocation(location)
         .setProperties(properties)
+        .setRowLineage(rowLineage)
Member

We don't want to set RowLineage to false if it is unset. So we have 2 options here

  1. Set it to false but only if the table is V3
  2. Set it to true only if it is true and leave it absent otherwise.

Contributor Author

Sure, thanks for checking. Let me fix this.

Member

You actually don't have to do anything! I forgot I already did this in "setRowLineage". It won't set the field unless it's getting set to true.

Contributor Author

@RussellSpitzer If you don't set row-lineage in the table properties, the field stays unset (it is kept empty), because setRowLineage handles a null argument:

private Builder setRowLineage(Boolean newRowLineage) {
  if (newRowLineage == null) {
    return this;
  }

Contributor Author

Oh yes, it is. Thanks for letting me know. So I believe this part should be fine.
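
The behavior discussed in this thread (a null argument leaves the metadata untouched, and the field is only written when the value is true) can be sketched roughly as follows. This is a self-contained illustration, not Iceberg's actual builder: the class name RowLineageBuilderSketch and the map-backed fields are hypothetical, and initializing next-row-id to 0 is taken from the metadata output shown in the description.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a metadata builder; NOT Iceberg's TableMetadata.Builder.
final class RowLineageBuilderSketch {
  // Top-level metadata fields, kept in insertion order for readable output.
  private final Map<String, Object> fields = new LinkedHashMap<>();

  // null means the property was never set: leave the metadata untouched.
  RowLineageBuilderSketch setRowLineage(Boolean newRowLineage) {
    if (newRowLineage == null) {
      return this;
    }
    if (newRowLineage) {
      // Only a true value writes the fields.
      fields.put("row-lineage", true);
      fields.putIfAbsent("next-row-id", 0L);
    }
    // false is ignored, so metadata without row lineage stays free of the field.
    return this;
  }

  Map<String, Object> build() {
    return new LinkedHashMap<>(fields);
  }

  public static void main(String[] args) {
    System.out.println(new RowLineageBuilderSketch().setRowLineage(null).build());
    System.out.println(new RowLineageBuilderSketch().setRowLineage(false).build());
    System.out.println(new RowLineageBuilderSketch().setRowLineage(true).build());
  }
}
```

With a setter shaped like this, the caller can pass whatever it derived from the table properties without an extra null or format-version check.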

@tomtongue
Contributor Author

tomtongue commented Feb 18, 2025

> Ah wait I understand this more in your code than in this description. You are just saying that you can't enable row-lineage during a Create statement. That sounds like it's fine to fix.

Thanks so much for the quick review! Yes, currently it's not possible to enable row-lineage in a CreateTable statement; you always have to set row-lineage: true again, even if it was already set to true in CreateTable.

Additionally, if you create an Iceberg table without row-lineage: true in its table properties, merely having the row lineage parameter in the table properties does not enable the feature. So, for now, you always need to update the table properties afterwards with ALTER TABLE SET TBLPROPERTIES, updateProperties, etc.

This fix enables the row lineage feature as soon as row-lineage: true is set in the Iceberg table properties, including at table creation.
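
As a rough sketch of the idea, the creation path can derive a nullable Boolean from the table properties and hand it to the metadata builder. The helper below is hypothetical (the name rowLineageFrom and the class are illustrative, not the code merged in this PR); only the "row-lineage" key comes from the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper illustrating the lookup; not the code merged in this PR.
final class CreateTimeRowLineageSketch {
  static final String ROW_LINEAGE = "row-lineage";

  // Returns null when the property is absent, so the caller can leave the
  // metadata field unset instead of writing an explicit false.
  static Boolean rowLineageFrom(Map<String, String> properties) {
    String value = properties.get(ROW_LINEAGE);
    return value == null ? null : Boolean.parseBoolean(value);
  }

  public static void main(String[] args) {
    Map<String, String> props = new HashMap<>();
    props.put("format-version", "3");
    System.out.println(rowLineageFrom(props)); // property absent

    props.put(ROW_LINEAGE, "true");
    System.out.println(rowLineageFrom(props)); // property set at creation time
  }
}
```

Returning null rather than false for a missing key keeps the "absent" and "explicitly disabled" cases distinguishable for the caller.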

@RussellSpitzer RussellSpitzer merged commit e8d3a06 into apache:main Feb 18, 2025
46 checks passed
@RussellSpitzer
Member

Thanks @tomtongue !

@tomtongue
Contributor Author

Thanks so much for the quick review! @RussellSpitzer

@tomtongue tomtongue deleted the fix-row-lineage-tblprops branch February 18, 2025 16:05
ankurbansal-tradedoubler added a commit to ankurbansal-tradedoubler/iceberg that referenced this pull request Feb 19, 2025