Make current-snapshot-id optional while maintaining backwards compatibility by s-akhtar-baig · Pull Request #374 · apache/iceberg-rust

s-akhtar-baig · 2024-05-15T18:19:42Z

Resolves #352

Problem: Java implementations older than 1.4.0 incorrectly treat the optional attribute current-snapshot-id as a required attribute in TableMetadata.

Solution: Use the legacy-current-snapshot-id environment variable to force iceberg-rust to create and load tables with a metadata file that is compatible with those older versions.

Testing: Added new unit tests.

Future work:

Enhancement: Add support for a config file in iceberg-rust #375: Implement a config file for iceberg-rust and provide users an additional option to set legacy-current-snapshot-id as a configuration value.

s-akhtar-baig · 2024-05-15T18:35:35Z

@Fokko, these changes make current-snapshot-id optional and use a flag to maintain backwards compatibility. Let me know what you think.

Please note that, with the flag on we can create and load a table with -1 as the current-snapshot-id. This is different from pyiceberg which I believe only supports creating a table.

s-akhtar-baig · 2024-05-15T18:37:51Z

Also, do we have a preference on where we want to document the "backwards compatibility" section?

Fokko · 2024-05-23T07:54:28Z

Sorry for the late reply as I was touching grass.

We're trying to solve two problems:

1. Don't produce -1, since it is erroneous unless there is a snapshot with the ID -1. You can still create a table with -1 as the current snapshot ID if you set the flag. If you use Java <1.4.0 within your organization, this might be required to be able to read the tables.
2. When deserializing, if we encounter -1 as the current snapshot ID, we should convert it into None.
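
The two rules above can be sketched with plain functions. This is only an illustration, not the code in this PR: the names `LEGACY_NO_SNAPSHOT_ID`, `normalize_current_snapshot_id`, and `current_snapshot_id_to_write` are hypothetical, and the sketch assumes no real snapshot ever has the ID -1.

```rust
/// Sentinel that older writers emit to mean "no current snapshot".
/// (Hypothetical constant name, used only for this sketch.)
const LEGACY_NO_SNAPSHOT_ID: i64 = -1;

/// Read path: treat the legacy sentinel as "no current snapshot".
/// Assumes no real snapshot has the ID -1.
pub fn normalize_current_snapshot_id(raw: Option<i64>) -> Option<i64> {
    raw.filter(|id| *id != LEGACY_NO_SNAPSHOT_ID)
}

/// Write path: returns the value to emit, or `None` when the field
/// should be omitted from the metadata file entirely.
pub fn current_snapshot_id_to_write(id: Option<i64>, legacy: bool) -> Option<i64> {
    match (id, legacy) {
        // A real snapshot ID is always written as-is.
        (Some(id), _) => Some(id),
        // No snapshot, legacy flag on: emit the sentinel for old readers.
        (None, true) => Some(LEGACY_NO_SNAPSHOT_ID),
        // No snapshot, legacy flag off: omit the field.
        (None, false) => None,
    }
}
```

With the flag off the field is simply omitted; with the flag on, a table without snapshots round-trips through -1 and back to None.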

liurenjie1024

Thanks @s-akhtar-baig for this PR. I have a concern about controlling this behavior through the environment. I'd prefer to pass an option to the serializer/deserializer so that it is easier to maintain. What do you think?

liurenjie1024 · 2024-05-24T09:59:33Z

crates/iceberg/src/spec/table_metadata.rs


+    // Skip serializing current snapshot if snapshot is none
+    // and `legacy-current-snapshot-id` is not enabled.
+    fn skip_current_snapshot(current_snapshot_id: &Option<i64>) -> bool {


I'm hesitant to control this behavior through the environment. Maybe we should pass options to the reader or writer of table metadata?

liurenjie1024 · 2024-05-24T10:00:25Z

crates/catalog/rest/testdata/update_table_response.json

      "write.summary.partition-limit": "100",
      "write.parquet.compression-codec": "zstd"
    },
-    "current-snapshot-id": -1,


Maybe we should add another test case rather than simply removing this?

liurenjie1024 · 2024-05-27T14:44:19Z

cc @Xuanwo @Fokko @sdd @marvinlanhenke What do you think?

sdd · 2024-05-27T18:34:59Z

One problem with doing it through an env var is that it applies to every table you hit in your service. I think it would be better if it was a config param so that you can configure it per table.
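
A per-table setting could look roughly like the sketch below. It is an assumption-laden illustration rather than anything in iceberg-rust: `MetadataWriteOptions`, `PerTableOptions`, and the string table identifiers are all hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical options that influence how table metadata is written.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct MetadataWriteOptions {
    pub legacy_current_snapshot_id: bool,
}

/// Hypothetical registry: a service-wide default plus per-table overrides.
#[derive(Debug, Default)]
pub struct PerTableOptions {
    default: MetadataWriteOptions,
    overrides: HashMap<String, MetadataWriteOptions>,
}

impl PerTableOptions {
    pub fn new(default: MetadataWriteOptions) -> Self {
        Self { default, overrides: HashMap::new() }
    }

    /// Override the options for a single table.
    pub fn set_for_table(&mut self, table: &str, opts: MetadataWriteOptions) {
        self.overrides.insert(table.to_string(), opts);
    }

    /// Options for `table`, falling back to the service-wide default.
    pub fn for_table(&self, table: &str) -> MetadataWriteOptions {
        self.overrides.get(table).copied().unwrap_or(self.default)
    }
}
```

Unlike a process-wide environment variable, only the tables that old Java readers still consume need the legacy flag here.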

s-akhtar-baig · 2024-05-31T16:58:49Z

@liurenjie1024 @sdd, sure, I can work on #375 and use a config file to set the value. Let me know if you have a different approach in mind, thanks!

liurenjie1024 · 2024-06-01T03:26:21Z

@s-akhtar-baig I think the first step may be to add a config for the serializer/deserializer of table metadata to control this behavior. A config file is just one way to initialize the config; we should decouple these two things.
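
One way to read this suggestion is sketched below: the config is a plain value, and where its fields come from is a separate concern. The type `TableMetadataSerdeConfig`, its methods, and the use of `legacy-current-snapshot-id` as the lookup key are assumptions made for illustration; they are not taken from the PR's code.

```rust
/// Hypothetical config consumed by the table metadata serializer/deserializer.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct TableMetadataSerdeConfig {
    pub legacy_current_snapshot_id: bool,
}

impl TableMetadataSerdeConfig {
    /// Build the config from any key/value source (environment, parsed
    /// config file, in-memory map). The serde code never sees the source.
    pub fn from_lookup<F>(lookup: F) -> Self
    where
        F: Fn(&str) -> Option<String>,
    {
        let legacy = lookup("legacy-current-snapshot-id")
            .map(|v| v.trim().eq_ignore_ascii_case("true"))
            .unwrap_or(false);
        Self { legacy_current_snapshot_id: legacy }
    }

    /// One possible source: process environment variables.
    pub fn from_env() -> Self {
        Self::from_lookup(|key| std::env::var(key).ok())
    }
}
```

A config-file loader would then be just another caller of `from_lookup`, leaving the serializer and deserializer untouched.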

s-akhtar-baig · 2024-06-19T17:23:52Z

@liurenjie1024, I have pushed some changes to have config values for TableMetadata in one place. Please let me know if the direction is right and/or if you had a different idea in mind. I will handle reviews on the tests once I have your feedback on the config changes.

For now, I am using the environment to load these values but future work involves loading from a config file on top of that.

liurenjie1024 · 2024-06-26T08:22:08Z

Hi @s-akhtar-baig, thanks for the contribution. I have a small concern about the current approach because it does not seem extensible to me. How about this:

pub struct TableMetadataParser {
    use_legacy_id: bool,
}

impl TableMetadataParser {
    pub async fn write(&self, output_file: OutputFile, table_metadata: &TableMetadata) {
        // ...
    }

    pub async fn read(&self, input_file: InputFile) -> Result<TableMetadata> {
        // ...
    }
}

Then, instead of serializing and deserializing table metadata directly, we would control the behavior through TableMetadataParser. What do you think?
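
To make the shape of this proposal concrete, here is a synchronous, in-memory stand-in. It is only a sketch under stated assumptions: the real proposal is async and works on InputFile/OutputFile, whereas this version uses a hypothetical `RawFields` map in place of the serialized file and a toy `TableMetadata` with a single field.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for the real table metadata (one field only).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TableMetadata {
    pub current_snapshot_id: Option<i64>,
}

/// Hypothetical stand-in for the serialized metadata file.
pub type RawFields = BTreeMap<String, i64>;

const FIELD: &str = "current-snapshot-id";

/// The parser owns the compatibility flag; callers never consult
/// global state such as environment variables.
pub struct TableMetadataParser {
    use_legacy_id: bool,
}

impl TableMetadataParser {
    pub fn new(use_legacy_id: bool) -> Self {
        Self { use_legacy_id }
    }

    /// Write: emit -1 for a missing snapshot only in legacy mode.
    pub fn write(&self, metadata: &TableMetadata) -> RawFields {
        let mut raw = RawFields::new();
        match (metadata.current_snapshot_id, self.use_legacy_id) {
            (Some(id), _) => { raw.insert(FIELD.to_string(), id); }
            (None, true) => { raw.insert(FIELD.to_string(), -1); }
            (None, false) => {} // field omitted
        }
        raw
    }

    /// Read: always map -1 to `None`, regardless of the flag.
    pub fn read(&self, raw: &RawFields) -> TableMetadata {
        let id = raw.get(FIELD).copied().filter(|id| *id != -1);
        TableMetadata { current_snapshot_id: id }
    }
}
```

Because the flag lives on the parser instance, two tables in the same process can use different parsers, which also addresses the per-table concern raised earlier in the thread.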

s-akhtar-baig · 2024-06-26T19:41:53Z

@liurenjie1024, I see. It makes sense to me and I will make changes accordingly. Thank you for adding your feedback and providing sample code!

s-akhtar-baig added 2 commits May 15, 2024 10:39

Make current-snapshot-id optional and add backwards compatibility

716f57d

Merge branch 'main' into snapshot_id_bug_fix

8b93176

s-akhtar-baig mentioned this pull request May 15, 2024

Enhancement: Add support for a config file in iceberg-rust #375

Closed

Fix code comments and CI errors

a0397b6

s-akhtar-baig added 2 commits May 15, 2024 11:49

Update testdata files

ceacfdc

Refactor code

1dfa828

Fokko mentioned this pull request May 23, 2024

feat: support append data file and add e2e test #349

Merged

liurenjie1024 reviewed May 27, 2024

s-akhtar-baig added 3 commits June 19, 2024 10:08

Add config for table metadata serde

2c4b942

Make parent snapshot id optional

92d57aa

Format code

2e86e3c


4 participants
