Skip to content

Make current-snapshot-id optional while maintaining backwards compatibility#374

Open
s-akhtar-baig wants to merge 8 commits intoapache:mainfrom
s-akhtar-baig:snapshot_id_bug_fix
Open

Make current-snapshot-id optional while maintaining backwards compatibility#374
s-akhtar-baig wants to merge 8 commits intoapache:mainfrom
s-akhtar-baig:snapshot_id_bug_fix

Conversation

@s-akhtar-baig
Copy link
Contributor

@s-akhtar-baig s-akhtar-baig commented May 15, 2024

Resolves #352

Problem: Previous versions of Java (<1.4.0) implementations incorrectly assume the optional attribute current-snapshot-id to be a required attribute in TableMetadata.

Solution: Use legacy-current-snapshot-id environment variable to force iceberg-rust to create and load a table with a metadata file compatible with the older versions.

Testing: Added new unit tests.

Future work:

@s-akhtar-baig
Copy link
Contributor Author

@Fokko, these changes make current-snapshot-id optional and uses a flag to maintain backwards compatibility. Let me know what you think.

Please note that, with the flag on we can create and load a table with -1 as the current-snapshot-id. This is different from pyiceberg which I believe only supports creating a table.

@s-akhtar-baig
Copy link
Contributor Author

Also, do we have a preference on where we want to document the "backwards compatibility" section?

@Fokko
Copy link
Contributor

Fokko commented May 23, 2024

Sorry for the late reply as I was touching grass.

We're trying to solve two problems:

  • Don't produce -1 since it is erroneous unless there is a snapshot with the ID -1. You can still create a table with -1 as the current snapshot ID if you set the flag. If you use Java <1.4.0 within your organization, this might be required to be able to read the tables.
  • When deserializing, and we encounter a -1 as the current snapshot ID, we should convert it into a None.

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @s-akhtar-baig for this pr. I've a concern to control this behavior throught environment, I prefer to pass an option to serializer/deserializer so that it would easier to maintain, what do you think?


// Skip serializing current snapshot if snapshot is none
// and `legacy-current-snapshot-id` is not enabled.
fn skip_current_snapshot(current_snapshot_id: &Option<i64>) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm hesitating to control this behavior using environment, maybe we should pass options to reader or writer of table metadata?

"write.summary.partition-limit": "100",
"write.parquet.compression-codec": "zstd"
},
"current-snapshot-id": -1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should add another test case rather than simply removing this?

@liurenjie1024
Copy link
Contributor

cc @Xuanwo @Fokko @sdd @marvinlanhenke What do you think?

@sdd
Copy link
Contributor

sdd commented May 27, 2024

One problem with doing it through an env var is that it applies to every table you hit in your service. I think it would be better if it was a config param so that you can configure it per table.

@s-akhtar-baig
Copy link
Contributor Author

@liurenjie1024 @sdd, sure, I can work on #375 and use a config file to set the value. Let me know if you have a different approach in mind, thanks!

@liurenjie1024
Copy link
Contributor

@liurenjie1024 @sdd, sure, I can work on #375 and use a config file to set the value. Let me know if you have a different approach in mind, thanks!

@s-akhtar-baig I think the first step maybe to add a config for the serializer/deseriazer of table metadata to control this behavior. Config file is just one approach to init the config, we should decouple these two things.

@s-akhtar-baig
Copy link
Contributor Author

@liurenjie1024, I have pushed some changes to have config values for TableMetadata in one place. Please let me know if the direction is right and/or if you had a different idea in mind. I will handle reviews on the tests once I have your feedback on the config changes.

For now, I am using the environment to load these values but future work involves loading from a config file on top of that.

@liurenjie1024
Copy link
Contributor

@liurenjie1024, I have pushed some changes to have config values for TableMetadata in one place. Please let me know if the direction is right and/or if you had a different idea in mind. I will handle reviews on the tests once I have your feedback on the config changes.

For now, I am using the environment to load these values but future work involves loading from a config file on top of that.

Hi, @s-akhtar-baig Thanks for the contribution. I have little concern about current approach because it seems not extensible to me. How about this:

pub struct TableMetadataParser {
   use_legacy_id: bool,
}

impl TableMetadataParser {
  pub async fn write(&self, output_file: OutputFile, table_metadata: &TableMetadata) {
   ....
 }

pub async fn read(&self, input_file: InputFile) -> Result<TableMetadata> {
  ....
}
}

Then instead of directly ser/de table metadata, we will control behavior using TableMetadataParser, what do you think?

@s-akhtar-baig
Copy link
Contributor Author

@liurenjie1024, I see. It makes sense to me and I will make changes accordingly. Thank you for adding your feedback and providing sample code!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Empty snapshot ID should be Null instead of -1

4 participants