Make current-snapshot-id optional while maintaining backwards compatibility#374
Make current-snapshot-id optional while maintaining backwards compatibility#374s-akhtar-baig wants to merge 8 commits intoapache:mainfrom
Conversation
|
@Fokko, these changes make current-snapshot-id optional and uses a flag to maintain backwards compatibility. Let me know what you think. Please note that, with the flag on we can create and load a table with -1 as the current-snapshot-id. This is different from pyiceberg which I believe only supports creating a table. |
|
Also, do we have a preference on where we want to document the "backwards compatibility" section? |
|
Sorry for the late reply as I was touching grass. We're trying to solve two problems:
|
liurenjie1024
left a comment
There was a problem hiding this comment.
Thanks @s-akhtar-baig for this pr. I've a concern to control this behavior throught environment, I prefer to pass an option to serializer/deserializer so that it would easier to maintain, what do you think?
|
|
||
| // Skip serializing current snapshot if snapshot is none | ||
| // and `legacy-current-snapshot-id` is not enabled. | ||
| fn skip_current_snapshot(current_snapshot_id: &Option<i64>) -> bool { |
There was a problem hiding this comment.
I'm hesitating to control this behavior using environment, maybe we should pass options to reader or writer of table metadata?
| "write.summary.partition-limit": "100", | ||
| "write.parquet.compression-codec": "zstd" | ||
| }, | ||
| "current-snapshot-id": -1, |
There was a problem hiding this comment.
Maybe we should add another test case rather than simply removing this?
|
cc @Xuanwo @Fokko @sdd @marvinlanhenke What do you think? |
|
One problem with doing it through an env var is that it applies to every table you hit in your service. I think it would be better if it was a config param so that you can configure it per table. |
|
@liurenjie1024 @sdd, sure, I can work on #375 and use a config file to set the value. Let me know if you have a different approach in mind, thanks! |
@s-akhtar-baig I think the first step maybe to add a config for the serializer/deseriazer of table metadata to control this behavior. Config file is just one approach to init the config, we should decouple these two things. |
|
@liurenjie1024, I have pushed some changes to have config values for TableMetadata in one place. Please let me know if the direction is right and/or if you had a different idea in mind. I will handle reviews on the tests once I have your feedback on the config changes. For now, I am using the environment to load these values but future work involves loading from a config file on top of that. |
Hi, @s-akhtar-baig Thanks for the contribution. I have little concern about current approach because it seems not extensible to me. How about this: pub struct TableMetadataParser {
use_legacy_id: bool,
}
impl TableMetadataParser {
pub async fn write(&self, output_file: OutputFile, table_metadata: &TableMetadata) {
....
}
pub async fn read(&self, input_file: InputFile) -> Result<TableMetadata> {
....
}
}Then instead of directly ser/de table metadata, we will control behavior using |
|
@liurenjie1024, I see. It makes sense to me and I will make changes accordingly. Thank you for adding your feedback and providing sample code! |
Resolves #352
Problem: Previous versions of Java (
<1.4.0) implementations incorrectly assume the optional attributecurrent-snapshot-idto be a required attribute in TableMetadata.Solution: Use
legacy-current-snapshot-idenvironment variable to force iceberg-rust to create and load a table with a metadata file compatible with the older versions.Testing: Added new unit tests.
Future work:
legacy-current-snapshot-idas a configuration value.