Core: allow creating v2 tables through table property #2887
Conversation
Thanks for working on this, @jackye1995! I think we want this to be more general than doing it in …
Force-pushed from c2aeb33 to bbc9740 (compare)
@rdblue thanks for the comment! I was originally thinking of only allowing people to create a table this way, with property updates having no effect, which is why I did it in the table builder directly. But now that I think about it, I realize it would also be beneficial to use this syntax to migrate existing tables to v2: ALTER TABLE SET TBLPROPERTIES ('format-version'='2'). So it does make sense to do it directly in …
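For reference, that upgrade DDL as it would be run from Spark (a sketch, assuming a SparkSession named spark and a placeholder table name):

// upgrade an existing Iceberg table to format v2 via the reserved property
spark.sql("ALTER TABLE local.db.sample SET TBLPROPERTIES ('format-version'='2')");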
 * <p>
 * Note: If you set the Iceberg format to an unreleased version, software stability is not guaranteed.
 */
public static final String PROPERTY_FORMAT_VERSION = "format-version";
Did we miss overriding the default format version to 2 in the replaceTransaction code path?
Yes, you are right. Updated!
Map<String, String> validProperties = newProperties;
int newFormatVersion = formatVersion;
if (newProperties.containsKey(PROPERTY_FORMAT_VERSION)) {
  validProperties = Maps.newHashMap(newProperties);
  newFormatVersion = Integer.parseInt(validProperties.remove(PROPERTY_FORMAT_VERSION));
}
I see this code block repeated a lot in this PR; I'd like to have a small abstracted static method:
private static int parseFormatVersion(Map<String, String> properties, int defaultFormatVersion) {
  return properties.containsKey(PROPERTY_FORMAT_VERSION) ?
      Integer.parseInt(properties.remove(PROPERTY_FORMAT_VERSION)) : defaultFormatVersion;
}
Then we could simplify this block as:
Map<String, String> validProperties = Maps.newHashMap(newProperties);
int newFormatVersion = parseFormatVersion(validProperties, formatVersion);
We also already have PropertyUtil that you can use to make this one line.
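For reference, a sketch of what that could look like with PropertyUtil.propertyAsInt (which reads an int property with a default); note the reserved key still has to be removed from the copy that is kept:

Map<String, String> validProperties = Maps.newHashMap(newProperties);
validProperties.remove(PROPERTY_FORMAT_VERSION);
// fall back to the current formatVersion when the property is absent
int newFormatVersion = PropertyUtil.propertyAsInt(newProperties, PROPERTY_FORMAT_VERSION, formatVersion);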
PR looks good to me, just left a minor comment to simplify the code.
@@ -756,4 +756,54 @@ public void testUpdateSchema() {
    Assert.assertEquals("Should return expected last column id",
        6, threeSchemaTable.lastColumnId());
  }

  @Test
  public void testCreateV2MetadataThroughTableProperty() {
Should we also add unit tests in the flink/spark/hive modules to verify that the end-to-end DDL works? If so, I think we could make that a separate PR.
It is a bit hard to test this in a unit test because the format version is not publicly accessible. I have tested with Spark on EMR and manually verified that the metadata file shows the correct version.
I think we could access the format version by using the following code:
Table table = ...
// unwrap the table to reach its TableOperations and current metadata
TableOperations ops = ((BaseTable) table).operations();
TableMetadata meta = ops.current();
int formatVersion = meta.formatVersion();
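With that accessor, a test could assert on the version directly; a minimal sketch in the JUnit 4 style used elsewhere in this PR ('catalog' and 'tableIdent' are placeholders):

Table table = catalog.loadTable(tableIdent);
TableMetadata meta = ((BaseTable) table).operations().current();
Assert.assertEquals("Should create v2 metadata", 2, meta.formatVersion());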
@@ -47,6 +47,19 @@
 * Metadata for a table.
 */
public class TableMetadata implements Serializable {

  /**
   * Iceberg always use the latest stable format version when creating a new table.
This isn't correct after this commit. I think it should be "Reserved table property for table format version. Iceberg will default a new table's format version to the latest stable and recommended version."
Also, I think this should note that incomplete or unstable versions cannot be selected using this property. When we add v3, this property should not be able to select it.
 * If a table updates this property, it will try to upgrade to the specified format version.
 * This helps developers to try out new features in an unreleased version or migrate existing tables to a new version.
 * <p>
 * Note: If you set the Iceberg format to an unreleased version, software stability is not guaranteed.
This reserved property should not allow setting the format version to anything other than a stable and adopted version.
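A minimal sketch of the implied check, assuming TableMetadata's existing SUPPORTED_TABLE_FORMAT_VERSION constant marks the latest adopted version (the final validation may differ):

// reject requests for unreleased or unstable format versions
Preconditions.checkArgument(formatVersion <= SUPPORTED_TABLE_FORMAT_VERSION,
    "Cannot create a table with an unsupported format version: v%s (supported: up to v%s)",
    formatVersion, SUPPORTED_TABLE_FORMAT_VERSION);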
@@ -90,6 +103,14 @@ static TableMetadata newTableMetadata(Schema schema,
                                      String location,
                                      Map<String, String> properties,
                                      int formatVersion) {
  // check if there is format version override in properties
  Map<String, String> validProperties = properties;
It would be better to rename the method argument (maybe rawProperties) to result in fewer changes and minimize the chance that a reference was missed. Also, if we are introducing reserved table properties, then we should make this more generic, like this:
Map<String, String> requestedProperties = rawProperties.entrySet().stream()
    .filter(entry -> !RESERVED_PROPERTIES.contains(entry.getKey()))
    .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
int formatVersion = PropertyUtil.propertyAsInt(
    rawProperties, PROPERTY_FORMAT_VERSION, DEFAULT_TABLE_FORMAT_VERSION);
@@ -90,6 +103,14 @@ static TableMetadata newTableMetadata(Schema schema,
                                      String location,
                                      Map<String, String> properties,
                                      int formatVersion) {
  // check if there is format version override in properties
  Map<String, String> validProperties = properties;
  int formatVersionOverride = formatVersion;
I think this needs to handle the conflict between formatVersion and properties. Since all of these methods are internal, we can change them however we like, but we should not have 2 ways to set the table version. I think there are two options:

1. Set tableVersion from properties in the 2 overrides and validate that properties does not contain format-version here.
2. Move the implementation to the override above that doesn't accept formatVersion. Then update this override to convert formatVersion to a property and validate that the properties passed here don't contain format-version.

I would go for option 2. That makes handling a set of reserved properties cleaner.
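A rough sketch of option 2, folding the formatVersion argument back into the property map and delegating to the override that reads the version from properties (illustrative, not the final implementation):

static TableMetadata newTableMetadata(Schema schema,
                                      PartitionSpec spec,
                                      String location,
                                      Map<String, String> properties,
                                      int formatVersion) {
  // reject the ambiguous case of setting the version in two places
  Preconditions.checkArgument(!properties.containsKey(PROPERTY_FORMAT_VERSION),
      "Cannot set %s in both properties and the formatVersion argument", PROPERTY_FORMAT_VERSION);
  Map<String, String> withVersion = Maps.newHashMap(properties);
  withVersion.put(PROPERTY_FORMAT_VERSION, String.valueOf(formatVersion));
  // delegate to the override that does not accept formatVersion
  return newTableMetadata(schema, spec, location, withVersion);
}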
One more option I just thought about: I've been considering adding a builder here. We could use a builder and then add some validation so that formatVersion can only be configured once.
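A minimal sketch of that builder idea, with hypothetical names (not the actual implementation):

public static class Builder {
  private Integer formatVersion = null; // null until configured

  public Builder formatVersion(int version) {
    // enforce that the format version is configured at most once
    Preconditions.checkState(this.formatVersion == null,
        "Cannot set format version twice: already set to v%s", this.formatVersion);
    this.formatVersion = version;
    return this;
  }
}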
    snapshotLog, addPreviousFile(file, lastUpdatedMillis, validProperties));

if (formatVersion != newFormatVersion) {
  metadata = metadata.upgradeToFormatVersion(newFormatVersion);
It seems a bit weird to call another instance method to produce a second TableMetadata here. I'm thinking that maybe the builder pattern would work better for complicated cases... Seems fine for now though.
Force-pushed from a7c1df9 to 466fe26 (compare)
@openinx thanks for the suggestion, I added tests for Spark, Flink and Hive.

@rdblue I agree with the suggestions for the reserved table properties; I was a bit hesitant about whether I should generalize it at this stage or not. Since we want to do that, I have moved those static variables to …

For the builder pattern, yes, I absolutely agree that we should have a builder given the number of arguments in the static methods and the variation of inputs. I can do that in a separate PR, and I was hoping this can get into 0.12.0 since we have voted to finalize format v2. It would be a bit awkward if the spec is finalized but people cannot try it out. Please let me know if this is possible. If so, I can focus on getting this PR merged first, and then do a best effort to get the builder in, which can be before or after 0.12.0.
@@ -57,31 +59,29 @@

private static final long ONE_MINUTE = TimeUnit.MINUTES.toMillis(1);

/**
 * @deprecated will be removed in 0.9.0; use newTableMetadata(Schema, PartitionSpec, String, Map) instead.
Thanks for removing this!
@@ -47,6 +48,7 @@
 * Metadata for a table.
 */
public class TableMetadata implements Serializable {
Nit: unnecessary whitespace change.
Map<String, String> newRawProperties = Maps.newHashMap();
newRawProperties.putAll(this.properties);
newRawProperties.putAll(updatedProperties);
Map<String, String> newProperties = unreservedProperties(newRawProperties);
I think it would be better to construct the properties directly rather than building a map and then rebuilding it. You should be able to use the same filter to add just the non-reserved properties from updatedProperties.
You could have unreservedProperties return something else, like a filtered Stream, if you wanted to still reuse it.
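For instance, a sketch of that variant, assuming a TableProperties.RESERVED_PROPERTIES set and java.util.stream imports:

private static Stream<Map.Entry<String, String>> unreservedProperties(Map<String, String> rawProperties) {
  return rawProperties.entrySet().stream()
      .filter(entry -> !TableProperties.RESERVED_PROPERTIES.contains(entry.getKey()));
}

// merge both property sets in one pass; updated values win on key conflicts
Map<String, String> newProperties = Stream.concat(
        unreservedProperties(this.properties), unreservedProperties(updatedProperties))
    .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, (oldValue, newValue) -> newValue));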
@jackye1995, I did a complete review. This is about ready to go, but I think there are a couple of things to fix. Thanks!
@rdblue fixed based on your comments, should be good for another look.
}

public static TableMetadata newTableMetadata(Schema schema,
                                             PartitionSpec spec,
                                             String location,
                                             Map<String, String> properties) {
-  return newTableMetadata(schema, spec, SortOrder.unsorted(), location, properties, DEFAULT_TABLE_FORMAT_VERSION);
+  return newTableMetadata(schema, spec, SortOrder.unsorted(), location, unreservedProperties(properties),
+      PropertyUtil.propertyAsInt(properties, TableProperties.RESERVED_PROPERTY_FORMAT_VERSION,
nit: I think defining some helper variables will make it a little bit easier to read. I usually try to avoid splitting code over multiple lines if possible.
SortOrder sortOrder = SortOrder.unsorted();
int formatVersion = PropertyUtil.propertyAsInt(properties,
    TableProperties.RESERVED_PROPERTY_FORMAT_VERSION,
    DEFAULT_TABLE_FORMAT_VERSION);
return newTableMetadata(schema, spec, sortOrder, location, unreservedProperties(properties), formatVersion);
 * <p>
 * Note: incomplete or unstable versions cannot be selected using this property.
 */
public static final String RESERVED_PROPERTY_FORMAT_VERSION = "format-version";
nit: Do we add the RESERVED_PROPERTY_ prefix to be explicit? I am wondering whether it is worth making the var name longer, as we have RESERVED_PROPERTIES below. Just asking, I am fine either way.
Yeah, that's a good point. I had the prefix because there was no RESERVED_PROPERTIES set, but now it becomes a bit redundant.
LGTM
The latest changes are only to Javadoc, so I'm going to merge this based on the test result for 34094f2.
Merge remote-tracking branch 'upstream/merge-master-20210816' into master

## What does this MR mainly address?
Merge upstream/master to bring in recent bug fixes and optimizations.

## What does this MR change?
Key PRs of interest:
> Predicate pushdown support: https://github.com/apache/iceberg/pull/2358, https://github.com/apache/iceberg/pull/2926, https://github.com/apache/iceberg/pull/2777/files
> Spark: fix the error when writing an empty dataset by simply skipping the write, apache#2960
> Flink: add uidPrefix to operators to make it easier to track multiple Iceberg sink jobs in the UI, apache#288
> Spark: fix nested struct pruning, apache#2877
> Allow creating v2 format tables through table properties, apache#2887
> Add the SortRewriteStrategy framework to incrementally support different rewrite strategies, apache#2609 (WIP: apache#2829)
> Spark: support configuring Hadoop properties for a catalog, apache#2792
> Spark: read/write support for timestamps without timezone, apache#2757
> Spark MicroBatch: support a config property to skip delete snapshots, apache#2752
> Spark: V2 RewriteDatafilesAction support
> Core: Add validation for row-level deletes with rewrites, apache#2865
> Schema time travel support: add schema-id (Core: add schema id to snapshot)
> Spark extensions: support identifier fields operations, apache#2560
> Parquet: Update to 1.12.0, apache#2441
> Hive: Vectorized ORC reads for Hive, apache#2613
> Spark: Add an action to remove all referenced files, apache#2415

## How was this MR tested?
Unit tests.
Based on the discussion on the dev list, it would be good to have a way to directly create experimental v2 tables. This is my current proposal:
When format-version is specified in the table properties section of a CREATE TABLE DDL, we use the specified format version; otherwise, we use the default version.

Some considerations I had:
- format-version is not a part of TableProperties, but just a property used at table creation time. The value is never stored as an actual table property, and updating this property has no effect on the table version.
- … format.version, but instead use table-format, to distinguish it from other actual table properties.

It is a bit hard to test this in a unit test because the format version is not publicly accessible. I have tested with Spark on EMR and manually verified that the metadata file shows the correct version.
Example:
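(The example itself did not survive the page scrape; a plausible reconstruction, assuming a SparkSession named spark and a made-up table name:)

// create a new table directly at format v2 via the reserved property
spark.sql("CREATE TABLE local.db.sample (id bigint, data string) " +
    "USING iceberg TBLPROPERTIES ('format-version'='2')");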
@rdblue @openinx