
Core: allow creating v2 tables through table property #2887

Merged: 8 commits into apache:master on Aug 3, 2021

Conversation

@jackye1995 (Contributor) commented Jul 28, 2021

Based on the discussion on the dev list, it would be good to have a way to directly create experimental v2 tables. This is my current proposal:

When format-version is specified in the table properties section of a CREATE TABLE DDL, we use that format version; otherwise we use the default version.

Some considerations I had:

  1. format-version is not a part of TableProperties, but just a property used at table creation time. The value is never stored as an actual table property, and updating this property has no effect on the table version.
  2. Because of 1, we do not use a traditional dotted property syntax like format.version, but instead use format-version to distinguish it from other actual table properties.
  3. I am trying to make this a long-term solution for future format versions, so I made the property a public variable that can be imported.
  4. The table properties section of a CREATE TABLE DDL is used because it exists in all SQL dialects, which means that through this change all engines can test experimental features.

It is a bit hard to test this in a unit test because the format version is not publicly accessible. I have tested with Spark on EMR and manually verified that the metadata file shows the correct version.

Example:

CREATE TABLE sample (
    id bigint,
    data string,
    category string)
USING iceberg
OPTIONS ('format-version'='2')
PARTITIONED BY (category);

@rdblue @openinx

The github-actions bot added the core label on Jul 28, 2021
@rdblue (Contributor) commented Jul 28, 2021

Thanks for working on this, @jackye1995!

I think we want this to be more general than doing it in BaseMetastoreCatalog. If we are going to set the table version through a reserved property, then I think we should build the reserved property into TableMetadata so it is consistent across all catalogs.

@jackye1995 (Contributor, Author) commented:

> If we are going to set the table version through a reserved property, then I think we should build the reserved property into TableMetadata so it is consistent across all catalogs.

@rdblue thanks for the comment! I was originally thinking of only allowing people to create tables this way, with property updates having no effect, which is why I did it in the table builder directly. But now that I think about it, I realize it would also be beneficial to use this syntax to migrate existing tables to v2:

ALTER TABLE SET TBLPROPERTIES ('format-version'='2')

So it does make sense to do it directly in TableMetadata. I have updated the code and also tested with Spark to verify it works; please let me know if there are any other concerns.

* <p>
* Note: If you set the Iceberg format to an unreleased version, software stability is not guaranteed.
*/
public static final String PROPERTY_FORMAT_VERSION = "format-version";
Member commented:

Did we miss overriding the default format version to 2 in the replaceTransaction code path?

Contributor (Author) replied:

Yes, you are right. Updated!

Comment on lines 703 to 708
Map<String, String> validProperties = newProperties;
int newFormatVersion = formatVersion;
if (newProperties.containsKey(PROPERTY_FORMAT_VERSION)) {
validProperties = Maps.newHashMap(newProperties);
newFormatVersion = Integer.parseInt(validProperties.remove(PROPERTY_FORMAT_VERSION));
}
Member commented:

I see this code block repeated in several places in this PR; I'd like to have a small abstracted static method:

  private static int parseFormatVersion(Map<String, String> properties, int defaultFormatVersion) {
    return properties.containsKey(PROPERTY_FORMAT_VERSION) ?
        Integer.parseInt(properties.remove(PROPERTY_FORMAT_VERSION)) : defaultFormatVersion;
  }

Then we could simplify this block as:

    Map<String, String> validProperties = Maps.newHashMap(newProperties);
    int newFormatVersion = parseFormatVersion(validProperties, formatVersion);

Contributor commented:

We also already have PropertyUtil, which you can use to make this one line.
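
For illustration, the PropertyUtil version might look like this (a sketch; note that, unlike the remove-based block above, this only reads the value, so stripping the reserved key from the map would still be a separate step):

    int newFormatVersion = PropertyUtil.propertyAsInt(newProperties, PROPERTY_FORMAT_VERSION, formatVersion);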

@openinx (Member) left a comment:

The PR looks good to me; I just left a minor comment to simplify the code.

@@ -756,4 +756,54 @@ public void testUpdateSchema() {
Assert.assertEquals("Should return expected last column id",
6, threeSchemaTable.lastColumnId());
}

@Test
public void testCreateV2MetadataThroughTableProperty() {
Member commented:

Should we also add unit tests in the flink/spark/hive modules to verify that the end-to-end DDL works? If so, I think we could make that a separate PR.

Member commented:

> It is a bit hard to test this in a unit test because the format version is not publicly accessible. I have tested with Spark on EMR and manually verified that the metadata file shows the correct version.

I think we could access the format version by using the following code:

Table table = ...
TableOperations ops = ((BaseTable) table).operations();
TableMetadata meta = ops.current();
int formatVersion = meta.formatVersion();
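
Putting that together, a minimal test sketch (illustrative only: it assumes a schema variable in scope, an unpartitioned spec, and that a null location is acceptable for metadata-only tests):

    // Hypothetical unit test: create metadata with the reserved property set
    // and verify the resulting format version.
    @Test
    public void testCreateV2MetadataThroughTableProperty() {
      Map<String, String> properties = ImmutableMap.of("format-version", "2");
      TableMetadata meta = TableMetadata.newTableMetadata(
          schema, PartitionSpec.unpartitioned(), null, properties);
      Assert.assertEquals("Should create v2 metadata", 2, meta.formatVersion());
    }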

@@ -47,6 +47,19 @@
* Metadata for a table.
*/
public class TableMetadata implements Serializable {

/**
* Iceberg always use the latest stable format version when creating a new table.
Contributor commented:

This isn't correct after this commit. I think it should be: "Reserved table property for table format version. Iceberg will default a new table's format version to the latest stable and recommended version."

Also, I think this should note that incomplete or unstable versions cannot be selected using this property. When we add v3, this property should not be able to select it.

* If a table updates this property, it will try to upgrade to the specified format version.
* This helps developers to try out new features in an unreleased version or migrate existing tables to a new version.
* <p>
* Note: If you set the Iceberg format to an unreleased version, software stability is not guaranteed.
Contributor commented:

This reserved property should not allow setting the format version to anything other than a stable and adopted version.
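
A minimal sketch of such a guard (assuming a SUPPORTED_TABLE_FORMAT_VERSION constant that caps what the property may select):

    // Hypothetical validation: refuse any version beyond the latest adopted format.
    Preconditions.checkArgument(newFormatVersion <= SUPPORTED_TABLE_FORMAT_VERSION,
        "Cannot set format-version to v%s: latest supported version is v%s",
        newFormatVersion, SUPPORTED_TABLE_FORMAT_VERSION);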

@@ -90,6 +103,14 @@ static TableMetadata newTableMetadata(Schema schema,
String location,
Map<String, String> properties,
int formatVersion) {
// check if there is format version override in properties
Map<String, String> validProperties = properties;
@rdblue (Contributor) commented Aug 1, 2021:

It would be better to rename the method argument (maybe rawProperties) to result in fewer changes and minimize the chance that a reference was missed. Also, if we are introducing reserved table properties, then we should make this more generic, like this:

    Map<String, String> requestedProperties = rawProperties.entrySet().stream()
        .filter(entry -> !RESERVED_PROPERTIES.contains(entry.getKey()))
        .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    int formatVersion = PropertyUtil.propertyAsInt(
        rawProperties, PROPERTY_FORMAT_VERSION, DEFAULT_TABLE_FORMAT_VERSION);
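
For reference, the RESERVED_PROPERTIES set referenced above might be defined as simply as this (a sketch; the PR later moves these constants into TableProperties):

    // Hypothetical definition: property keys interpreted at metadata level
    // and never stored as actual table properties.
    private static final Set<String> RESERVED_PROPERTIES = ImmutableSet.of(PROPERTY_FORMAT_VERSION);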

@@ -90,6 +103,14 @@ static TableMetadata newTableMetadata(Schema schema,
String location,
Map<String, String> properties,
int formatVersion) {
// check if there is format version override in properties
Map<String, String> validProperties = properties;
int formatVersionOverride = formatVersion;
Contributor commented:

I think this needs to handle the conflict between formatVersion and properties. Since all of these methods are internal, we can change them however we like, but we should not have two ways to set the table version. I think there are two options:

  1. Set the table version from properties in the two overrides, and validate that properties does not contain format-version here.
  2. Move the implementation to the override above that doesn't accept formatVersion. Then update this override to convert formatVersion to a property and validate that the properties passed here don't contain format-version.

I would go for option 2; it makes handling a set of reserved properties cleaner (a sketch follows).
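
A sketch of what option 2 might look like (a hypothetical shape, not the PR's actual code): the formatVersion overload converts the argument into the reserved property and delegates to the properties-based overload, which becomes the single entry point.

    // Hypothetical sketch of option 2: the explicit formatVersion argument is
    // converted into the reserved property; passing both is rejected.
    static TableMetadata newTableMetadata(Schema schema, PartitionSpec spec, SortOrder order,
                                          String location, Map<String, String> properties,
                                          int formatVersion) {
      Preconditions.checkArgument(!properties.containsKey(PROPERTY_FORMAT_VERSION),
          "Cannot set format-version both as an argument and as a table property");
      Map<String, String> withVersion = Maps.newHashMap(properties);
      withVersion.put(PROPERTY_FORMAT_VERSION, String.valueOf(formatVersion));
      return newTableMetadata(schema, spec, order, location, withVersion);
    }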

Contributor commented:

One more option I just thought about: I've been considering adding a builder here. We could use a builder and then add validation so that formatVersion can only be configured once.
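
A rough sketch of that builder idea (entirely hypothetical API, shown only to illustrate the shape such a builder could take):

    // Hypothetical builder: the format version can be set at most once,
    // with validation happening in build().
    TableMetadata metadata = TableMetadata.buildFrom(base)
        .upgradeFormatVersion(2)
        .setProperties(updatedProperties)
        .build();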

snapshotLog, addPreviousFile(file, lastUpdatedMillis, validProperties));

if (formatVersion != newFormatVersion) {
metadata = metadata.upgradeToFormatVersion(newFormatVersion);
Contributor commented:

It seems a bit weird to call another instance method to produce a second TableMetadata here. I'm thinking that maybe the builder pattern would work better for complicated cases like this... Seems fine for now, though.

@jackye1995 (Contributor, Author) commented:

@openinx thanks for the suggestion; I added tests for Spark, Flink, and Hive.

@rdblue I agree with the suggestions for the reserved table properties; I was a bit hesitant about whether I should generalize it at this stage. Since we want to do that, I have moved those static variables to TableProperties, which I think is a better place to expose them. Please let me know if you agree with this placement.

For the builder pattern, yes, I absolutely agree that we should have a builder, given the number of arguments in the static methods and the variation of inputs. I can do that in a separate PR, and I was hoping this could get into 0.12.0 since we have voted to finalize format v2. It would be a bit awkward if the spec were finalized but people could not try it out. Please let me know if this is possible. If so, I can focus on getting this PR merged first and then make a best effort to get the builder in, either before or after 0.12.0.

@@ -57,31 +59,29 @@

private static final long ONE_MINUTE = TimeUnit.MINUTES.toMillis(1);

/**
* @deprecated will be removed in 0.9.0; use newTableMetadata(Schema, PartitionSpec, String, Map) instead.
Contributor commented:

Thanks for removing this!

@@ -47,6 +48,7 @@
* Metadata for a table.
*/
public class TableMetadata implements Serializable {

Contributor commented:

Nit: unnecessary whitespace change.

Map<String, String> newRawProperties = Maps.newHashMap();
newRawProperties.putAll(this.properties);
newRawProperties.putAll(updatedProperties);
Map<String, String> newProperties = unreservedProperties(newRawProperties);
Contributor commented:

I think it would be better to construct the properties directly rather than building a map and then rebuilding it. You should be able to use the same filter to add just the non-reserved properties from updatedProperties.

Contributor commented:

You could have unreservedProperties return something else, like a filtered Stream, if you wanted to still reuse it.
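
For illustration, unreservedProperties might be written as a reusable filter like this (a sketch; it assumes the RESERVED_PROPERTIES set lives in TableProperties, and returns a filtered map rather than a Stream):

    // Hypothetical helper: drop reserved keys so they are never persisted
    // as actual table properties.
    private static Map<String, String> unreservedProperties(Map<String, String> rawProperties) {
      return rawProperties.entrySet().stream()
          .filter(e -> !TableProperties.RESERVED_PROPERTIES.contains(e.getKey()))
          .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }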

@rdblue (Contributor) left a comment:

@jackye1995, I did a complete review. This is about ready to go, but I think there are a couple of things to fix. Thanks!

@jackye1995 (Contributor, Author) commented:

@rdblue fixed based on your comments; this should be good for another look.

}

public static TableMetadata newTableMetadata(Schema schema,
PartitionSpec spec,
String location,
Map<String, String> properties) {
return newTableMetadata(schema, spec, SortOrder.unsorted(), location, properties, DEFAULT_TABLE_FORMAT_VERSION);
return newTableMetadata(schema, spec, SortOrder.unsorted(), location, unreservedProperties(properties),
PropertyUtil.propertyAsInt(properties, TableProperties.RESERVED_PROPERTY_FORMAT_VERSION,
Contributor commented:

nit: I think defining some helper variables will make this a little easier to read. I usually try to avoid splitting code over multiple lines if possible.

SortOrder sortOrder = SortOrder.unsorted();
int formatVersion = PropertyUtil.propertyAsInt(properties,
    TableProperties.RESERVED_PROPERTY_FORMAT_VERSION,
    DEFAULT_TABLE_FORMAT_VERSION);
return newTableMetadata(schema, spec, sortOrder, location, unreservedProperties(properties), formatVersion); 

* <p>
* Note: incomplete or unstable versions cannot be selected using this property.
*/
public static final String RESERVED_PROPERTY_FORMAT_VERSION = "format-version";
Contributor commented:

nit: Do we add the RESERVED_PROPERTY_ prefix to be explicit? I am wondering whether it is worth making the variable name longer, given that we have RESERVED_PROPERTIES below.

Just asking; I am fine either way.

Contributor (Author) replied:

Yeah, that's a good point. I had the prefix because there was no RESERVED_PROPERTIES set, but now it is a bit redundant.

@aokolnychyi (Contributor) left a comment:

LGTM

@rdblue (Contributor) commented Aug 3, 2021

The latest changes are only to Javadoc, so I'm going to merge this based on the test result for 34094f2.

@rdblue rdblue merged commit 0b2f065 into apache:master Aug 3, 2021
chenjunjiedada pushed a commit to chenjunjiedada/incubator-iceberg that referenced this pull request on Oct 20, 2021
Merge remote-tracking branch 'upstream/merge-master-20210816' into master

## What does this MR mainly address?

Merge upstream/master to pull in recent bug fixes and optimizations.

## What are the changes in this MR?

Key PRs of interest:
> Predicate pushdown support, https://github.com/apache/iceberg/pull/2358, https://github.com/apache/iceberg/pull/2926, https://github.com/apache/iceberg/pull/2777/files
> Spark: skip writing empty datasets instead of raising an error, apache#2960
> Flink: add uidPrefix to operators to make tracking multiple Iceberg sink jobs in the UI easier, apache#288
> Spark: fix nested struct pruning, apache#2877
> Allow creating v2 format tables via table properties, apache#2887
> Add the SortRewriteStrategy framework, progressively supporting different rewrite strategies, apache#2609 (WIP: apache#2829)
> Spark: support configuring Hadoop properties per catalog, apache#2792
> Spark: read/write support for timestamps without timezone, apache#2757
> Spark MicroBatch: support the skip delete snapshots property, apache#2752
> Spark: V2 RewriteDatafilesAction support
> Core: Add validation for row-level deletes with rewrites, apache#2865
> Schema time travel related work: add schema-id (Core: add schema id to snapshot)
> Spark Extension: support identifier fields operations, apache#2560
> Parquet: Update to 1.12.0, apache#2441
> Hive: Vectorized ORC reads for Hive, apache#2613
> Spark: Add an action to remove all referenced files, apache#2415

## How was this MR tested?

UT (unit tests)