
Core: allow creating v2 tables through table property #2887

Merged: 8 commits into apache:master on Aug 3, 2021

Conversation

@jackye1995 (Contributor) commented Jul 28, 2021

Based on the discussion on the dev list, it would be good to have a way to directly create experimental v2 tables. This is my current proposal:

When format-version is specified in the table properties section of a CREATE TABLE DDL, we use that format version; otherwise we use the default version.

Some considerations I had:

  1. format-version is not a part of TableProperties, but just a property used at table creation time. The value is never stored as an actual table property, and updating this property has no effect on the table version.
  2. Because of 1, we do not use a traditional dotted property syntax like format.version, but instead use format-version to distinguish it from other actual table properties.
  3. I am trying to make this a long-term solution for future format versions, so I made the property a public variable that can be imported.
  4. The table properties section of a CREATE TABLE DDL is used because it exists in all SQL dialects, which means that through this change all engines can test experimental features.

It is a bit hard to test this in a unit test because the format version is not publicly accessible. I have tested with Spark on EMR and manually verified that the metadata file shows the correct version.

Example:

CREATE TABLE sample (
    id bigint,
    data string,
    category string)
USING iceberg
OPTIONS ('format-version'='2')
PARTITIONED BY (category);

@rdblue @openinx

The github-actions bot added the core label on Jul 28, 2021
@rdblue (Contributor) commented Jul 28, 2021

Thanks for working on this, @jackye1995!

I think we want this to be more general than doing it in BaseMetastoreCatalog. If we are going to set the table version through a reserved property, then I think we should build the reserved property into TableMetadata so it is consistent across all catalogs.

@jackye1995 (Contributor, Author) commented:

> If we are going to set the table version through a reserved property, then I think we should build the reserved property into TableMetadata so it is consistent across all catalogs.

@rdblue thanks for the comment! I was originally thinking of only allowing people to create tables this way, with property updates having no effect, which is why I did it in the table builder directly. But now that I think about it, I realize it would also be beneficial to use this syntax to migrate existing tables to v2:

ALTER TABLE SET TBLPROPERTIES ('format-version'='2')

So it does make sense to do it directly in TableMetadata. I have updated the code and also tested with Spark to verify it works; please let me know if there are any other concerns.

* <p>
* Note: If you set the Iceberg format to an unreleased version, software stability is not guaranteed.
*/
public static final String PROPERTY_FORMAT_VERSION = "format-version";
Member commented:

Did we miss overriding the default format version to 2 in the replaceTransaction code path?

Contributor (Author) replied:

Yes, you are right. Updated!

Comment on lines 703 to 708
Map<String, String> validProperties = newProperties;
int newFormatVersion = formatVersion;
if (newProperties.containsKey(PROPERTY_FORMAT_VERSION)) {
validProperties = Maps.newHashMap(newProperties);
newFormatVersion = Integer.parseInt(validProperties.remove(PROPERTY_FORMAT_VERSION));
}
Member commented:

I see this code block repeated in several places in this PR; I'd like to have a small abstracted static method:

  private static int parseFormatVersion(Map<String, String> properties, int defaultFormatVersion) {
    return properties.containsKey(PROPERTY_FORMAT_VERSION) ?
        Integer.parseInt(properties.remove(PROPERTY_FORMAT_VERSION)) : defaultFormatVersion;
  }

Then we could simplify this block as:

    Map<String, String> validProperties = Maps.newHashMap(newProperties);
    int newFormatVersion = parseFormatVersion(validProperties, formatVersion);

Contributor commented:

We also already have PropertyUtil, which you can use to make this one line.
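
For illustration, the PropertyUtil version might look like this (a sketch; note that, unlike the remove-based block above, this only reads the value, so stripping the reserved key from the map would still be a separate step):

    int newFormatVersion = PropertyUtil.propertyAsInt(newProperties, PROPERTY_FORMAT_VERSION, formatVersion);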

@openinx (Member) left a comment:

The PR looks good to me; I just left a minor comment to simplify the code.

@@ -756,4 +756,54 @@ public void testUpdateSchema() {
Assert.assertEquals("Should return expected last column id",
6, threeSchemaTable.lastColumnId());
}

@Test
public void testCreateV2MetadataThroughTableProperty() {
Member commented:

Should we also add unit tests in the flink/spark/hive modules to verify that the end-to-end DDL works? If so, I think we could make that a separate PR.

Member commented:

> It is a bit hard to test this in a unit test because the format version is not publicly accessible. I have tested with Spark on EMR and manually verified that the metadata file shows the correct version.

I think we could access the format version by using the following code:

Table table = ...
TableOperations ops = ((BaseTable) table).operations();
TableMetadata meta = ops.current();
int formatVersion = meta.formatVersion();
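
Putting that together, a minimal test sketch (illustrative only: it assumes a schema variable in scope, an unpartitioned spec, and that a null location is acceptable for metadata-only tests):

    // Hypothetical unit test: create metadata with the reserved property set
    // and verify the resulting format version.
    @Test
    public void testCreateV2MetadataThroughTableProperty() {
      Map<String, String> properties = ImmutableMap.of("format-version", "2");
      TableMetadata meta = TableMetadata.newTableMetadata(
          schema, PartitionSpec.unpartitioned(), null, properties);
      Assert.assertEquals("Should create v2 metadata", 2, meta.formatVersion());
    }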

@@ -47,6 +47,19 @@
* Metadata for a table.
*/
public class TableMetadata implements Serializable {

/**
* Iceberg always use the latest stable format version when creating a new table.
Contributor commented:

This isn't correct after this commit. I think it should be: "Reserved table property for table format version. Iceberg will default a new table's format version to the latest stable and recommended version."

Also, I think this should note that incomplete or unstable versions cannot be selected using this property. When we add v3, this property should not be able to select it.

* If a table updates this property, it will try to upgrade to the specified format version.
* This helps developers to try out new features in an unreleased version or migrate existing tables to a new version.
* <p>
* Note: If you set the Iceberg format to an unreleased version, software stability is not guaranteed.
Contributor commented:

This reserved property should not allow setting the format version to anything other than a stable and adopted version.
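
A minimal sketch of such a guard (assuming a SUPPORTED_TABLE_FORMAT_VERSION constant that caps what the property may select):

    // Hypothetical validation: refuse any version beyond the latest adopted format.
    Preconditions.checkArgument(newFormatVersion <= SUPPORTED_TABLE_FORMAT_VERSION,
        "Cannot set format-version to v%s: latest supported version is v%s",
        newFormatVersion, SUPPORTED_TABLE_FORMAT_VERSION);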

@@ -90,6 +103,14 @@ static TableMetadata newTableMetadata(Schema schema,
String location,
Map<String, String> properties,
int formatVersion) {
// check if there is format version override in properties
Map<String, String> validProperties = properties;
@rdblue (Contributor) commented Aug 1, 2021:

It would be better to rename the method argument (maybe rawProperties) to result in fewer changes and minimize the chance that a reference was missed. Also, if we are introducing reserved table properties, then we should make this more generic, like this:

    Map<String, String> requestedProperties = rawProperties.entrySet().stream()
        .filter(entry -> !RESERVED_PROPERTIES.contains(entry.getKey()))
        .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    int formatVersion = PropertyUtil.propertyAsInt(
        rawProperties, PROPERTY_FORMAT_VERSION, DEFAULT_TABLE_FORMAT_VERSION);
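
For reference, the RESERVED_PROPERTIES set referenced above might be defined as simply as this (a sketch; the PR later moves these constants into TableProperties):

    // Hypothetical definition: property keys interpreted at metadata level
    // and never stored as actual table properties.
    private static final Set<String> RESERVED_PROPERTIES = ImmutableSet.of(PROPERTY_FORMAT_VERSION);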

@@ -90,6 +103,14 @@ static TableMetadata newTableMetadata(Schema schema,
String location,
Map<String, String> properties,
int formatVersion) {
// check if there is format version override in properties
Map<String, String> validProperties = properties;
int formatVersionOverride = formatVersion;
Contributor commented:

I think this needs to handle the conflict between formatVersion and properties. Since all of these methods are internal, we can change them however we like, but we should not have two ways to set the table version. I think there are two options:

  1. Set the table version from properties in the two overrides, and validate that properties does not contain format-version here.
  2. Move the implementation to the override above that doesn't accept formatVersion. Then update this override to convert formatVersion to a property and validate that the properties passed here don't contain format-version.

I would go for option 2; it makes handling a set of reserved properties cleaner (a sketch follows).
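
A sketch of what option 2 might look like (a hypothetical shape, not the PR's actual code): the formatVersion overload converts the argument into the reserved property and delegates to the properties-based overload, which becomes the single entry point.

    // Hypothetical sketch of option 2: the explicit formatVersion argument is
    // converted into the reserved property; passing both is rejected.
    static TableMetadata newTableMetadata(Schema schema, PartitionSpec spec, SortOrder order,
                                          String location, Map<String, String> properties,
                                          int formatVersion) {
      Preconditions.checkArgument(!properties.containsKey(PROPERTY_FORMAT_VERSION),
          "Cannot set format-version both as an argument and as a table property");
      Map<String, String> withVersion = Maps.newHashMap(properties);
      withVersion.put(PROPERTY_FORMAT_VERSION, String.valueOf(formatVersion));
      return newTableMetadata(schema, spec, order, location, withVersion);
    }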

Contributor commented:

One more option I just thought about: I've been considering adding a builder here. We could use a builder and then add validation so that formatVersion can only be configured once.
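
A rough sketch of that builder idea (entirely hypothetical API, shown only to illustrate the shape such a builder could take):

    // Hypothetical builder: the format version can be set at most once,
    // with validation happening in build().
    TableMetadata metadata = TableMetadata.buildFrom(base)
        .upgradeFormatVersion(2)
        .setProperties(updatedProperties)
        .build();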

snapshotLog, addPreviousFile(file, lastUpdatedMillis, validProperties));

if (formatVersion != newFormatVersion) {
metadata = metadata.upgradeToFormatVersion(newFormatVersion);
Contributor commented:

It seems a bit weird to call another instance method to produce a second TableMetadata here. I'm thinking that maybe the builder pattern would work better for complicated cases like this... Seems fine for now, though.

@jackye1995 (Contributor, Author) commented:

@openinx thanks for the suggestion; I added tests for Spark, Flink, and Hive.

@rdblue I agree with the suggestions for the reserved table properties; I was a bit hesitant about whether I should generalize it at this stage. Since we want to do that, I have moved those static variables to TableProperties, which I think is a better place to expose them. Please let me know if you agree with this placement.

For the builder pattern, yes, I absolutely agree that we should have a builder, given the number of arguments in the static methods and the variation of inputs. I can do that in a separate PR, and I was hoping this could get into 0.12.0 since we have voted to finalize format v2. It would be a bit awkward if the spec were finalized but people could not try it out. Please let me know if this is possible. If so, I can focus on getting this PR merged first and then make a best effort to get the builder in, either before or after 0.12.0.

@@ -57,31 +59,29 @@

private static final long ONE_MINUTE = TimeUnit.MINUTES.toMillis(1);

/**
* @deprecated will be removed in 0.9.0; use newTableMetadata(Schema, PartitionSpec, String, Map) instead.
Contributor commented:

Thanks for removing this!

@@ -47,6 +48,7 @@
* Metadata for a table.
*/
public class TableMetadata implements Serializable {

Contributor commented:

Nit: unnecessary whitespace change.

Map<String, String> newRawProperties = Maps.newHashMap();
newRawProperties.putAll(this.properties);
newRawProperties.putAll(updatedProperties);
Map<String, String> newProperties = unreservedProperties(newRawProperties);
Contributor commented:

I think it would be better to construct the properties directly rather than building a map and then rebuilding it. You should be able to use the same filter to add just the non-reserved properties from updatedProperties.

Contributor commented:

You could have unreservedProperties return something else, like a filtered Stream, if you wanted to still reuse it.
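
For illustration, unreservedProperties might be written as a reusable filter like this (a sketch; it assumes the RESERVED_PROPERTIES set lives in TableProperties, and returns a filtered map rather than a Stream):

    // Hypothetical helper: drop reserved keys so they are never persisted
    // as actual table properties.
    private static Map<String, String> unreservedProperties(Map<String, String> rawProperties) {
      return rawProperties.entrySet().stream()
          .filter(e -> !TableProperties.RESERVED_PROPERTIES.contains(e.getKey()))
          .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }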

@rdblue (Contributor) left a comment:

@jackye1995, I did a complete review. This is about ready to go, but I think there are a couple of things to fix. Thanks!

@jackye1995 (Contributor, Author) commented:

@rdblue fixed based on your comments; this should be good for another look.

}

public static TableMetadata newTableMetadata(Schema schema,
PartitionSpec spec,
String location,
Map<String, String> properties) {
return newTableMetadata(schema, spec, SortOrder.unsorted(), location, properties, DEFAULT_TABLE_FORMAT_VERSION);
return newTableMetadata(schema, spec, SortOrder.unsorted(), location, unreservedProperties(properties),
PropertyUtil.propertyAsInt(properties, TableProperties.RESERVED_PROPERTY_FORMAT_VERSION,
Contributor commented:

nit: I think defining some helper variables will make this a little easier to read. I usually try to avoid splitting code over multiple lines if possible.

SortOrder sortOrder = SortOrder.unsorted();
int formatVersion = PropertyUtil.propertyAsInt(properties,
    TableProperties.RESERVED_PROPERTY_FORMAT_VERSION,
    DEFAULT_TABLE_FORMAT_VERSION);
return newTableMetadata(schema, spec, sortOrder, location, unreservedProperties(properties), formatVersion); 

* <p>
* Note: incomplete or unstable versions cannot be selected using this property.
*/
public static final String RESERVED_PROPERTY_FORMAT_VERSION = "format-version";
Contributor commented:

nit: Do we add the RESERVED_PROPERTY_ prefix to be explicit? I am wondering whether it is worth making the variable name longer, given that we have RESERVED_PROPERTIES below.

Just asking; I am fine either way.

Contributor (Author) replied:

Yeah, that's a good point. I had the prefix because there was no RESERVED_PROPERTIES set, but now it is a bit redundant.

@aokolnychyi (Contributor) left a comment:

LGTM

@rdblue (Contributor) commented Aug 3, 2021

The latest changes are only to Javadoc, so I'm going to merge this based on the test result for 34094f2.

@rdblue rdblue merged commit 0b2f065 into apache:master Aug 3, 2021
chenjunjiedada pushed a commit to chenjunjiedada/incubator-iceberg that referenced this pull request on Oct 20, 2021
Merge remote-tracking branch 'upstream/merge-master-20210816' into master

## What does this MR mainly address?

Merge upstream/master to pull in recent bug fixes and optimizations.

## What are the changes in this MR?

Key PRs of interest:
> Predicate pushdown support, https://github.com/apache/iceberg/pull/2358, https://github.com/apache/iceberg/pull/2926, https://github.com/apache/iceberg/pull/2777/files
> Spark: skip writing empty datasets instead of raising an error, apache#2960
> Flink: add uidPrefix to operators to make tracking multiple Iceberg sink jobs in the UI easier, apache#288
> Spark: fix nested struct pruning, apache#2877
> Allow creating v2 format tables via table properties, apache#2887
> Add the SortRewriteStrategy framework, progressively supporting different rewrite strategies, apache#2609 (WIP: apache#2829)
> Spark: support configuring Hadoop properties per catalog, apache#2792
> Spark: read/write support for timestamps without timezone, apache#2757
> Spark MicroBatch: support the skip delete snapshots property, apache#2752
> Spark: V2 RewriteDatafilesAction support
> Core: Add validation for row-level deletes with rewrites, apache#2865
> Schema time travel related work: add schema-id (Core: add schema id to snapshot)
> Spark Extension: support identifier fields operations, apache#2560
> Parquet: Update to 1.12.0, apache#2441
> Hive: Vectorized ORC reads for Hive, apache#2613
> Spark: Add an action to remove all referenced files, apache#2415

## How was this MR tested?

UT (unit tests)