
Hive: support create table with partition transform through table property #2701

Closed
jackye1995 wants to merge 5 commits from the hive-partition-text branch

Conversation

jackye1995
Contributor

fixes #2681

Currently, many organizations are stuck with Hive 2 or 3 and cannot leverage the latest Hive features to create Iceberg tables with hidden partitions. This PR provides the following syntax:

CREATE TABLE table (id bigint, category string)
TBLPROPERTIES ('iceberg.partitioning'='bucket(id,16)|category')

as a short-term solution for those users to use Iceberg while transitioning to newer Hive versions.

@pvary @yyanyy

@github-actions bot added the MR label on Jun 16, 2021
@marton-bod
Collaborator

I haven't reviewed the code yet, but the syntax seems to differ from Spark SQL and the new syntax we've recently introduced in upstream Hive.
Here: bucket(id, 16)
Spark SQL/Upstream Hive/Impala: bucket(16, id) (https://iceberg.apache.org/spark/#create-table)
Can we synchronize the syntax in this PR as well to follow the same pattern?

@pvary
Contributor

pvary commented Jun 21, 2021

@jackye1995: How are we handling these temporary workarounds in Iceberg? Where do we notify the users that this might not be supported in newer versions?

Thanks,
Peter

@yyanyy
Contributor

yyanyy commented Jun 21, 2021

> @jackye1995: How are we handling these temporary workarounds in Iceberg? Where do we notify the users that this might not be supported in newer versions?

I wonder, on the Hive side, which Hive versions apache/hive#2333 will be merged into. If it's only going to be merged into the latest version (Hive 3.0/4.0?), is it possible to permanently support the syntax raised in this PR in older Hive versions, and in the latest versions reject this table property and only accept partitioning in the proper partitioned by form?

AssertHelpers.assertThrows("should fail when input to transform is wrong",
    ValidationException.class,
    "Cannot parse 2-arg partition transform from text part",
    () -> HiveIcebergPartitionTextParser.fromText(SCHEMA, "bucket(8, name)"));
Contributor

Nit: could add cases like bucket(name) for 2-arg transforms, and day(1, created) for 1-arg transforms; see the sketch below.

"Cannot parse 2-arg partition transform from text part: %s", trimmedPart);

String columnName = matcher.group(1).trim();
ValidationException.check(schema.findField(columnName) != null,
Contributor

Nit: I think we can delegate such validation exceptions to the partition spec builder so that we don't need to check them ourselves. The builder has to catch more complicated cases, like unexpected column types, anyway.

Contributor Author

Yeah, agreed; these can always be caught by the partition spec builder, but I was thinking it might be better to fail fast. Let me think more about where the best line to cut is here.

@jackye1995
Contributor Author

> I haven't reviewed the code yet, but the syntax seems to differ from Spark SQL and the new syntax we've recently introduced in upstream Hive

will fix, thanks for noticing this!

> How are we handling these temporary workarounds in Iceberg

Good question, I am considering adding a warning log with a reference to the latest syntax and hive version. Any other thoughts?

@pvary
Contributor

pvary commented Jun 21, 2021

> I wonder on Hive side, which Hive versions will apache/hive#2333 be merged into? If it's only going to be merged in the latest version (Hive 3.0/4.0?) is it possible to permanently support the syntax raised in this PR in older Hive versions, and in latest versions we reject this table property and only accept them in the proper partitioned by format?

In Hive we usually try to keep the feature set stable for old releases and add new features only into new releases. So my first answer is that Hive 4.0 and onwards will include HIVE-25179 (https://issues.apache.org/jira/browse/HIVE-25179, apache/hive#2333), but none of the 2.x, 3.x lines will get it.

My main concern is that we create a feature which becomes widely used and then we cannot remove it in favor of the standard solution later. Having a warning message would definitely help, so the users will be notified that this is only a temporary solution 😄

Thanks,
Peter

@jackye1995 force-pushed the hive-partition-text branch from 760fc45 to 4c4d683 on June 22, 2021 03:34
@jackye1995
Contributor Author

@pvary warning added, and @marton-bod arg order fixed; it should be ready for another review.

@yyanyy I thought about the exceptions, and I decided to:

  1. remove the column checks in 1-arg and 2-arg transforms
  2. use IllegalArgumentException for all exceptions thrown in the parser, so they all throw a single type of exception (see the sketch below)
  3. keep the column name check for the last identity transform case, just to provide a more informative message

@github-actions bot added the hive label on Jun 22, 2021
@jackye1995
Contributor Author

restart flaky tests

@jackye1995 closed this Jun 22, 2021
@jackye1995 reopened this Jun 22, 2021
@jackye1995 force-pushed the hive-partition-text branch from 1107767 to ebc3a5c on June 22, 2021 18:51
Preconditions.checkArgument(text.length() < PARTITION_TEXT_MAX_LENGTH,
    "Partition spec text too long: max allowed length %s, but got %s",
    PARTITION_TEXT_MAX_LENGTH, text.length());
String[] parts = text.split(PIPE_DELIMITER);
Contributor

As @marton-bod pointed out in one of our discussions, there is a possibility of columns with special characters in the column names, like , or |. See HIVE-25222 (Fix reading Iceberg tables with a comma in column names).

Contributor

This could further complicate the parsing, and I am not sure it is worth it, but this is definitely something that the users should be aware of.

Collaborator

@marton-bod commented Jun 23, 2021

One other thing to consider is that users will be used to the comma (,) delimiter as the syntax from Spark SQL/Hive 4/Impala, e.g. partitioned by (bucket(16, id), category), so the pipe could feel unnatural.
I know this would complicate the parsing logic, but it is something to consider as well.

Contributor Author

Thanks for the suggestion, I have updated the PR to allow an alternative delimiter, similar to the approach taken in HIVE-25222.

Regarding using , instead of |: yes, | is chosen mainly to avoid complexity and to allow the use of String.split instead of complex parsing logic. Because this is not a long-term solution, and other systems do not really have examples of parsing a delimited list of expressions, I think it is relatively reasonable to use |. Please let me know if it is not enough.

PartitionSpec expected = PartitionSpec.builderFor(schema)
    .bucket("i|d", 16).alwaysNull("na,me").build();
Assert.assertEquals(expected, HiveIcebergPartitionTextParser.fromText(
    schema, "bucket(16,i|d);alwaysNull(na,me)", ";"));
Contributor

What happens in this case:

    Schema schema = new Schema(
        optional(1, "i,d", Types.LongType.get()),
        optional(2, "na|me", Types.StringType.get()));
    PartitionSpec expected = PartitionSpec.builderFor(schema)
        .bucket("i,d", 16).alwaysNull("na|me").build();
    Assert.assertEquals(expected, HiveIcebergPartitionTextParser.fromText(
        schema, "bucket(16,i,d);alwaysNull(na|me)", ";"));

@pvary
Contributor

pvary commented Jun 26, 2021

@jackye1995: I prefer to have a complete solution even for a temporary feature, but if it is too much effort I am not against accepting some compromise.

I will be OOO for 2 weeks. Please be patient, or try to find another reviewer if it is important on your end to push this change.

Thanks, and sorry for the inconvenience.
Peter

@jackye1995
Contributor Author

@marton-bod @pvary I think Peter raised a good point: the current solution of replacing the delimiter does not solve the issue, because we have 2 delimiters to escape in this case. After thinking about it for a while, I believe the best way is to not go with that approach and instead use backquotes to escape column names. This also fits better with the Spark SQL specification for column names. I have completely rewritten the parser to do character-by-character parsing instead of simply using a String.split; a sketch of the escaping idea is below. Please let me know if there is any case not covered here, thanks!

Contributor

@yyanyy left a comment

For the two delimiters to escape, do you mean the one between partition fields and the one within each field's definition (in the column name)? I wonder if we can continue to do the regex matching with some changes that account for the delimiters inside, since char-by-char parsing may be more complicated and error-prone. The major thing regex matching wouldn't handle is when a user has '(' or ')' in their field name, but I'm not sure if that's even allowed (and if it is, we may want to add an additional test case here too).

      transformPart = sb.toString();
      sb.delete(0, sb.length());
      break;
    case ')':
Contributor

We might want to do some sanity checking for ')' if we go with char-by-char parsing, since it looks like these three cases would currently pass but ideally they shouldn't (a possible check is sketched after the list):

  • day(created|employee_info.employer
  • day(created|employee_info.employer)
  • day(created|employee_info.employer)|bucket(16,id

@Test
public void testEscapedColumns() {
  Schema schema = new Schema(
      optional(1, "i|d", Types.LongType.get()),
Contributor

Minor: might want to add an additional case to test other special characters like ";" in column names. It looks like we currently don't require escaping them (e.g. bucket(16,i;d) works); I think that might be fine, but I am not sure if we want to make things consistent.

@jackye1995
Contributor Author

jackye1995 commented Jun 29, 2021

@yyanyy there are 2 things we have to address here:

  1. parsing columns with , in the transform, such as bucket(16, i,d)
  2. parsing columns with |, such as col1|co|l2

I don't see a good way to use a hybrid approach of regex + character parsing to address this while keeping the user experience consistent.

Based on what Spark has, and since we are already using a character-parsing approach, it might be better to directly use , as the delimiter and also support the transform AS fieldName syntax, so we can specify something like bucket(16,id) AS shard, category (see the sketch below). I will update based on this.

@jackye1995
Contributor Author

@pvary Hi Peter, I left this PR untouched just to get your feedback before I publish the newer version. The current idea is that I can implement this char-by-char parser and directly allow inputs in a format like bucket(16, id) AS shard, category, which will be exactly the same as the Spark SQL input format. Please let me know if you are okay with this approach; if so, I will post the updated version.

@pvary
Contributor

pvary commented Jul 13, 2021

Hi @jackye1995!

I am honestly not sure how much effort we want to put into this change and where we stop.

This might be interesting from here:

> Any column name that is specified within backticks (`) is treated literally. Within a backtick string, use double backticks (``) to represent a backtick character. Backtick quotation also enables the use of reserved keywords for table and column identifiers.

Since this is intended as a temporary solution, I think we should keep it as simple as possible and clearly state what is supported and what is not. It really depends on your intended use-case.

@jackye1995
Contributor Author

I am closing this PR. We had mixed feedback, both internally and in open source, around this approach. After experimenting internally a bit, I think it does discourage people from upgrading to a higher Hive version and provides a backdoor for people to stick with the old syntax. Please let me know if anyone thinks otherwise; thanks for all the reviews.

@jackye1995 closed this Aug 18, 2021