
Hive: support create table with partition transform through table property #2701

Closed
jackye1995 wants to merge 5 commits from the hive-partition-text branch

Conversation

jackye1995
Contributor

fixes #2681

Currently, many organizations are stuck with Hive 2 or 3 and cannot leverage the latest Hive features to create Iceberg tables with hidden partitions. This PR provides the following syntax:

CREATE TABLE table (id bigint, category string)
TBLPROPERTIES ('iceberg.partitioning'='bucket(id,16)|category')

as a short-term solution for those users to use Iceberg while transitioning to newer Hive versions.

@pvary @yyanyy

@github-actions bot added the MR label on Jun 16, 2021
@marton-bod
Collaborator

I haven't reviewed the code yet, but the syntax seems to differ from Spark SQL and the new syntax we've recently introduced in upstream Hive.
Here: bucket(id, 16)
Spark SQL/Upstream Hive/Impala: bucket(16, id) (https://iceberg.apache.org/spark/#create-table)
Can we synchronize the syntax in this PR as well to follow the same pattern?

@pvary
Contributor

pvary commented Jun 21, 2021

@jackye1995: How are we handling these temporary workarounds in Iceberg? Where do we notify the users that this might not be supported in newer versions?

Thanks,
Peter

@yyanyy
Contributor

yyanyy commented Jun 21, 2021

> @jackye1995: How are we handling these temporary workarounds in Iceberg? Where do we notify the users that this might not be supported in newer versions?

I wonder, on the Hive side, which Hive versions apache/hive#2333 will be merged into. If it's only going to be merged into the latest version (Hive 3.0/4.0?), is it possible to permanently support the syntax raised in this PR in older Hive versions, and in the latest versions reject this table property and only accept partitioning in the proper partitioned by form?

AssertHelpers.assertThrows("should fail when input to transform is wrong",
    ValidationException.class,
    "Cannot parse 2-arg partition transform from text part",
    () -> HiveIcebergPartitionTextParser.fromText(SCHEMA, "bucket(8, name)"));
Contributor

Nit: could add cases like bucket(name) for 2-arg transforms, and day(1, created) for 1-arg transforms; see the sketch below.

"Cannot parse 2-arg partition transform from text part: %s", trimmedPart);

String columnName = matcher.group(1).trim();
ValidationException.check(schema.findField(columnName) != null,
Contributor

Nit: I think we can delegate such validation exceptions to the partition spec builder so that we don't need to check them ourselves. The builder has to catch more complicated cases, like unexpected column types, anyway.

Contributor Author

Yeah, agreed; these can always be caught by the partition spec builder, but I was thinking it might be better to fail fast. Let me think more about where the best line to cut is here.

@jackye1995
Contributor Author

> I haven't reviewed the code yet, but the syntax seems to differ from Spark SQL and the new syntax we've recently introduced in upstream Hive

will fix, thanks for noticing this!

> How are we handling these temporary workarounds in Iceberg

Good question, I am considering adding a warning log with a reference to the latest syntax and hive version. Any other thoughts?

@pvary
Contributor

pvary commented Jun 21, 2021

> I wonder on Hive side, which Hive versions will apache/hive#2333 be merged into? If it's only going to be merged in the latest version (Hive 3.0/4.0?) is it possible to permanently support the syntax raised in this PR in older Hive versions, and in latest versions we reject this table property and only accept them in the proper partitioned by format?

In Hive we usually try to keep the feature set stable for old releases and add new features only into new releases. So my first answer is that Hive 4.0 and onwards will include HIVE-25179 (https://issues.apache.org/jira/browse/HIVE-25179, apache/hive#2333), but none of the 2.x, 3.x lines will get it.

My main concern is that we create a feature which becomes widely used and then we cannot remove it in favor of the standard solution later. Having a warning message would definitely help, so the users will be notified that this is only a temporary solution 😄

Thanks,
Peter

@jackye1995 force-pushed the hive-partition-text branch from 760fc45 to 4c4d683 on June 22, 2021 03:34
@jackye1995
Contributor Author

@pvary warning added, and @marton-bod arg order fixed; it should be ready for another review.

@yyanyy I thought about the exceptions, and I decided to:

  1. remove the column checks in 1-arg and 2-arg transforms
  2. use IllegalArgumentException for all exceptions thrown in the parser, so they all throw a single type of exception (see the sketch below)
  3. keep the column name check for the last identity transform case, just to provide a more informative message

@github-actions bot added the hive label on Jun 22, 2021
@jackye1995
Contributor Author

restart flaky tests

@jackye1995 closed this Jun 22, 2021
@jackye1995 reopened this Jun 22, 2021
@jackye1995 force-pushed the hive-partition-text branch from 1107767 to ebc3a5c on June 22, 2021 18:51
Preconditions.checkArgument(text.length() < PARTITION_TEXT_MAX_LENGTH,
    "Partition spec text too long: max allowed length %s, but got %s",
    PARTITION_TEXT_MAX_LENGTH, text.length());
String[] parts = text.split(PIPE_DELIMITER);
Contributor

As @marton-bod pointed out in one of our discussions, there is a possibility of columns with special characters in the column names, like , or |. See HIVE-25222 (Fix reading Iceberg tables with a comma in column names).

Contributor

This could further complicate the parsing, and I am not sure it is worth it, but this is definitely something that the users should be aware of.

Collaborator

@marton-bod commented Jun 23, 2021

One other thing to consider is that users will be used to the comma (,) delimiter as the syntax from Spark SQL/Hive 4/Impala, e.g. partitioned by (bucket(16, id), category), so the pipe could feel unnatural.
I know this would complicate the parsing logic, but it is something to consider as well.

Contributor Author

Thanks for the suggestion, I have updated the PR to allow an alternative delimiter, similar to the approach taken in HIVE-25222.

Regarding using , instead of |: yes, | is chosen mainly to avoid complexity and to allow the use of String.split instead of complex parsing logic. Because this is not a long-term solution, and other systems do not really have examples of parsing a delimited list of expressions, I think it is relatively reasonable to use |. Please let me know if it is not enough.

PartitionSpec expected = PartitionSpec.builderFor(schema)
    .bucket("i|d", 16).alwaysNull("na,me").build();
Assert.assertEquals(expected, HiveIcebergPartitionTextParser.fromText(
    schema, "bucket(16,i|d);alwaysNull(na,me)", ";"));
Contributor

What happens in this case:

    Schema schema = new Schema(
        optional(1, "i,d", Types.LongType.get()),
        optional(2, "na|me", Types.StringType.get()));
    PartitionSpec expected = PartitionSpec.builderFor(schema)
        .bucket("i,d", 16).alwaysNull("na|me").build();
    Assert.assertEquals(expected, HiveIcebergPartitionTextParser.fromText(
        schema, "bucket(16,i,d);alwaysNull(na|me)", ";"));

@pvary
Contributor

pvary commented Jun 26, 2021

@jackye1995: I prefer to have a complete solution even for a temporary feature, but if it is too much effort I am not against accepting some compromise.

I will be OOO for 2 weeks. Please be patient, or try to find another reviewer if it is important on your end to push this change.

Thanks, and sorry for the inconvenience.
Peter

@jackye1995
Contributor Author

@marton-bod @pvary I think Peter raised a good point: the current solution of replacing the delimiter does not solve the issue, because we have 2 delimiters to escape in this case. After thinking about it for a while, I believe the best way is to not go with that approach and instead use backquotes to escape column names. This also fits better with the Spark SQL specification for column names. I have completely rewritten the parser to do character-by-character parsing instead of simply using a String.split; a sketch of the escaping idea is below. Please let me know if there is any case not covered here, thanks!

Contributor

@yyanyy left a comment

For the two delimiters to escape, do you mean the one between partition fields and the one within each field's definition (in the column name)? I wonder if we can continue to do the regex matching with some changes that account for the delimiters inside, since char-by-char parsing may be more complicated and error-prone. The major thing regex matching wouldn't handle is when a user has '(' or ')' in their field name, but I'm not sure if that's even allowed (and if it is, we may want to add an additional test case here too).

      transformPart = sb.toString();
      sb.delete(0, sb.length());
      break;
    case ')':
Contributor

We might want to do some sanity checking for ')' if we go with char-by-char parsing, since it looks like these three cases would currently pass but ideally they shouldn't (a possible check is sketched after the list):

  • day(created|employee_info.employer
  • day(created|employee_info.employer)
  • day(created|employee_info.employer)|bucket(16,id

@Test
public void testEscapedColumns() {
  Schema schema = new Schema(
      optional(1, "i|d", Types.LongType.get()),
Contributor

Minor: might want to add an additional case to test other special characters like ";" in column names. It looks like we currently don't require escaping them (e.g. bucket(16,i;d) works); I think that might be fine, but I am not sure if we want to make things consistent.

@jackye1995
Contributor Author

jackye1995 commented Jun 29, 2021

@yyanyy there are 2 things we have to address here:

  1. parsing columns with , in the transform, such as bucket(16, i,d)
  2. parsing columns with |, such as col1|co|l2

I don't see a good way to use a hybrid approach of regex + character parsing to address this while keeping the user experience consistent.

Based on what Spark has, and since we are already using a character-parsing approach, it might be better to directly use , as the delimiter and also support the transform AS fieldName syntax, so we can specify something like bucket(16,id) AS shard, category (see the sketch below). I will update based on this.

@jackye1995
Contributor Author

@pvary Hi Peter, I left this PR untouched just to get your feedback before I publish the newer version. The current idea is that I can implement this char-by-char parser and directly allow inputs in a format like bucket(16, id) AS shard, category, which will be exactly the same as the Spark SQL input format. Please let me know if you are okay with this approach; if so, I will post the updated version.

@pvary
Contributor

pvary commented Jul 13, 2021

Hi @jackye1995!

I am honestly not sure how much effort we want to put into this change and where we stop.

This might be interesting from here:

> Any column name that is specified within backticks (`) is treated literally. Within a backtick string, use double backticks (``) to represent a backtick character. Backtick quotation also enables the use of reserved keywords for table and column identifiers.

Since this is intended as a temporary solution, I think we should keep it as simple as possible and clearly state what is supported and what is not. It really depends on your intended use-case.

@jackye1995
Contributor Author

I am closing this PR. We had mixed feedback, both internally and in open source, around this approach. After experimenting internally a bit, I think it does discourage people from upgrading to a higher Hive version and provides a backdoor for people to stick with the old syntax. Please let me know if anyone thinks otherwise; thanks for all the reviews.

@jackye1995 closed this Aug 18, 2021