Hive: support create table with partition transform through table property #2701
Conversation
I haven't reviewed the code yet, but the syntax seems to differ from Spark SQL and the new syntax we've recently introduced in upstream Hive.
@jackye1995: How are we handling these temporary workarounds in Iceberg? Where do we notify the users that this might not be supported in newer versions? Thanks,
I wonder, on the Hive side, which Hive versions apache/hive#2333 will be merged into. If it's only going to be merged in the latest version (Hive 3.0/4.0?), is it possible to permanently support the syntax raised in this PR in older Hive versions, and in the latest versions reject this table property and only accept the proper syntax?
```java
AssertHelpers.assertThrows("should fail when input to transform is wrong",
    ValidationException.class,
    "Cannot parse 2-arg partition transform from text part",
    () -> HiveIcebergPartitionTextParser.fromText(SCHEMA, "bucket(8, name)"));
```
Nit: could add cases like `bucket(name)` for 2-arg transforms and `day(1, created)` for 1-arg transforms.
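For illustration, the suggested cases might look like this, reusing the helpers from the snippet above; the expected error messages are guesses at what the parser would report:

```java
// Sketch only: the "Cannot parse" message fragments are assumptions.
AssertHelpers.assertThrows("should fail when a 2-arg transform only gets one argument",
    ValidationException.class,
    "Cannot parse",
    () -> HiveIcebergPartitionTextParser.fromText(SCHEMA, "bucket(name)"));

AssertHelpers.assertThrows("should fail when a 1-arg transform gets two arguments",
    ValidationException.class,
    "Cannot parse",
    () -> HiveIcebergPartitionTextParser.fromText(SCHEMA, "day(1, created)"));
```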
"Cannot parse 2-arg partition transform from text part: %s", trimmedPart); | ||
|
||
```java
String columnName = matcher.group(1).trim();
ValidationException.check(schema.findField(columnName) != null,
```
Nit: I think we can delegate such validation to the partition spec builder so that we don't need to check it ourselves? It has to catch more complicated cases like unexpected column types anyway.
Yeah, agreed, these can always be caught by the partition spec builder, but I was thinking it might be better to fail fast. Let me think more about where the best line to cut is here.
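To make the trade-off concrete, here is a rough sketch of the two options; only the findField check comes from the diff above, the rest is illustrative:

```java
// Option 1: fail fast in the parser, before building the spec.
String columnName = matcher.group(1).trim();
ValidationException.check(schema.findField(columnName) != null,
    "Cannot find column '%s' in schema", columnName);

// Option 2: skip the explicit check and let the builder validate instead;
// PartitionSpec.builderFor(schema).bucket(columnName, width) rejects unknown
// source columns and also covers cases like unsupported column types.
```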
will fix, thanks for noticing this!
Good question, I am considering adding a warning log with a reference to the latest syntax and Hive version. Any other thoughts?
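Something roughly like this, as a sketch (the class name is illustrative and the wording is only a suggestion; PARTITIONED BY SPEC refers to the Hive 4 syntax from HIVE-25179):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class PartitionTextWarning {  // illustrative holder class, not from the PR
  private static final Logger LOG = LoggerFactory.getLogger(PartitionTextWarning.class);

  // Emitted once when a partition spec is created from the table property.
  static void warnDeprecatedSyntax() {
    LOG.warn("Partition transforms via a table property are a temporary workaround for Hive 2/3; "
        + "Hive 4 supports PARTITIONED BY SPEC natively (HIVE-25179), which should be preferred.");
  }
}
```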
In Hive we usually try to keep the feature set stable for old releases and add new features only into the new releases. So my first answer is that Hive 4.0 and onwards will include https://issues.apache.org/jira/browse/HIVE-25179 (apache/hive#2333), but none of the 2.x, 3.x releases will get it. My main concern is that we create a feature which becomes widely used and then we can not remove it later. Thanks,
Force-pushed from 760fc45 to 4c4d683
@pvary warning added, and @marton-bod arg order fixed, it should be ready for another review. @yyanyy I thought about the exceptions, and I decided to
restart flaky tests
Force-pushed from 1107767 to ebc3a5c
```java
Preconditions.checkArgument(text.length() < PARTITION_TEXT_MAX_LENGTH,
    "Partition spec text too long: max allowed length %s, but got %s",
    PARTITION_TEXT_MAX_LENGTH, text.length());
String[] parts = text.split(PIPE_DELIMITER);
```
As @marton-bod pointed out in one of our discussions, there is a possibility of columns with special characters in the column names, like `,` and `|`. See HIVE-25222 (Fix reading Iceberg tables with a comma in column names).
This could further complicate the parsing, and I am not sure it is worth it, but this is definitely something that the users should be aware of.
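Concretely (illustrative snippet, not code from the PR), a plain split on the pipe delimiter cuts such a column name apart:

```java
// Column names may contain special characters such as ',' and '|'
// (see HIVE-25222), which collide with the spec text delimiters:
String text = "bucket(16,i|d)|truncate(4,name)";
String[] parts = text.split("\\|");
// parts is ["bucket(16,i", "d)", "truncate(4,name)"]: 3 parts instead of 2
```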
One other thing to consider is that users will be used to the `,` (comma) delimiter as the syntax from Spark SQL/Hive 4/Impala, e.g. `partitioned by (bucket(16, id), category)`, so the pipe could feel unnatural. I know this would complicate the parsing logic, but it is something to consider as well.
Thanks for the suggestion, I have updated the code to allow an alternative delimiter, similar to the approach taken in HIVE-25222. Regarding using `,` instead of `|`: yes, `|` is chosen mainly to avoid complexity and allow the use of `String.split` instead of complex parsing logic. Because this is not a long-term solution, and other systems do not really have examples of parsing a delimited list of expressions, I think it is relatively reasonable to use `|`. Please let me know if it is not enough.
```java
PartitionSpec expected = PartitionSpec.builderFor(schema)
    .bucket("i|d", 16).alwaysNull("na,me").build();
Assert.assertEquals(expected, HiveIcebergPartitionTextParser.fromText(
    schema, "bucket(16,i|d);alwaysNull(na,me)", ";"));
```
What happens in this case:

```java
Schema schema = new Schema(
    optional(1, "i,d", Types.LongType.get()),
    optional(2, "na|me", Types.StringType.get()));
PartitionSpec expected = PartitionSpec.builderFor(schema)
    .bucket("i,d", 16).alwaysNull("na|me").build();
Assert.assertEquals(expected, HiveIcebergPartitionTextParser.fromText(
    schema, "bucket(16,i,d);alwaysNull(na|me)", ";"));
```
@jackye1995: I prefer to have a complete solution even for a temporary feature, but if it is too much effort I am not against accepting some compromise. I will be OOO for 2 weeks. Please be patient, or try to find another reviewer if it is important on your end to push this change. Thanks, and sorry for the inconvenience.
@marton-bod @pvary I think Peter raised a good point: the current solution of replacing the delimiter does not solve the issue, because we have 2 delimiters to escape in this case. After thinking about it for a while, I think the best way to go is to drop this approach and instead use backquotes to escape column names. This also fits better with the Spark SQL specification for column names. I have completely rewritten the parser to do character-by-character parsing instead of simply using a `String.split`.
For the two delimiters to escape, do you mean between partition fields and within each field's definition (on the column name)? I wonder if we can continue to do the regex matching with some changes that account for the delimiters inside, since char-by-char parsing may be more complicated and error prone. The major thing regex matching wouldn't handle is when a user has '(' or ')' in their field name, but I'm not sure if that's even allowed (and if it is, we may want to add an additional test case here too).
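For example, a pattern along these lines (illustrative, not the PR's actual regex) handles a 2-arg transform but stops at the first ')', so a field name containing ')' would end the match early:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative 2-arg transform pattern: name(width, column)
Pattern twoArg = Pattern.compile("(\\w+)\\(\\s*(\\d+)\\s*,\\s*([^)]+)\\)");
Matcher m = twoArg.matcher("bucket(16, id)");
if (m.matches()) {
  System.out.println(m.group(1) + " " + m.group(2) + " " + m.group(3));  // bucket 16 id
}
// A column literally named "i)d" cannot match, because [^)]+ stops at its ')'.
```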
```java
    transformPart = sb.toString();
    sb.delete(0, sb.length());
    break;
  case ')':
```
We might want to do some sanity check for ')' if we go with char-by-char parsing, since it looks like these three cases would currently pass but ideally they shouldn't:

- `day(created|employee_info.employer`
- `day(created|employee_info.employer)`
- `day(created|employee_info.employer)|bucket(16,id`
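A minimal depth check could reject the unbalanced first and third cases (illustrative only, the names are not from the PR):

```java
// Rejects unbalanced parentheses such as "day(created|employee_info.employer"
// before any transform parsing begins.
static void checkBalancedParens(String text) {
  int depth = 0;
  for (char c : text.toCharArray()) {
    if (c == '(') {
      depth++;
    } else if (c == ')') {
      depth--;
      if (depth < 0) {  // a ')' with no matching '('
        throw new IllegalArgumentException("Unbalanced ')' in partition spec text: " + text);
      }
    }
  }
  if (depth != 0) {  // a '(' that is never closed
    throw new IllegalArgumentException("Missing ')' in partition spec text: " + text);
  }
}
```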
```java
@Test
public void testEscapedColumns() {
  Schema schema = new Schema(
      optional(1, "i|d", Types.LongType.get()),
```
Minor: might want to add an additional case to test other special characters like `;` in column names. It looks like we currently don't require escaping them (e.g. `bucket(16,i;d)` works); I think it might be fine, but I'm not sure if we want to make things consistent.
@yyanyy there are 2 things we have to address here:

I don't see a good way to use a hybrid approach of regex + character parsing that addresses this and keeps the user experience consistent. Based on what Spark has, since we are already using a character-parsing approach, it might be better to directly use `,`.
@pvary Hi Peter, I left this PR untouched just to get your feedback before I publish the newer version. The current idea is that I can implement this char-by-char parser and directly allow inputs in the same format as Spark SQL.
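To sketch the idea (simplified, not the actual parser): split only on top-level delimiters, treating anything inside backticks or parentheses as opaque.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper: splits spec text on the given top-level delimiter,
// ignoring delimiters inside backtick-escaped names and inside parentheses.
static List<String> splitTopLevel(String text, char delimiter) {
  List<String> parts = new ArrayList<>();
  StringBuilder sb = new StringBuilder();
  boolean quoted = false;  // inside a backtick-escaped column name
  int depth = 0;           // parenthesis nesting depth
  for (char c : text.toCharArray()) {
    if (c == '`') {
      quoted = !quoted;
    } else if (!quoted && c == '(') {
      depth++;
    } else if (!quoted && c == ')') {
      depth--;
    }
    if (c == delimiter && !quoted && depth == 0) {
      parts.add(sb.toString().trim());
      sb.setLength(0);
    } else {
      sb.append(c);
    }
  }
  parts.add(sb.toString().trim());
  return parts;
}

// splitTopLevel("bucket(16, `i,d`), category", ',') -> ["bucket(16, `i,d`)", "category"]
```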
Hi @jackye1995! I am honestly not sure how much effort we want to put into this change and where we should stop. This might be the interesting part: since this is intended as a temporary solution, I think we should keep it as simple as possible, and we should clearly state what is supported and what is not. It really depends on your intended use case.
I am closing this PR. We had mixed feedback both internally and in open source around this approach. After experimenting internally a bit, I think it does discourage people from upgrading to a higher Hive version, and provides a backdoor for people to stick with the old syntax. Please let me know if anyone thinks otherwise; thanks for all the reviews.
fixes #2681
Currently many organizations are stuck with Hive 2 or 3 and cannot leverage the latest Hive features to create Iceberg tables with hidden partitioning. This PR provides the following syntax as a short-term solution for those users to use Iceberg while transitioning to newer Hive versions.
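Purely as an illustration, a table created this way might look roughly like the following sketch; the property key is a placeholder, not necessarily the name used in this PR:

```java
// Hypothetical DDL sketch: the TBLPROPERTIES key below is a placeholder,
// not necessarily the property name introduced by this PR.
String ddl =
    "CREATE TABLE customers (id BIGINT, name STRING, created TIMESTAMP) "
        + "STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' "
        + "TBLPROPERTIES ('iceberg.partition.spec.text'='bucket(16,id)|day(created)')";
```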
@pvary @yyanyy