SPARK-2686 Add Length and OctetLen support to Spark SQL #1586
Conversation
Can one of the admins verify this patch?
```scala
override def foldable = child.foldable
def nullable = child.nullable

override def eval(input: Row): EvaluatedType = {
```
I'd just put this in `Strlen` since that is the only place it is used.
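A hedged reading of this suggestion, with the catalyst boilerplate elided (the `Strlen` shape below is assumed from this PR's discussion, not quoted from it):

```scala
// Keep these members on Strlen itself rather than a shared base class,
// since Strlen is the only expression that uses them.
case class Strlen(child: Expression) extends UnaryExpression {
  type EvaluatedType = Any
  override def foldable = child.foldable
  def nullable = child.nullable
  override def eval(input: Row): EvaluatedType = ??? // body as reviewed later in this thread
}
```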
test this please
```scala
Substring(nodeToExpr(string), nodeToExpr(pos), nodeToExpr(length))
case Token("TOK_FUNCTION", Token(LENGTH(), Nil) :: string :: Nil) =>
  Length(nodeToExpr(string))
// case Token("TOK_FUNCTION", Token(STRLEN(), Nil) :: string :: Nil) =>
```
remove
Thanks for doing this! A few minor comments.
QA tests have started for PR 1586. This patch merges cleanly.
QA results for PR 1586:
BTW, there are also a few style errors. You can find them locally by running `sbt scalastyle`.
Thanks for the review, Michael! I agree with / will apply all of your comments and will re-run `sbt scalastyle`. Question: from https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17181/consoleFull there is a message in the Jenkins output saying that unit tests failed, but I cannot find any information on which tests failed. (I had run and re-run the sql/core and sql/catalyst tests before submitting the PR and they were passing.)
It doesn't get to unit tests if the style check fails.
After a fair bit of struggling with testing inconsistencies, maven, and git, I have the updates in place. Please take a look whenever you have a chance - no rush ;)
test this please
QA tests have started for PR 1586. This patch merges cleanly.
QA results for PR 1586:
The updated code got caught by one of the cases in the Hive compatibility suite. The Hive UDF length calculation appears to differ from the newly implemented one, presumably due to differences in handling of character encoding. For the fix: I will make the length() function use the same character encoding as Hive does, to keep it compatible. The strlen() method will be the "outlet" permitting flexible handling of multi-byte character sets in the general RDD (no strlen method is defined in Hive proper). I am going to roll back just the Hive portion of the commit, and will report back end of evening. The failing test: udf_length *** FAILED ***
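For context, a minimal standalone illustration (not from the PR) of how character encoding changes what "length" means; the sample string is hypothetical:

```scala
val s = "héllo"                           // 'é' is one character but two bytes in UTF-8
println(s.length)                         // 5 - UTF-16 code units, what String.length counts
println(s.getBytes("UTF-8").length)       // 6 - byte length under UTF-8
println(s.getBytes("ISO-8859-1").length)  // 5 - byte length under Latin-1
```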
```scala
override def eval(input: Row): EvaluatedType = {
  val string = child.eval(input)
  if (string == null) {
    null.asInstanceOf[DataType]
```
Hi, I think `asInstanceOf[DataType]` is not needed.
Hi Ueshin, I will try it the way you suggest here.
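A minimal sketch of the simplified form being suggested, in the same shape as the fragment above (assumption: `EvaluatedType` is `Any`-compatible here, so the bare `null` needs no cast):

```scala
override def eval(input: Row): EvaluatedType = {
  val string = child.eval(input)
  // null propagates as-is; no asInstanceOf[DataType] required
  if (string == null) null else string.asInstanceOf[String].length
}
```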
That's a very useful feature, getting the string length for different character sets. Since most of the code is quite similar between `length` and `strlen`…
@chenghao-intel Let us keep both length and strlen: they serve different purposes. The length operation may be applied to any datatype. The tests show examples such as `select max(length(s)) from testData`.
All tests passing with `mvn -Pyarn -Phadoop-2.3 -Phive test`:
[INFO] ------------------------------------------------------------------------
test this please
QA tests have started for PR 1586. This patch merges cleanly.
```scala
if (string == null) {
  null
} else if (!string.isInstanceOf[String]) {
  throw new IllegalArgumentException(s"Non-string value [$string] provided to strlen")
```
@marmbrus do you think it's more reasonable to put the children data type checking into `resolved`? I am not sure if we do the right thing for the existing expressions.
@javadba I think you probably also need to update the rules in `HiveTypeCoercion`, which will insert a `Cast` expression if the child expression types are not satisfied; then you won't need the children data type checking here.
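A hedged sketch of the kind of coercion rule being suggested; the rule object name is hypothetical, and `Strlen` is this PR's expression (imports per the catalyst package layout of that era):

```scala
import org.apache.spark.sql.catalyst.expressions.Cast
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.catalyst.types.StringType

// Wrap Strlen's child in a Cast to StringType, so eval never sees a
// non-string value and the runtime type check becomes unnecessary.
object StrlenCoercion extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Strlen(child) if child.dataType != StringType =>
      Strlen(Cast(child, StringType))
  }
}
```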
I have narrowed the problem down to the SQLParser. I will update when the precise cause is determined, likely within the hour.
Surprising result here: the following change makes this work. Fails with a StackOverflowError:

Works fine:

Also works fine:

Let's double check and make sure this really repeatably fails on "OCTET_LENGTH": and yes, it does fail again with OCTET_LENGTH. We have a clear test failure scenario. So here we are: we need to use OCTET_LEN/CHAR_LEN and NOT OCTET_LENGTH/CHAR_LENGTH, until the root cause of this unrelated parser bug is found! Should I open a separate JIRA for the parser bug? BTW, my theory is that something is happening when one KEYWORD contains another KEYWORD. But OTOH the LEN keyword is not causing an issue, so this is a subtle case to understand.
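To illustrate the keyword-inside-keyword theory, here is a standalone toy with Scala parser combinators (this is only an illustration of ordered alternation, not a confirmed root cause of the StackOverflowError):

```scala
import scala.util.parsing.combinator.RegexParsers

object KeywordDemo extends RegexParsers {
  // Alternation is ordered: trying "LEN" first consumes only the prefix
  // of the input "LENGTH", leaving "GTH" unparsed.
  val shortFirst: Parser[String] = "LEN" | "LENGTH"
  val longFirst: Parser[String]  = "LENGTH" | "LEN"

  def main(args: Array[String]): Unit = {
    println(parseAll(shortFirst, "LENGTH")) // failure: "GTH" not consumed
    println(parseAll(longFirst, "LENGTH"))  // parsed: LENGTH
  }
}
```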
…isting spark ParserCombinator bug.
The rename change was committed/pushed and the most germane tests pass. I am re-running the full regression. One thing I have noticed already: the flume-sink external project is failing, which looks unrelated to any of my work, but I am looking into it.
Hi, here is the test that failed, and then the overall results:

So I am not presently in a position to run the regression tests, given the overall runtime would be double-digit hours. Would someone please run Jenkins on this code?
test this please
QA tests have started for PR 1586. This patch merges cleanly.
QA results for PR 1586:
@ueshin I repeatably verified that simply changing "OCTET_LEN" to "OCTET_LENGTH" ended up causing the SOF. By "repeatably" I mean I have been able to demonstrate this multiple times. Now the regression tests have been run against the modified and reliable code. Please re-run your tests in a fresh area. I will do the same, but I am hesitant to revert, because we have positive test results now with the latest commit (as well as my results of the problem before the commit).
I have git clone'd to a completely new area, and I reverted my last commit.
I get precisely the same error:
Now, let's revert the revert:

Now (with the latest commit re-applied) those three test suites pass again (specifically, HiveQuerySuite did not fail). And, just to be extra sure that we can toggle between pass/fail an arbitrary number of times: once again (with the last commit reverted) the HiveQuerySuite fails with the same error. OK, let's revert yet again: I did yet another revert, and yes, the tests PASS again. BTW, after each of these reverts I manually viewed the SparkSQL.scala file to ensure we were seeing the expected version of "OCTET_LENGTH" vs "OCTET_LEN". Specifically, "OCTET_LENGTH" always results in the SOF failure and "OCTET_LEN" always results in the HiveQLSuite passing. So I have clearly established the following, and the steps I showed above should be available for anyone to repeat and see for themselves.
@javadba Thanks for the detail.
@javadba, @marmbrus I'd like to know the result of the Jenkins build without the last commit 22eddbc.
Sorry for the delay. We are a little swamped with the 1.1 release. I will trigger Jenkins. If we are still having issues with the SQL parser, it's probably okay to leave that out; we are hoping to overhaul that codepath in the near future anyway. Also, we are going to need to block this PR on https://issues.apache.org/jira/browse/SPARK-2863
add to whitelist
QA tests have started for PR 1586. This patch merges cleanly.
@marmbrus I am fine with delays on this; I was just unclear as to whether there was some expectation of action on my part. Overall this is a minor enhancement, but it has generated a non-negligible amount of interaction and effort on the part of yourself and the reviewers. I can understand if this were put on hold.
QA results for PR 1586:
Hey @javadba, thanks for all the work on this, especially the time spent figuring out the surprisingly complicated semantics for this function. Also, sorry for the delay with review/merging! I'd love to add this, but right now I'm concerned that the way we are adding UDFs is unsustainable. I've written up some thoughts on the right way to proceed in SPARK-4867. Since I'm trying really hard to keep the PR queue small (mostly to help avoid PRs that languish like this one has), I propose we close this issue for now and reopen once the UDF framework has been updated. I've linked your issue to that one so you or others can use this code as a starting point.
OK Michael thanks for the update.
Mind closing this manually? Our script seems to be missing it.
QA tests have started for PR 1586. This patch DID NOT merge cleanly!
QA results for PR 1586:
Test FAILed.
Syntactic, parsing, and operational support have been added for LEN(GTH) and OCTET_LEN functions.
Examples:

SQL:

```scala
import org.apache.spark.sql._

case class TestData(key: Int, value: String)

val sqlc = new SQLContext(sc)
import sqlc._

val testData: SchemaRDD = sqlc.sparkContext.parallelize(
  (1 to 100).map(i => TestData(i, i.toString)))
testData.registerAsTable("testData")

sqlc.sql("select length(key) as key_len from testData order by key_len desc limit 5").collect
// res12: Array[org.apache.spark.sql.Row] = Array([3], [2], [2], [2], [2])
```
HQL:

```scala
val hc = new org.apache.spark.sql.hive.HiveContext(sc)
import hc._ // brings hc.hql into scope

hql("select length(grp) from simplex").collect
// res14: Array[org.apache.spark.sql.Row] = Array([6], [6], [6], [6])
```
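For the byte-length variant, usage would mirror the examples above; the spelling OCTET_LEN (rather than OCTET_LENGTH) follows from the parser issue discussed earlier, and the exact result is not shown in this thread:

```scala
sqlc.sql("select OCTET_LEN(value) from testData limit 3").collect
```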
As far as codebase changes: they have been purposefully made similar to the ones made for adding SUBSTR(ING) from July 17:
SQLParser, Optimizer, Expression, stringOperations, and HiveQL were the main classes changed. The testing suites affected are ConstantFolding and ExpressionEvaluation.
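To make the shape of the change concrete, here is a hedged reconstruction of roughly what the `Length` expression could look like in stringOperations, pieced together from the fragments reviewed above (the actual PR code may differ, and era-specific `Expression` members such as `references` are elided):

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, Row, UnaryExpression}
import org.apache.spark.sql.catalyst.types.{DataType, IntegerType}

case class Length(child: Expression) extends UnaryExpression {
  type EvaluatedType = Any

  def dataType: DataType = IntegerType
  override def foldable = child.foldable
  def nullable = child.nullable
  override def toString = s"LENGTH($child)"

  override def eval(input: Row): Any = {
    val value = child.eval(input)
    // null in, null out; otherwise count the characters of the string form
    if (value == null) null else value.toString.length
  }
}
```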