-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-8186] [SPARK-8187] [SQL] datetime function: date_add, date_sub #6782
Conversation
Test build #34758 has finished for PR 6782 at commit
|
test("date_add") { | ||
checkEvaluation( | ||
DateAdd(Literal(Date.valueOf("2016-02-28")), Literal(1)), | ||
DateUtils.fromJavaDate(Date.valueOf("2016-02-29")), create_row(null)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Could we just pass an Row.empty
instead of constructing a new one? Could that just be a default argument to checkEvaluation
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it will be a good idea.
Test build #35068 has finished for PR 6782 at commit
|
Test build #35072 has finished for PR 6782 at commit
|
@@ -946,6 +946,38 @@ object functions { | |||
def cosh(columnName: String): Column = cosh(Column(columnName)) | |||
|
|||
/** | |||
* Adds a number of days to startDate, given values for startDate and days | |||
* | |||
* @group datetime_funcs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add datetime_funcs
in the beginning of the functions.scala
also.
LGTM except some minor issues. |
Test build #35091 has finished for PR 6782 at commit
|
cc @rxin |
import org.apache.spark.sql.types._ | ||
import org.apache.spark.unsafe.types.UTF8String | ||
|
||
case class DateAdd(startDate: Expression, days: Expression) extends Expression { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it'd be helpful to define the semantics here (basically what eval is doing)
Can you add the code gen version? Otherwise these expressions are pretty slow with so many branches ... |
Yes,we need to move out the dynamic dataType comparison from the |
Test build #35113 has finished for PR 6782 at commit
|
} | ||
} | ||
|
||
override def dataType: DataType = StringType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this return a string instead of a date?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hive doc says it returns a stirng
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we follow hive so strictly? It makes more sense for me to have a DateType
return type, so that we can apply date time functions on the result without casting to String and back again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually hive support these functions before there are date type support. to be compatible with older version(older than 0.12.0, where date type are first supported) and applications based on those, keeping the same interface is helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general though we cast back to a string whenever you need to. From an efficiency stand point it seems much better to keep it a date.
/cc @rxin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for DateType, unless we committed to 100% compatible with Hive.
@cloud-fan can you add this to your review queue? |
TypeCheckResult.TypeCheckFailure( | ||
s"type of days expression in DateAdd should be int, not ${days.dataType}.") | ||
} else { | ||
TypeCheckResult.TypeCheckSuccess |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should only allow DateType
and NullType
, and add rules in HiveTypeCoercion
to cast other supported types to DateType
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like that way, as we will need to add a rule to cast both timestamp type string and date type string intto DateType. That's very intuitive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After #6759 gets in, I probably will refract DateUtils to offer some more direct interfaces, but let's skip it now to avoid huge conflicts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agreed with @cloud-fan, that also helps to simplify the implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To do cast in HiveTypeCoercion for about 30 different Datetime Functions? That would be nasty to maintain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adrian-wang Thanks for the confirm. From the HIVE test [1], I see date_add works with tinyint, smallint, int, but there is no case for "2010-01-01 01:02:03", is it not tested side behavior?
[1] https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/udf_date_add.q
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davies you can find more test in
https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/date_udf.q
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for string type of time, the tests are in
https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/timestamp_udf.q
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I checked MySQL, it can cast '2012-01-01 12:21:12' to Date:
mysql> select cast('2012-01-01 12:21:12' as date);
+-------------------------------------+
| cast('2012-01-01 12:21:12' as date) |
+-------------------------------------+
| 2012-01-01 |
+-------------------------------------+
1 row in set (0.00 sec)
This sounds reasonable to me, but not sure making it different than Hive will break real workload or not.
Because of all kinds of wired behavior in Hive, we cannot have 100% compatibility with Hive already, see many hive tests are skipped. @marmbrus, what do yo think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davies agree that 100% compatibility is not a goal and mysql's behavior seems reasonable here. It would be good though to more holistically design our data handling API and make sure we aren't missing anything. For example, today we can handle queries like: WHERE timestamp > "2014-01"
but only because we convert the timestamp to a string and the strings compare lexicographically as you would hope.
However, this feels like a hack to me and it would be much more efficient if we could avoid creating strings for every date and instead could covert "2014-01" into an Int
internally. How does mysql handle partial dates like this?
val resultDays = startDate.dataType match { | ||
case StringType => | ||
DateUtils.millisToDays(DateUtils.stringToTime( | ||
start.asInstanceOf[UTF8String].toString).getTime) + offset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Cast
we using DateUtils.fromJavaDate(Date.valueOf(s.toString))
to cast string to date. Which way is more efficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see, you handle both date and timestamp string here.
|
|
||
override def dataType: DataType = StringType | ||
|
||
override def toString: String = s"DateAdd($startDate, $days)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove this as the case class will handle that pretty well.
Can you remove the WIP from the title? |
TypeCheckResult.TypeCheckFailure( | ||
s"type of startdate expression in DateAdd should be string/timestamp/date," + | ||
s" not ${startDate.dataType}") | ||
} else if (days.dataType != IntegerType && days.dataType != NullType) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we also support integral types here for days
?
For example, we always infer integer from Python and Json as LongType.
Test build #35646 has finished for PR 6782 at commit
|
Test build #35655 has finished for PR 6782 at commit
|
@adrian-wang I thought about this more -- I think we need to go back to the drawing board to specify the behavior of various date functions, return types, before we can merge these expressions. Some open questions include:
I will make sure the design happens next week. Meantime, can you close the expressions related to dates? We can reopen them once we flush out the design details. |
OK, I'll close these for now. Could you assign them all to me in case duplicated work here? |
No description provided.