
[SPARK-8186] [SPARK-8187] [SQL] datetime function: date_add, date_sub #6782

Closed
wants to merge 7 commits into from

Conversation

adrian-wang
Contributor

No description provided.

@SparkQA

SparkQA commented Jun 12, 2015

Test build #34758 has finished for PR 6782 at commit 90b5d2e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class DateAdd(left: Expression, right: Expression)
    • case class DateSub(left: Expression, right: Expression)
    • case class DateDiff(left: Expression, right: Expression)

@adrian-wang adrian-wang changed the title [SPARK-8185] [SPARK-8186] [SPARK-8187] [SQL] datetime function: date_add, date_sub, datediff [SPARK-8185] [SPARK-8186] [SPARK-8187] [SQL] [WIP] datetime function: date_add, date_sub, datediff Jun 12, 2015
test("date_add") {
checkEvaluation(
DateAdd(Literal(Date.valueOf("2016-02-28")), Literal(1)),
DateUtils.fromJavaDate(Date.valueOf("2016-02-29")), create_row(null))
Contributor

Nit: Could we just pass a Row.empty instead of constructing a new one? Could that just be a default argument to checkEvaluation?

Contributor Author

I think it will be a good idea.
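For illustration, a hedged sketch of what the suggested default argument would change at the call site; the parameter name inputRow and the default EmptyRow are assumptions, not a confirmed signature:

// Current pattern: the caller constructs an input row explicitly.
checkEvaluation(
  DateAdd(Literal(Date.valueOf("2016-02-28")), Literal(1)),
  DateUtils.fromJavaDate(Date.valueOf("2016-02-29")), create_row(null))

// With something like `inputRow: Row = EmptyRow` as a default argument on
// checkEvaluation, the last argument could simply be dropped:
checkEvaluation(
  DateAdd(Literal(Date.valueOf("2016-02-28")), Literal(1)),
  DateUtils.fromJavaDate(Date.valueOf("2016-02-29")))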

@adrian-wang adrian-wang changed the title [SPARK-8185] [SPARK-8186] [SPARK-8187] [SQL] [WIP] datetime function: date_add, date_sub, datediff [SPARK-8186] [SPARK-8187] [SQL] [WIP] datetime function: date_add, date_sub Jun 18, 2015
@adrian-wang
Contributor Author

@chenghao-intel

@SparkQA

SparkQA commented Jun 18, 2015

Test build #35068 has finished for PR 6782 at commit c0703ea.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class DateAdd(startDate: Expression, days: Expression) extends Expression
    • case class DateSub(startDate: Expression, days: Expression) extends Expression

@SparkQA

SparkQA commented Jun 18, 2015

Test build #35072 has finished for PR 6782 at commit a5522e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class DateAdd(startDate: Expression, days: Expression) extends Expression
    • case class DateSub(startDate: Expression, days: Expression) extends Expression

@@ -946,6 +946,38 @@ object functions {
  def cosh(columnName: String): Column = cosh(Column(columnName))

  /**
   * Adds a number of days to startDate, given values for startDate and days
   *
   * @group datetime_funcs
Contributor

Add the datetime_funcs group at the beginning of functions.scala as well.
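For reference, a functions.scala entry carrying the @group datetime_funcs tag might look roughly like this; the method name dateAdd and its exact signature are illustrative, not necessarily what the PR adds:

  /**
   * Adds a number of days to startDate.
   *
   * @group datetime_funcs
   */
  def dateAdd(startDate: Column, days: Column): Column =
    Column(DateAdd(startDate.expr, days.expr))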

@chenghao-intel
Contributor

LGTM except for some minor issues.

@SparkQA

SparkQA commented Jun 18, 2015

Test build #35091 has finished for PR 6782 at commit 81628b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class DateAdd(startDate: Expression, days: Expression) extends Expression
    • case class DateSub(startDate: Expression, days: Expression) extends Expression

@adrian-wang
Contributor Author

cc @rxin
Can you review this? This is the first PR for the datetime functions and it creates several new files upstream, so all the other datetime functions depend on these files being created.

import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String

case class DateAdd(startDate: Expression, days: Expression) extends Expression {
Contributor

I think it'd be helpful to define the semantics here (basically what eval is doing).
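For instance, a doc comment along these lines would capture what eval does (a sketch based on this PR's discussion, not the merged wording):

/**
 * Returns the date that is `days` days after `startDate`. `startDate` accepts a
 * date, a timestamp, or a date/timestamp-formatted string; `days` is an integer.
 * Null inputs are expected to produce a null result.
 */
case class DateAdd(startDate: Expression, days: Expression) extends Expression {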

@rxin
Contributor

rxin commented Jun 18, 2015

Can you add the codegen version? Otherwise these expressions will be pretty slow with so many branches...

@chenghao-intel
Contributor

Yes, we need to move the dynamic dataType comparison out of eval.
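One way to read that suggestion, as a sketch only (not the PR's code): resolve the conversion once from startDate.dataType, so eval no longer pattern-matches on the type for every row.

  // Resolved once per expression instance instead of on every eval call.
  private lazy val toDays: Any => Int = startDate.dataType match {
    case StringType =>
      s => DateUtils.millisToDays(
        DateUtils.stringToTime(s.asInstanceOf[UTF8String].toString).getTime)
    case DateType =>
      d => d.asInstanceOf[Int]
    // ... remaining supported input types elided
  }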

@SparkQA

SparkQA commented Jun 18, 2015

Test build #35113 has finished for PR 6782 at commit eebf41d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class DateAdd(startDate: Expression, days: Expression) extends Expression
    • case class DateSub(startDate: Expression, days: Expression) extends Expression

    }
  }

  override def dataType: DataType = StringType
Contributor

Why does this return a string instead of a date?

Contributor Author

The Hive doc says it returns a string.

Contributor

Should we follow Hive so strictly? It makes more sense to me to have a DateType return type, so that we can apply datetime functions on the result without casting to String and back again.

Contributor Author

Actually, Hive supported these functions before it had date type support. To stay compatible with older versions (before 0.12.0, where the date type was first introduced) and the applications built on them, keeping the same interface is helpful.

Contributor

In general though, we can cast back to a string whenever needed. From an efficiency standpoint it seems much better to keep it a date.

/cc @rxin

Contributor

+1 for DateType, unless we are committed to being 100% compatible with Hive.
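A small illustration of the composition argument; df, dateAdd, and dateSub here are hypothetical placeholders, purely to show the idea:

// With a DateType result, nested calls keep operating on the compact date value:
df.select(dateAdd(dateSub(col("birthday"), lit(7)), lit(7)))
// With a StringType result, each outer call must first parse the formatted string
// back into a date before it can do arithmetic, then format it again.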

@adrian-wang
Contributor Author

ping @rxin @marmbrus

@rxin
Contributor

rxin commented Jun 19, 2015

@cloud-fan can you add this to your review queue?

      TypeCheckResult.TypeCheckFailure(
        s"type of days expression in DateAdd should be int, not ${days.dataType}.")
    } else {
      TypeCheckResult.TypeCheckSuccess
Contributor

I think we should only allow DateType and NullType, and add rules in HiveTypeCoercion to cast other supported types to DateType.
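Roughly the kind of rule being proposed (the rule name and placement are illustrative; this is not code from the PR):

object DateExpressionCoercion extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    // Implicitly cast any resolved non-date startDate argument to DateType.
    case DateAdd(startDate, days) if startDate.resolved && startDate.dataType != DateType =>
      DateAdd(Cast(startDate, DateType), days)
    case DateSub(startDate, days) if startDate.resolved && startDate.dataType != DateType =>
      DateSub(Cast(startDate, DateType), days)
  }
}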

Contributor Author

I don't like that way, as we would need to add a rule to cast both timestamp-style strings and date-style strings into DateType. That's not very intuitive.

Contributor Author

After #6759 gets in, I'll probably refactor DateUtils to offer some more direct interfaces, but let's skip that for now to avoid huge conflicts.

Contributor

I agree with @cloud-fan; that also helps simplify the implementations.

Contributor Author

Adding casts in HiveTypeCoercion for about 30 different datetime functions? That would be nasty to maintain.

Contributor

@adrian-wang Thanks for the confirmation. From the Hive test [1], I see date_add works with tinyint, smallint, and int, but there is no case for "2010-01-01 01:02:03"; is that untested side behavior?

[1] https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/udf_date_add.q

Contributor

I see. I checked MySQL; it can cast '2012-01-01 12:21:12' to Date:

mysql> select cast('2012-01-01 12:21:12' as date);
+-------------------------------------+
| cast('2012-01-01 12:21:12' as date) |
+-------------------------------------+
| 2012-01-01                          |
+-------------------------------------+
1 row in set (0.00 sec)

This sounds reasonable to me, but I'm not sure whether making it different from Hive will break real workloads.

Because of all kinds of weird behavior in Hive, we already cannot have 100% compatibility with Hive; see how many Hive tests are skipped. @marmbrus, what do you think?

Contributor

@davies I agree that 100% compatibility is not a goal, and MySQL's behavior seems reasonable here. It would be good, though, to design our date handling API more holistically and make sure we aren't missing anything. For example, today we can handle queries like WHERE timestamp > "2014-01", but only because we convert the timestamp to a string and the strings compare lexicographically as you would hope.

However, this feels like a hack to me, and it would be much more efficient if we could avoid creating strings for every date and instead convert "2014-01" into an Int internally. How does MySQL handle partial dates like this?
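For context on the internal side (a sketch, not a claim about how partial dates are actually handled): catalyst already represents a date as an Int of days since the Unix epoch, so a complete literal can be converted once up front, while a partial value like "2014-01" would first have to be completed to a full date.

// Illustrative only: converting a completed date string to the internal Int form.
val days: Int = DateUtils.millisToDays(
  DateUtils.stringToTime("2014-01-01").getTime)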

@adrian-wang adrian-wang changed the title [SPARK-8186] [SPARK-8187] [SQL] [WIP] datetime function: date_add, date_sub [SPARK-8186] [SPARK-8187] [SQL] datetime function: date_add, date_sub Jun 19, 2015
    val resultDays = startDate.dataType match {
      case StringType =>
        DateUtils.millisToDays(DateUtils.stringToTime(
          start.asInstanceOf[UTF8String].toString).getTime) + offset
Contributor

In Cast we use DateUtils.fromJavaDate(Date.valueOf(s.toString)) to cast a string to a date. Which way is more efficient?

Contributor

Ah, I see, you handle both date and timestamp strings here.

@cloud-fan
Contributor

DateAdd and DateSub are very similar, and logically we can transform one into the other just by changing days to -days. Can we combine them to reduce code size?
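The suggested transformation spelled out, as a sketch using catalyst's existing UnaryMinus (not code from the PR):

// date_sub(startDate, days) behaves like date_add(startDate, -days), so DateSub
// could be rewritten or implemented in terms of DateAdd:
DateAdd(startDate, UnaryMinus(days))  // equivalent to DateSub(startDate, days)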


  override def dataType: DataType = StringType

  override def toString: String = s"DateAdd($startDate, $days)"
Contributor

Let's remove this as the case class will handle that pretty well.

@chenghao-intel
Contributor

Can you remove the WIP from the title?

      TypeCheckResult.TypeCheckFailure(
        s"type of startdate expression in DateAdd should be string/timestamp/date," +
          s" not ${startDate.dataType}")
    } else if (days.dataType != IntegerType && days.dataType != NullType) {
Contributor

Could we also support other integral types here for days?

For example, we always infer integers from Python and JSON as LongType.
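One hedged reading of that suggestion, adjusting the type check shown above (illustrative only, not the PR's code):

    } else if (!Seq(ByteType, ShortType, IntegerType, LongType, NullType)
        .contains(days.dataType)) {
      TypeCheckResult.TypeCheckFailure(
        s"type of days expression in DateAdd should be an integral type," +
          s" not ${days.dataType}")
    }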

@SparkQA

SparkQA commented Jun 24, 2015

Test build #35646 has finished for PR 6782 at commit 0944f03.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class DateAdd(startDate: Expression, days: Expression) extends Expression
    • case class DateSub(startDate: Expression, days: Expression) extends Expression

@SparkQA

SparkQA commented Jun 24, 2015

Test build #35655 has finished for PR 6782 at commit 95b3b58.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class DateAdd(startDate: Expression, days: Expression) extends Expression
    • case class DateSub(startDate: Expression, days: Expression) extends Expression

@rxin
Contributor

rxin commented Jun 30, 2015

@adrian-wang I thought about this more -- I think we need to go back to the drawing board and specify the behavior and return types of the various date functions before we can merge these expressions.

Some open questions include:

  • what data type should these expressions accept?
  • what data type should these expressions return?
  • should we support intervals?
    ...

I will make sure the design happens next week. In the meantime, can you close the PRs related to date expressions? We can reopen them once we flesh out the design details.

@adrian-wang
Contributor Author

OK, I'll close these for now. Could you assign them all to me, in case of duplicated work here?
