[SPARK-24695][SQL] Move CalendarInterval to org.apache.spark.sql.types package
#25022
Conversation
## What changes were proposed in this pull request?

This change adds the capability to return a calendar interval from a UDF, by exposing `CalendarIntervalType` and `CalendarInterval`. Earlier, a UDF of type `(String => CalendarInterval)` threw an exception stating that the schema for type `org.apache.spark.unsafe.types.CalendarInterval` is not supported:

```
java.lang.UnsupportedOperationException: Schema for type org.apache.spark.unsafe.types.CalendarInterval is not supported
  at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$.apply(ScalaReflection.scala:781)
```

## How was this patch tested?

Added test cases in `ScalaReflectionSuite.scala` and `ExpressionEncoderSuite.scala`. Also tested by creating a UDF that returns a calendar interval.

JIRA entry for detail: https://issues.apache.org/jira/browse/SPARK-24695
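For illustration, a hypothetical sketch of the kind of UDF the description refers to (the UDF body is ours; it uses the two-argument `CalendarInterval(months, microseconds)` constructor shown elsewhere in this PR):

```scala
import org.apache.spark.sql.functions.udf
import org.apache.spark.unsafe.types.CalendarInterval

// Hypothetical: build an interval of `m` months. Before this change,
// deriving a schema for the CalendarInterval return type failed with
// "Schema for type ... CalendarInterval is not supported".
val monthsToInterval = udf((m: Int) => new CalendarInterval(m, 0L))
```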
HyukjinKwon
left a comment
`CalendarIntervalType` is currently not public (it happened to be partially public) and is not supposed to be exposed to users.
Thank you for the contribution, @priyankagargnitk. But I also agree with @HyukjinKwon's comment.
I realized that this is based on @hvanhovell's request at #21679.
sql/catalyst/src/test/java/org/apache/spark/sql/types/CalendarIntervalSuite.java
```java
@Test
public void equalsTest() {
  CalendarInterval i1 = new CalendarInterval(3, 123);
```
Also, all the indentation is broken during the move. We are using 2-space indentation in Java, too.
If you use `git mv`, this kind of issue will be avoided.
Okay, I rushed to read it. This PR rather targets to expose `CalendarInterval`. @priyankagargnitk, can you fix the PR title and description to explain what this PR fixes? We should fix the documentation as well, for instance, here. There have been multiple discussions about this. cc @gatorsmile, @rxin, @cloud-fan. To me, I don't mind exposing this.
This reverts commit d6decb2.
I'm OK exposing it, but we need to mark that this API is not stable and will be broken in the future, just like all the APIs exposing internal data formats.
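One possible way to mark it (a sketch, assuming Spark's existing `@Unstable` annotation from `org.apache.spark.annotation`; the simplified class body is illustrative):

```scala
import org.apache.spark.annotation.Unstable

// Exposed to users, but explicitly unstable: the representation may
// change in future releases, like other APIs exposing internal formats.
@Unstable
class CalendarInterval(val months: Int, val microseconds: Long)
```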
Agree with @rxin. We should also update the doc of the newly exposed type.
Hi, @priyankagargnitk.
| test("cast between string and interval") { | ||
| import org.apache.spark.unsafe.types.CalendarInterval | ||
| import org.apache.spark.sql.types.CalendarInterval |
This line is not needed. Let's remove this.
| test("SPARK-8753: add interval type") { | ||
| import org.apache.spark.unsafe.types.CalendarInterval | ||
| import org.apache.spark.sql.types.CalendarInterval |
ditto. (This line is not needed. Let's remove this.)
| test("SPARK-8945: add and subtract expressions for interval type") { | ||
| import org.apache.spark.unsafe.types.CalendarInterval | ||
| import org.apache.spark.unsafe.types.CalendarInterval.MICROS_PER_WEEK | ||
| import org.apache.spark.sql.types.CalendarInterval |
Line 1497 is not needed.
docs/sql-reference.md
```
absolute point in time.
- `DateType`: Represents values comprising values of fields year, month and day, without a
  time-zone.
* Calendar Interval type
```
Maybe one word, `CalendarInterval`, instead of `Calendar Interval`?
docs/sql-reference.md
```html
<td> <b>CalendarIntervalType</b> </td>
<td> org.apache.spark.sql.types.CalendarInterval </td>
<td>
CalendarIntervalType
```
Is this correct in this context? This section is for the `DataTypes` factory.
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/SpecializedGetters.java
```java
String sInLowerCase = s.trim().toLowerCase(Locale.ROOT);
String interval =
  sInLowerCase.startsWith("interval ") ? sInLowerCase : "interval " + sInLowerCase;
String interval = sInLowerCase.startsWith("interval ") ? sInLowerCase : "interval " + sInLowerCase;
```
Please don't include this kind of style change.
Test build #110992 has finished for PR 25022 at commit
@cloud-fan I think we could follow the SQL standard and define 2 separate types, maybe under a new flag. For both types, we need to support external types. The problem is that the same objections are applicable to the new interval types as to the proposed one.
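For reference, the two SQL standard interval flavors map roughly onto `java.time` like this (a sketch; the SQL literals in the comments use standard syntax, not Spark's):

```scala
import java.time.{Duration, Period}

// INTERVAL '1-2' YEAR TO MONTH: a "conceptual" year-month interval
val yearMonth: Period = Period.of(1, 2, 0)

// INTERVAL '3 04:05:06' DAY TO SECOND: a "concrete" day-time interval
val dayTime: Duration = Duration.ofDays(3).plusHours(4).plusMinutes(5).plusSeconds(6)
```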
@hvanhovell And do you know why? Read the comments of the `java.time` classes. If you have both in one type, like Spark has, the result of applying the interval depends on the order in which its components are applied, and that order is not defined for now. I don't even speak about the varying number of days in months, and of microseconds in days. Bringing such uncertainties to user apps seems questionable to me.
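A small illustration of the order-dependence point, using `java.time` directly (the dates are our own example):

```scala
import java.time.LocalDate

val d = LocalDate.of(2019, 1, 30)
// Applying the same "1 month + 1 day" interval in different component order:
d.plusMonths(1).plusDays(1)  // 2019-03-01 (Jan 30 -> Feb 28 -> Mar 1)
d.plusDays(1).plusMonths(1)  // 2019-02-28 (Jan 30 -> Jan 31 -> Feb 28)
```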
Seems the `java.time` types don't match this exactly. The following are my personal thoughts, which are debatable. I don't quite understand why the SQL standard defines 2 interval types. It complicates the type system, and it seems to me the only benefit is to make intervals comparable. I do agree that an interval should hold several fields, like months and seconds, as the number of days in a month can vary. I read the comments of the `java.time` classes. The parquet interval looks most reasonable to me. This is also what PostgreSQL does: https://www.postgresql.org/docs/11/datatype-datetime.html
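For context, Parquet's INTERVAL logical type stores exactly three little-endian unsigned 32-bit integers (months, days, milliseconds) in a 12-byte value. A minimal encoding sketch (the helper name is ours):

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Pack (months, days, millis) into Parquet's 12-byte INTERVAL layout.
def encodeParquetInterval(months: Int, days: Int, millis: Int): Array[Byte] =
  ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN)
    .putInt(months).putInt(days).putInt(millis).array()
```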
My proposal is:
I think it matches. Just provide zero days while converting to periods:
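(A minimal sketch of that conversion, assuming Spark's two-field `CalendarInterval(months, microseconds)` of the time; the helper names are illustrative:)

```scala
import java.time.{Duration, Period}
import java.time.temporal.ChronoUnit
import org.apache.spark.unsafe.types.CalendarInterval

// Months become a Period (with zero days), microseconds become a Duration.
def toPeriod(i: CalendarInterval): Period =
  Period.of(i.months / 12, i.months % 12, 0)

def toDuration(i: CalendarInterval): Duration =
  Duration.of(i.microseconds, ChronoUnit.MICROS)
```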
To be comparable is one reason. Another one (related) is that all fields of each standard interval type can be normalized to a single unit. I would like to highlight that.
I guess they wanted to make it more flexible. The type reflects what users want to express.
Many classes in Spark SQL are semi-public, e.g. all the classes in the `catalyst` package.
I don't think so. To reach the same result you have to look at the timestamp and calculate how many days to add to reach the next 2 months.
Similarly, in standard SQL it's hard to add a conceptual day, as the day-time interval is actually stored as seconds.
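A concrete example of the difference (a sketch using `java.time`; the Berlin DST transition on 2019-03-31 is just one convenient case):

```scala
import java.time.{Duration, Period, ZoneId, ZonedDateTime}

val t = ZonedDateTime.of(2019, 3, 30, 12, 0, 0, 0, ZoneId.of("Europe/Berlin"))
t.plus(Duration.ofDays(1)) // 2019-03-31T13:00+02:00: 24 concrete hours across the spring-forward
t.plus(Period.ofDays(1))   // 2019-03-31T12:00+02:00: 1 conceptual day, same wall-clock time
```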
That's why both `Period` and `Duration` exist:

```scala
scala> java.time.LocalDate.of(2019, 9, 19).plus(java.time.Period.of(-1, 2, 3))
res0: java.time.LocalDate = 2018-11-22
```

or if you need to add a concrete number of micros (nanoseconds), you can add this to a time instance:

```scala
scala> java.time.LocalDate.of(2019, 9, 19).atStartOfDay(java.time.ZoneOffset.UTC).plus(java.time.Duration.ofNanos(100000000000000L))
res7: java.time.ZonedDateTime = 2019-09-20T03:46:40Z
```

Both types solve different problems. Having Spark's
You are right, I just want to say that
It slightly contradicts Reynold's objections in #25678 regarding modifying user app code and libraries.
IIUC the problem of #25678 is that it's too much work to add a new data type, and we should only do it with a strong use case to justify it. The interval type is an existing data type, but it hasn't been completely exposed to end-users yet. It's not that much work to make it right and fully expose it. In general, I'm in favor of Java's design. Spark should support all datetime operations (e.g. date + interval, timestamp + interval, etc.) w.r.t. the session local timezone. A UDF can do the same thing by getting the session local timezone from SQLConf, but I don't think that's a common use case, as users should call Spark functions to do datetime operations. For example:
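(A minimal sketch of that SQLConf approach; `SQLConf.get.sessionLocalTimeZone` is Spark's internal API, while the UDF itself is our illustration:)

```scala
import java.time.{Period, ZoneId}
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.internal.SQLConf

// Resolve the session-local time zone the same way Spark's own datetime
// expressions do (read on the driver when the UDF is defined).
val zone = ZoneId.of(SQLConf.get.sessionLocalTimeZone)

// Add one conceptual day to a timestamp, interpreted in that zone.
val plusOneDay = udf((ts: java.sql.Timestamp) =>
  java.sql.Timestamp.from(ts.toInstant.atZone(zone).plus(Period.ofDays(1)).toInstant))
```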
I think it's better to have a single interval type:
The only disadvantage is we can't sort intervals, but I don't think that matters.
@cloud-fan Sorry, this design choice is not clear to me. As you describe above, an interval type can be "conceptual". The first type can store date-time components in a compacted way, but you propose to store days separately. And what is the third component for?
Both the Java types are relevant here. For a conceptual interval, you are right that logically it should be a combination of all the datetime fields (years, months, days, hours, ...). But since 1 year is always 12 months, and 1 hour is always 60 minutes, physically we only need to store months, days and seconds (at some precision; I'd like to pick millisecond). For a concrete interval, it only needs seconds. So overall, we need 3 fields to store an interval value: months, days and seconds. This is also what pgsql does.
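(A sketch of that 3-field layout; the class and field names are hypothetical, not Spark's actual `CalendarInterval`, which at this point held only `months` and `microseconds`:)

```scala
// Hypothetical 3-field interval, following the pgsql/parquet-style layout
// discussed above: months for the conceptual year-month part, days for
// conceptual days, millis for the concrete day-time remainder.
case class Interval(months: Int, days: Int, millis: Long)

// 1 month, 10 days and 1.5 hours:
val i = Interval(months = 1, days = 10, millis = 90L * 60 * 1000)
```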
@MaxGekk what do you think about this proposal?
@cloud-fan I would consider only 2 options:
It seems nobody here likes the second option. So, I would do the first one instead of changing the semantics of the already exposed type.
What do you mean by "already exposed"? Unless this PR is merged, `CalendarInterval` is not exposed.
I mean that when a user collects an interval column, she/he gets `CalendarInterval` instances.
This is unexpected and undocumented behavior; I don't agree that it means the type is already exposed.
If we want to use intervals in operations that require ordering, it fails. Here is the error message.
@tobegit3hub Such a type must have the property of orderliness to support that, but `CalendarInterval` does not have it.
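(To illustrate why it is not orderable, a sketch with the two-field `CalendarInterval` of the time; the values are our own example:)

```scala
import org.apache.spark.unsafe.types.CalendarInterval

// Neither value is unambiguously larger: a month is 28..31 days,
// so "1 month" vs "30 days" has no fixed ordering.
val oneMonth   = new CalendarInterval(1, 0L)
val thirtyDays = new CalendarInterval(0, 30L * 24 * 60 * 60 * 1000 * 1000)
```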
…valUtils

### What changes were proposed in this pull request?
In the PR, I propose to move all static methods from the `CalendarInterval` class to the `IntervalUtils` object. All those methods are rewritten from Java to Scala.

### Why are the changes needed?
- For consistency with other helper methods. Such methods were placed in the helper object `IntervalUtils`, see #26190
- Taking into account that `CalendarInterval` will be fully exposed to users in the future (see #25022), it would be nice to clean it up by moving service methods to an internal object.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- By moved tests from `CalendarIntervalSuite` to `IntervalUtilsSuite`
- By existing test suites

Closes #26261 from MaxGekk/refactoring-calendar-interval.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.