
Conversation

@priyankagargnitk

@priyankagargnitk priyankagargnitk commented Jul 1, 2019

What changes were proposed in this pull request?

This change adds the capability to return the calendar interval type from a UDF by exposing CalendarIntervalType and CalendarInterval.

Previously, a UDF of type (String => CalendarInterval) threw an exception:

Schema for type org.apache.spark.unsafe.types.CalendarInterval is not supported
java.lang.UnsupportedOperationException: Schema for type org.apache.spark.unsafe.types.CalendarInterval is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun.apply(ScalaReflection.scala:781)
...

How was this patch tested?

Added test cases in ScalaReflectionSuite.scala and ExpressionEncoderSuite.scala.
Also tested by creating a UDF that returns a calendar interval.

JIRA entry for details: https://issues.apache.org/jira/browse/SPARK-24695
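
For reference, a minimal sketch of the kind of UDF this change enables (the session setup and names here are illustrative, not part of the patch):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, udf}
import org.apache.spark.unsafe.types.CalendarInterval

// Before this patch, inferring the schema of the return type threw the
// UnsupportedOperationException quoted above; with it, the return type
// maps to CalendarIntervalType.
object IntervalUdfExample extends App {
  val spark = SparkSession.builder().master("local[1]").getOrCreate()
  // CalendarInterval(months, microseconds), matching the 2-argument
  // constructor used elsewhere in this thread.
  val makeInterval = udf((months: Int) => new CalendarInterval(months, 0L))
  spark.range(1).select(makeInterval(lit(3))).show()
}
```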

Member

@HyukjinKwon HyukjinKwon left a comment


CalendarIntervalType is currently not public (it happened to be partially public) and not supposed to be exposed to users.

@dongjoon-hyun
Member

Thank you for the contribution, @priyankagargnitk . But, I also agree with @HyukjinKwon 's comment.

@dongjoon-hyun
Member

dongjoon-hyun commented Jul 4, 2019

I realized that this is based on @hvanhovell 's request at #21679.
Hi, @hvanhovell . Could you review this PR? (Also, cc @gatorsmile , @cloud-fan )


@Test
public void equalsTest() {
CalendarInterval i1 = new CalendarInterval(3, 123);
Member


Also, all the indentation was broken during the move. We use 2-space indentation in Java, too.
If you use git mv, this kind of issue will be avoided.

@HyukjinKwon
Member

Okay, I read it too hastily. This PR targets exposing CalendarIntervalType rather than using it somewhere else.

@priyankagargnitk Can you fix the PR title and description to explain what this PR fixes? We should fix the documentation as well, for instance, here.

There have been multiple discussions about this. cc @gatorsmile, @rxin, @cloud-fan. Personally, I don't mind exposing this.

@priyankagargnitk priyankagargnitk changed the title [SPARK-24695][SQL]: To add support to return Calendar interval from udf. [SPARK-24695][SQL]: To expose Calendar interval type, so that same can be returned from the UDF. Jul 5, 2019
@rxin
Contributor

rxin commented Jul 5, 2019

I'm OK exposing it, but we need to mark that this API is not stable and will be broken in the future, just like all the APIs exposing internal data formats.

@cloud-fan
Contributor

Agree with @rxin. We should also update the doc of Row#get to mention the external class of the interval type.

@dongjoon-hyun
Member

Hi, @priyankagargnitk .
Could you rebase and update this PR according to the advice above?


test("cast between string and interval") {
import org.apache.spark.unsafe.types.CalendarInterval
import org.apache.spark.sql.types.CalendarInterval
Member


This line is not needed. Let's remove this.


test("SPARK-8753: add interval type") {
import org.apache.spark.unsafe.types.CalendarInterval
import org.apache.spark.sql.types.CalendarInterval
Member


ditto. (This line is not needed. Let's remove this.)

test("SPARK-8945: add and subtract expressions for interval type") {
import org.apache.spark.unsafe.types.CalendarInterval
import org.apache.spark.unsafe.types.CalendarInterval.MICROS_PER_WEEK
import org.apache.spark.sql.types.CalendarInterval
Member


Line 1497 is not needed.

absolute point in time.
- `DateType`: Represents values comprising values of fields year, month and day, without a
time-zone.
* Calendar Interval type
Member

@dongjoon-hyun dongjoon-hyun Jul 10, 2019


Maybe, one word, CalendarInterval instead of Calendar Interval?

<td> <b>CalendarIntervalType</b> </td>
<td> org.apache.spark.sql.types.CalendarInterval </td>
<td>
CalendarIntervalType
Member


Is this correct in this context? This section is for the DataTypes factory.

String sInLowerCase = s.trim().toLowerCase(Locale.ROOT);
String interval =
sInLowerCase.startsWith("interval ") ? sInLowerCase : "interval " + sInLowerCase;
String interval = sInLowerCase.startsWith("interval ") ? sInLowerCase : "interval " + sInLowerCase;
Member


Please don't include this kind of style change.

@SparkQA

SparkQA commented Sep 19, 2019

Test build #110992 has finished for PR 25022 at commit e00c0dc.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member

MaxGekk commented Sep 19, 2019

do you have an estimation about how hard it is?

@cloud-fan I think we could follow the SQL standard and define 2 separate types, maybe under the new flag spark.sql.ansi.enabled introduced by #25693:

For both types, we need to support the external types java.time.Duration/java.time.Period (3 man-days each, like it took me for TimeType in #25678). Additionally, we need to support literals (can be taken from the existing code for CalendarInterval) (1-2 days), UDFs (1-2 days), and the ops defined by the SQL standard (5-10 days):
(Screenshots of the SQL standard's tables of interval operations.)
Plus, we need to support the new types in existing expressions (> 10 days).
I could create sub-tasks for that under https://issues.apache.org/jira/browse/SPARK-27790 if you don't mind.

The problem is that the same objections apply to the new interval types as to the proposed TIME type: #25678 (comment)
I cannot point to potential users of the new types beyond the popularity of other systems that follow the SQL standard and support those types.

@MaxGekk
Copy link
Member

MaxGekk commented Sep 19, 2019

Spark's interval can combine microseconds and months, whereas the java ones can't.

@hvanhovell And do you know why? Read the comment of java.time.Period: "...Durations and periods differ in their treatment of daylight savings time when added to ZonedDateTime... For example, consider adding a period of one day and a duration of one day to 18:00 on the evening before a daylight savings gap. The Period will add the conceptual day and result in a ZonedDateTime at 18:00 the following day. By contrast, the Duration will add exactly 24 hours, resulting in a ZonedDateTime at 19:00 the following day (assuming a one hour DST gap)."
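
A small sketch of that Javadoc scenario (the time zone and dates are chosen here purely for illustration):

```scala
import java.time.{Duration, Period, ZoneId, ZonedDateTime}

// 18:00 on the evening before the US DST gap of 2019-03-10.
val zone = ZoneId.of("America/Los_Angeles")
val eve = ZonedDateTime.of(2019, 3, 9, 18, 0, 0, 0, zone)

// The conceptual day keeps the wall clock: 2019-03-10T18:00-07:00.
val viaPeriod = eve.plus(Period.ofDays(1))
// Exactly 24 hours lands one hour later: 2019-03-10T19:00-07:00.
val viaDuration = eve.plus(Duration.ofHours(24))
println(s"$viaPeriod vs $viaDuration")
```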

If you have both in one type, as Spark does, the result of applying the interval depends on the order in which its components are applied, and that order is currently undefined. And I am not even speaking about the varying number of days in months, or the fact that the microseconds field of CalendarInterval can contain more micros than 31 days * MICROS_PER_DAY.

Bringing such uncertainties into user apps only because CalendarInterval has already been partially exposed doesn't make sense to me.

@cloud-fan
Contributor

It seems Java's Period (which defines years, months and days) doesn't match the SQL year-month interval either. The Parquet one defines months, days and milliseconds, which is also different from the SQL standard.

The following are my personal thoughts, which are debatable.

I don't quite understand why the SQL standard defines 2 interval types. It complicates the type system, and it seems to me the only benefit is to make intervals comparable.

I do agree that an interval should hold several fields, like months and seconds, as the number of days in a month can vary. I read the comments of Java's Period; they convince me that 1 day doesn't always mean 24 hours when daylight saving happens. But I don't quite understand why Java's Period defines both years and months. AFAIK 1 year is always 12 months.

The parquet interval looks most reasonable to me. This is also what PostgreSQL does: https://www.postgresql.org/docs/11/datatype-datetime.html

Internally interval values are stored as months, days, and seconds. This is done because the number of days in a month varies, and a day can have 23 or 25 hours if a daylight savings time adjustment is involved. The months and days fields are integers while the seconds field can store fractions.

The proposal from me is:

  1. keep interval type as it is
  2. update CalendarInterval to follow parquet/pgsql, and expose it.

@MaxGekk
Member

MaxGekk commented Sep 19, 2019

It seems Java's Period (which defines years, months and days) doesn't match the SQL year-month interval either.

I think it matches. Just provide zero days while converting to periods: java.time.Period.of(years, months, 0)

I don't quite understand why the SQL standard defines 2 interval types. It complicates the type system and seems to me the only benefit is to make interval comparable.

To be comparable is one reason. Another (related) one is that all fields of the day-time interval type have a constant number of sub-units (1 day = 24 hours, 1 hour = 60 minutes, etc.). The same holds for the year-month interval type. The total number of microseconds (nanoseconds) in a value of the day-time interval is fixed, and does not depend on the time zone or on the number of days in the month of some timestamp.

I would like to highlight that the day-time interval has an unconstrained number of days, so you can represent any interval with this type. I think the year-month interval type was added for convenience. If you have a local timestamp like 2019-09-19 10:00:00 and want to add 2 months to it, a year-month interval would be easier to use, but you can reach the same result with a day-time interval.

But I don't quite understand why java Period defines both years and months. AFAIK 1 year is always 12 months.

I guess they wanted to make it more flexible. The type reflects what users want to express, like 6 years, -9 months and 12 days. Note that Period supports negative years, months and days.

The proposal from me is:
update CalendarInterval to follow parquet/pgsql, and expose it.

CalendarInterval has already been partially exposed. If we change this type, won't that break user apps?

@cloud-fan
Contributor

CalendarInterval has already been partially exposed

Many classes in Spark SQL are semi-public, e.g. all the classes in the catalyst and execution packages. Users can use them, but there is no backward compatibility guarantee from Spark. The unsafe package is also private, and we can change CalendarInterval in a new release, even moving it to a different package.

@cloud-fan
Contributor

If you have a local timestamp like 2019-09-19 10:00:00 and want to add 2 months to it, a year-month interval would be easier to use, but you can reach the same result with a day-time interval.

I don't think so. To reach the same result you have to look at the timestamp and calculate how many days to add to reach the next 2 months.

@cloud-fan
Contributor

Similarly, in standard SQL it's hard to add a conceptual day, as the day-time interval is actually stored as seconds.

@MaxGekk
Member

MaxGekk commented Sep 19, 2019

Similarly, in standard SQL it's hard to add a conceptual day, as the day-time interval is actually stored as seconds.

That's why both java.time.Period and java.time.Duration exist simultaneously. You can update the year, month and day fields of a local timestamp field by field:

scala> java.time.LocalDate.of(2019, 9, 19).plus(java.time.Period.of(-1, 2, 3))
res0: java.time.LocalDate = 2018-11-22

or, if you need to add a concrete number of micros (nanoseconds), you can add it to a time instant:

scala> java.time.LocalDate.of(2019, 9, 19).atStartOfDay(java.time.ZoneOffset.UTC).plus(java.time.Duration.ofNanos(100000000000000L))
res7: java.time.ZonedDateTime = 2019-09-20T03:46:40Z

The two types solve different problems: Period allows updating logical time components, while Duration allows adding any concrete interval in micros.
That's why I propose converting Catalyst's interval type to either Period or Duration. These types open up the power of the standard and of other libraries.

With Spark's CalendarInterval, users face additional restrictions (the third one is illustrated in the sketch after this list):

  • They cannot compare instances of CalendarInterval; Duration is comparable.
  • They cannot update the year, month and day components of a LocalDate/LocalDateTime, because CalendarInterval contains micros and you have to apply it in a time zone.
  • They cannot add a CalendarInterval to an Instant, because CalendarInterval contains months; you have to convert the Instant to a ZonedDateTime before adding the CalendarInterval.
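
For instance, a minimal sketch of the third restriction:

```scala
import java.time.{Instant, Period, ZoneOffset}

// Adding months to an Instant directly is rejected at runtime:
// Instant.parse("2019-09-19T10:00:00Z").plus(Period.ofMonths(1))
// throws UnsupportedTemporalTypeException.

// Binding the Instant to a time zone first makes the addition well-defined.
val shifted = Instant.parse("2019-09-19T10:00:00Z")
  .atZone(ZoneOffset.UTC)
  .plus(Period.ofMonths(1))
  .toInstant // 2019-10-19T10:00:00Z
```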

I don't think so. To reach the same result you have to look at the timestamp and calculate how many days to add to reach the next 2 months.

You are right. I just want to say that the day-time interval is internally consistent and can express any interval. Why do we need months in CalendarInterval if its microseconds can represent the time interval between any two valid timestamps? It seems we are just wasting users' memory, and memory on the Spark side as well.

Users can use them, but there is no backward compatibility guarantee from Spark.

It slightly contradicts Reynold's objections in #25678 regarding modifying user app code and libraries.

@cloud-fan
Contributor

cloud-fan commented Sep 20, 2019

IIUC the problem with #25678 is that it's too much work to add a new data type. We should only do it with a strong use case to justify it.

The interval type is an existing data type, but it hasn't been completely exposed to end users yet. It's not that much work to make it right and fully expose it.

In general, I'm in favor of Java's design. Period is a conceptual interval while Duration is a concrete interval. It's more powerful than the SQL standard year-month interval + day-time interval as it supports conceptual days.

Spark should support all datetime operations (e.g. date + interval, timestamp + interval, etc.) w.r.t. the session-local timezone. A UDF can do the same thing by getting the session-local timezone from SQLConf, but I don't think that's a common use case, as users should call Spark functions to do datetime operations.

For example, timestamp + interval can be implemented by the following steps (sketched in code after the list):

  1. convert the internal long value to Instant
  2. convert Instant to ZonedDateTime
  3. extract a Period (months and days) from CalendarInterval
  4. add the Period to the ZonedDateTime
  5. convert the updated ZonedDateTime back to Instant
  6. extract Duration (seconds) from CalendarInterval
  7. add the Duration to the Instant
  8. convert the updated Instant back to a long value
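
A minimal sketch of those steps in terms of java.time, assuming the proposed (months, days, seconds) fields rather than the current CalendarInterval layout:

```scala
import java.time.{Duration, Instant, Period, ZoneId}
import java.time.temporal.ChronoUnit.MICROS

// `micros` stands for the internal long of a TimestampType value
// (microseconds since the epoch); `zone` is the session-local time zone.
def addInterval(micros: Long, months: Int, days: Int,
                seconds: Long, zone: ZoneId): Long = {
  // Steps 1-2: internal long -> Instant -> ZonedDateTime.
  val zoned = Instant.EPOCH.plus(micros, MICROS).atZone(zone)
  // Steps 3-5: add the conceptual part (months, days) w.r.t. the zone.
  val shifted = zoned.plus(Period.of(0, months, days)).toInstant
  // Steps 6-7: add the concrete part (seconds) to the Instant.
  val result = shifted.plus(Duration.ofSeconds(seconds))
  // Step 8: Instant -> internal long (microseconds since the epoch).
  MICROS.between(Instant.EPOCH, result)
}
```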

CalendarInterval should contain 3 ints for months, days and seconds. We can add some methods to CalendarInterval to extract a Duration and a Period, so that it's easier to use in UDFs.

I think it's better to have a single interval type:

  1. simplifies the type system
  2. supports conceptual days
  3. still compatible with SQL standard
  4. compatible with parquet

The only disadvantage is we can't sort intervals, but I don't think that matters.

@MaxGekk
Member

MaxGekk commented Sep 20, 2019

CalendarInterval should contain 3 ints for months, days and seconds.

@cloud-fan Sorry, this design choice is not clear to me. As you describe above, an interval type can be "conceptual" like Period or concrete like Duration. In the first case, we should store year, month, day, and I would continue with hour, minute, second, microsecond; that way you can change any local timestamp by its components. The second type stores the diff between any two valid timestamps; in our case, a long is enough to store the diff in microseconds.

The first type can store the date-time components in a compact way, months (int) + microseconds (long), because years can be calculated from months, and (days, hours, minutes, seconds) from microseconds. We can view Spark's CalendarInterval from this perspective.

But you propose to store days separately. Why (besides the fact that Parquet does the same)?

And the third component, seconds, confuses me as well. Why seconds and not microseconds? Do you want to lose precision when subtracting 2 timestamps?

@cloud-fan
Contributor

But you propose to store days separately. Why (besides the fact that Parquet does the same)?

Both the Java Period classdoc and the pgsql doc remind me that 1 day is not always equal to 24 hours in the context of intervals.

For the conceptual interval, you are right that logically it should be a combination of all the datetime fields (years, months, days, hours, ...). But since 1 year is always 12 months and 1 hour is always 60 minutes, physically we only need to store months, days and seconds (at some precision; I'd pick milliseconds).

For the concrete interval, we only need seconds.

So overall, we need 3 fields to store an interval value: months, days and seconds.

This is also what pgsql does.
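
A hypothetical sketch of that layout, including the Period/Duration extraction methods suggested above (the class name and millisecond precision are assumptions from this thread, not Spark's actual API):

```scala
import java.time.{Duration, Period}

// Conceptual part (months, days) plus concrete part (millis) in one value.
case class ProposedInterval(months: Int, days: Int, millis: Long) {
  // The conceptual part must be applied w.r.t. a time zone and calendar.
  def toPeriod: Period = Period.of(0, months, days)
  // The concrete part is a fixed physical amount of time.
  def toDuration: Duration = Duration.ofMillis(millis)
}
```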

@cloud-fan
Contributor

@MaxGekk what do you think about this proposal?

@MaxGekk
Member

MaxGekk commented Sep 23, 2019

@cloud-fan I would consider only 2 options:

  • Keep as is, and expose existing CalendarInterval in UDF, or
  • Follow SQL standard, and expose 2 types from Java standard library - Period and Duration.

It seems nobody here likes the second option, so I would do the first one instead of changing the semantics of the already-exposed CalendarInterval in any way.

@cloud-fan
Contributor

changing the semantics of the already-exposed CalendarInterval in any way.

What do you mean by "already exposed"? CalendarInterval is the internal data representation of interval type, like long for timestamp type and int for date type.

Unless this PR is merged, CalendarInterval is not exposed, just like UTF8String. What I'm against here is exposing the interval data representation with semantics that are Spark-specific, i.e. that do not follow the SQL standard/pgsql/parquet/...

@MaxGekk
Member

MaxGekk commented Sep 24, 2019

What do you mean by "already exposed"? CalendarInterval is the internal data representation of interval type, like long for timestamp type and int for date type.

I mean that when a user collects an interval column, they get a CalendarInterval:

scala> spark.sql("select interval '10' year").collect()(0).get(0).isInstanceOf[org.apache.spark.unsafe.types.CalendarInterval]
res9: Boolean = true

@cloud-fan
Contributor

This is unexpected and undocumented behavior; I don't agree that it means CalendarInterval is exposed. The authoritative rule for a class/interface to be public is that it must sit in a public package and have an annotation of its public level (stable, experimental, etc.).

@tobegit3hub

If we want to use RANGE BETWEEN with an interval, the column is required to be of type calendarinterval. I think we should expose the calendarinterval type, or accept other types like date, in the SQL support.

SELECT MAX(age) OVER w1 AS maxAge, MIN(age) OVER w1 AS minAge FROM t1 WINDOW w1 AS (PARTITION BY gender ORDER BY createTime RANGE BETWEEN INTERVAL 0 DAYS PRECEDING AND CURRENT ROW)

Here is the error message.

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '(PARTITION BY t1.`gender` ORDER BY t1.`createTime` ASC NULLS FIRST RANGE BETWEEN interval PRECEDING AND CURRENT ROW)' due to data type mismatch: The data type 'date' used in the order specification does not match the data type 'calendarinterval' which is used in the range frame.; line 1 pos 83;
'Project [max(age#10) windowspecdefinition(gender#9, createTime#11 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, -interval, currentrow$())) AS maxAge#18, min(age#10) windowspecdefinition(gender#9, createTime#11 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, -interval, currentrow$())) AS minAge#19]
+- SubqueryAlias t1
   +- LogicalRDD [graduated#6, s_id#7, name#8, gender#9, age#10, createTime#11], false

@MaxGekk
Member

MaxGekk commented Oct 24, 2019

If we want to use RANGE BETWEEN with an interval, the column is required to be of type calendarinterval.

@tobegit3hub Such a type must have an ordering to support that, but CalendarIntervalType does not.

cloud-fan pushed a commit that referenced this pull request Oct 29, 2019
…valUtils

### What changes were proposed in this pull request?
In the PR, I propose to move all static methods from the `CalendarInterval` class to the `IntervalUtils` object. All those methods are rewritten from Java to Scala.

### Why are the changes needed?
- For consistency with other helper methods. Such methods were placed in the helper object `IntervalUtils`; see #26190
- Taking into account that `CalendarInterval` will be fully exposed to users in the future (see #25022), it would be nice to clean it up by moving service methods to an internal object.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- By tests moved from `CalendarIntervalSuite` to `IntervalUtilsSuite`
- By existing test suites

Closes #26261 from MaxGekk/refactoring-calendar-interval.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@github-actions

github-actions bot commented Feb 2, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Feb 2, 2020
@github-actions github-actions bot closed this Feb 3, 2020