Skip to content

[SPARK-26002][SQL] Fix day of year calculation for Julian calendar days #23000

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

attilapiros
Copy link
Contributor

What changes were proposed in this pull request?

Fixing leap year calculations for date operators (year/month/dayOfYear) where the Julian calendars are used (before 1582-10-04). In a Julian calendar every years which are multiples of 4 are leap years (there is no extra exception for years multiples of 100).

How was this patch tested?

With a unit test ("SPARK-26002: correct day of year calculations for Julian calendar years") which focuses to these corner cases.

Manually:

scala> sql("select year('1500-01-01')").show()

+------------------------------+
|year(CAST(1500-01-01 AS DATE))|
+------------------------------+
|                          1500|
+------------------------------+

scala> sql("select dayOfYear('1100-01-01')").show()

+-----------------------------------+
|dayofyear(CAST(1100-01-01 AS DATE))|
+-----------------------------------+
|                                  1|
+-----------------------------------+

@SparkQA
Copy link

SparkQA commented Nov 10, 2018

Test build #98682 has finished for PR 23000 at commit f2bc0b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros
Copy link
Contributor Author

ping @10110346 @gatorsmile

@squito
Copy link
Contributor

squito commented Nov 20, 2018

lgtm

@SparkQA
Copy link

SparkQA commented Nov 20, 2018

Test build #99054 has finished for PR 23000 at commit 421673a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Retest this please.

@SparkQA
Copy link

SparkQA commented Dec 17, 2018

Test build #100216 has finished for PR 23000 at commit 421673a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

cc @cloud-fan , @gatorsmile , @mgaido91

@cloud-fan
Copy link
Contributor

what was the behavior before this PR?

@attilapiros
Copy link
Contributor Author

attilapiros commented Dec 18, 2018

@cloud-fan It was off by one (at least at 1500, with each 100 years back in time the difference increases) as I described in the Jira:

+-----------------------------------+
|dayofyear(CAST(1500-01-02 AS DATE))|
+-----------------------------------+
|                                  1|
+-----------------------------------+

scala> sql("select year('1500-01-01')").show()
+------------------------------+
|year(CAST(1500-01-01 AS DATE))|
+------------------------------+
|                          1499|
+------------------------------+


scala> sql("select month('1500-01-01')").show()
+-------------------------------+
|month(CAST(1500-01-01 AS DATE))|
+-------------------------------+
|                             12|
+-------------------------------+


scala> sql("select dayOfYear('1500-01-01')").show()
+-----------------------------------+
|dayofyear(CAST(1500-01-01 AS DATE))|
+-----------------------------------+
|                                365|
+-----------------------------------+

@gatorsmile
Copy link
Member

cc @MaxGekk who is touching this

@MaxGekk
Copy link
Member

MaxGekk commented Dec 29, 2018

@gatorsmile Thanx for the ping. Actually current implementation support Proleptic Gregorian calendar even if it is not documented explicitly. Recently we switched on this chronology for parsing/formatting date/timestamps in JSON/CSV datasource (see #23150 and #23196) and in functions (see #23358).

If we support the hybrid calendar (Julian + Gregorian) here, it will introduce additional inconsistency.

@attilapiros If you need the hybrid calendar for some reasons, you should extend the DayOfYear expression by additional argument to set chronology (calendar system) otherwise you can break user's apps potentially.

Copy link
Member

@MaxGekk MaxGekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just in case, SQL standard counts dates exclusively in the Gregorian calendar, even for years before that calendar was in use.

@attilapiros
Copy link
Contributor Author

@MaxGekk thanks for the explanation, but I still a bit uncertain.

My main intention was about correcting the dayOfYear/month/year operators as the result in the following example are very strange (and not consistent with PostgreSQL too):

scala> sql("select year('1500-01-01')").show()
+------------------------------+
|year(CAST(1500-01-01 AS DATE))|
+------------------------------+
|                          1499|
+------------------------------+

scala> sql("select month('1500-01-01')").show()
+-------------------------------+
|month(CAST(1500-01-01 AS DATE))|
+-------------------------------+
|                             12|
+-------------------------------+


scala> sql("select dayOfYear('1500-01-01')").show()
+-----------------------------------+
|dayofyear(CAST(1500-01-01 AS DATE))|
+-----------------------------------+
|                                365|
+-----------------------------------+

Running the corresponding expressions in PostgreSQL 9.6:

>    SELECT DATE_PART('year', TIMESTAMP '1500-01-01 0:0:0') as "year";

| year |
| ---- |
| 1500 |

>   SELECT DATE_PART('month', TIMESTAMP '1500-01-01 0:0:0') as "month";

| month |
| ----- |
| 1     |

>  SELECT DATE_PART('doy', TIMESTAMP '1500-01-01 0:0:0') as "doy";

| doy |
| --- |
| 1   |

View on DB Fiddle

Is there any other DBMS I can take as a reference?

@MaxGekk
Copy link
Member

MaxGekk commented Jan 2, 2019

correcting the dayOfYear/month/year operators as the result in the following example are very strange ...

I see, we still use hybrid calendar in casting UTF8String to timestamps. An alternative solution could be rewriting this code:

val c = if (tz.isEmpty) {
Calendar.getInstance(timeZone)
} else {
Calendar.getInstance(
getTimeZone(f"GMT${tz.get.toChar}${segments(7)}%02d:${segments(8)}%02d"))
}
c.set(Calendar.MILLISECOND, 0)
if (justTime) {
c.set(Calendar.HOUR_OF_DAY, segments(3))
c.set(Calendar.MINUTE, segments(4))
c.set(Calendar.SECOND, segments(5))
} else {
c.set(segments(0), segments(1) - 1, segments(2), segments(3), segments(4), segments(5))
}
Some(c.getTimeInMillis * 1000 + segments(6))

by using java.time API based on Proleptic Gregorian Calendar but such changes look like more heavy. I would guess they will be done to follow SQL standard in the end, so, your fix for Julian calendar will be not needed. At the moment, I agree we need your fix to handle results from stringToTimestamp properly.

@attilapiros
Copy link
Contributor Author

Ok. Then tomorrow I will add the requested tests and we can progress with this PR.

@SparkQA
Copy link

SparkQA commented Jan 3, 2019

Test build #100692 has finished for PR 23000 at commit 300ec39.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros
Copy link
Contributor Author

gentle ping @dongjoon-hyun, @cloud-fan, @gatorsmile

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in c101182 Jan 8, 2019
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

Fixing leap year calculations for date operators (year/month/dayOfYear) where the Julian calendars are used (before 1582-10-04). In a Julian calendar every years which are multiples of 4 are leap years (there is no extra exception for years multiples of 100).

## How was this patch tested?

With a unit test ("SPARK-26002: correct day of year calculations for Julian calendar years") which focuses to these corner cases.

Manually:

```
scala> sql("select year('1500-01-01')").show()

+------------------------------+
|year(CAST(1500-01-01 AS DATE))|
+------------------------------+
|                          1500|
+------------------------------+

scala> sql("select dayOfYear('1100-01-01')").show()

+-----------------------------------+
|dayofyear(CAST(1100-01-01 AS DATE))|
+-----------------------------------+
|                                  1|
+-----------------------------------+
```

Closes apache#23000 from attilapiros/julianOffByDays.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@attilapiros attilapiros deleted the julianOffByDays branch May 8, 2019 00:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants