
Conversation

@MaxGekk
Member

@MaxGekk MaxGekk commented Feb 7, 2020

What changes were proposed in this pull request?

In the PR, I propose to rewrite the `millisToDays` and `daysToMillis` of `DateTimeUtils` using the Java 8 time API.

I removed `getOffsetFromLocalMillis` from `DateTimeUtils` because it is a private method and is no longer used in Spark SQL.

Why are the changes needed?

The new implementation is based on the Proleptic Gregorian calendar, which is already used by other date-time functions. This change makes `millisToDays` and `daysToMillis` consistent with the rest of the Spark SQL API related to date & time operations.

Does this PR introduce any user-facing change?

Yes, this might affect behavior for old dates before the year 1582.

How was this patch tested?

By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`, `SQLQuerySuite` and `HiveResultSuite`.

val maxMonthInterval = 10000 * 12
checkEvaluation(
-  AddMonths(Literal(Date.valueOf("0001-01-01")), Literal(maxMonthInterval)), 2933261)
+  AddMonths(Literal(LocalDate.parse("0001-01-01")), Literal(maxMonthInterval)), 2933263)
Member Author

I had to replace `Date.valueOf` by `LocalDate.parse` due to the migration from the Julian to the Proleptic Gregorian calendar. The expected number of days is easy to get:

val a = LocalDate.of(10001, 1, 1).toEpochDay
println(a) // 2933263
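The same value can be reproduced with plain Java, since it comes straight from `java.time` (a minimal sketch, not Spark code):

```java
import java.time.LocalDate;

public class EpochDays {
    public static void main(String[] args) {
        // Proleptic Gregorian days between 1970-01-01 and 10001-01-01,
        // i.e. 0001-01-01 plus the 10000 * 12 months interval from the test
        long days = LocalDate.of(10001, 1, 1).toEpochDay();
        System.out.println(days); // 2933263
    }
}
```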

Contributor

@cloud-fan cloud-fan Feb 10, 2020

can we replace the hardcoded 2933263 with `LocalDate.of(10001, 1, 1).toEpochDay`?

@SparkQA

SparkQA commented Feb 8, 2020

Test build #118046 has finished for PR 27494 at commit cc4c07d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 8, 2020

Test build #118045 has finished for PR 27494 at commit 9a2b7be.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 8, 2020

Test build #118061 has finished for PR 27494 at commit 1890589.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 8, 2020

Test build #118072 has finished for PR 27494 at commit 6a64946.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def checkFromToJavaDate(d1: Date): Unit = {
  val d2 = toJavaDate(fromJavaDate(d1))
-  assert(d2.toString === d1.toString)
+  assert(format(d2) === format(d1))
Member Author

I had to use the formatter to avoid calendar problems: `Date.toString` formats slightly differently than `DateTimeFormatter`, I guess because of different year lengths in the calendars.
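The mismatch can be seen with plain Java (a hedged sketch, not the suite's actual `format` helper): wrapping an internal epoch-day value in `java.sql.Date` and printing it goes through the hybrid Julian/Gregorian calendar, while `LocalDate` is Proleptic Gregorian, so pre-1582 dates render differently:

```java
import java.time.LocalDate;
import java.util.TimeZone;

public class ToStringMismatch {
    public static void main(String[] args) {
        TimeZone.setDefault(TimeZone.getTimeZone("UTC"));
        // Internal Catalyst representation: days since epoch, Proleptic Gregorian
        long epochDay = LocalDate.of(1100, 1, 1).toEpochDay();
        // toJavaDate-style conversion: the same instant wrapped in java.sql.Date
        java.sql.Date d = new java.sql.Date(epochDay * 86_400_000L);
        // java.sql.Date.toString re-derives the fields via the hybrid calendar,
        // so the two strings disagree for dates before 1582
        System.out.println(d);                              // a shifted Julian-calendar date, not 1100-01-01
        System.out.println(LocalDate.ofEpochDay(epochDay)); // 1100-01-01
    }
}
```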

@MaxGekk MaxGekk changed the title [WIP] Port millisToDays and daysToMillis on Java 8 time API [SPARK-30760][SQL] Port millisToDays and daysToMillis on Java 8 time API Feb 8, 2020
@MaxGekk
Member Author

MaxGekk commented Feb 8, 2020

jenkins, retest this, please

@MaxGekk MaxGekk requested review from cloud-fan and srowen February 8, 2020 20:55
@SparkQA

SparkQA commented Feb 9, 2020

Test build #118076 has finished for PR 27494 at commit 6a64946.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

struct<date_trunc(DECADE, CAST(to_date('0002-12-31 BC', 'yyyy-MM-dd G') AS TIMESTAMP)):timestamp>
-- !query output
-0010-01-01 00:07:02
-0010-01-01 00:00:00
Member

I was going to ask if there are any behavior changes. This is one, but it looks like it was wrong before?

Are there any others that we know of? Given recent discussions I'm extra cautious about introducing behavior changes I suppose.

Member Author

This has already been described in the SQL migration guide. The numbers are different because internally we use the Proleptic Gregorian calendar in `date_trunc`, but when we collect data to the driver, we convert internal timestamps to `java.sql.Timestamp`. Actually, we switch to the Julian calendar for old dates here.

By setting `localSparkSession.conf.set(SQLConf.DATETIME_JAVA8API_ENABLED.key, true)` in `SQLQueryTestSuite.scala`, I just keep the same calendar in all transformations.

Contributor

This is caused by https://github.com/apache/spark/pull/27494/files#diff-432455394ca50800d5de508861984ca5R340, not the fix itself, right?

Do we have any known behavior changes caused by this fix?

Member Author

Yes, the change in this particular line is caused by `localSparkSession.conf.set(SQLConf.DATETIME_JAVA8API_ENABLED.key, true)`. But I had to set it to fix other test failures like in #27494 (comment)

Contributor

Do you think it's helpful to do #27494 (comment) first, so that we can know exactly what gets changed by this fix?

localSparkSession.conf.set(SQLConf.ANSI_ENABLED.key, true)
case _ =>
}
localSparkSession.conf.set(SQLConf.DATETIME_JAVA8API_ENABLED.key, true)
Contributor

I think this is a bug in `HiveResult`, not just this test suite.

When we produce string results, we must stick with the Proleptic Gregorian calendar and call `LocalDate`/`Instant.toString`. I think we should update `HiveResult.hiveResultString` and enable the Java 8 datetime API before calling `collect`.
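A hedged sketch of that suggestion (illustrative names only, not Spark's actual `HiveResult` code): once values are collected as `java.time` objects, their `toString` is already Proleptic Gregorian ISO-8601, so no extra calendar conversion is needed:

```java
import java.time.Instant;
import java.time.LocalDate;

public class HiveStringSketch {
    // Illustrative helper: render date/timestamp values via java.time types
    // so the string output follows the Proleptic Gregorian calendar end-to-end.
    static String toHiveString(Object value) {
        if (value instanceof LocalDate || value instanceof Instant) {
            return value.toString(); // ISO-8601, Proleptic Gregorian
        }
        return String.valueOf(value);
    }

    public static void main(String[] args) {
        System.out.println(toHiveString(LocalDate.of(1100, 1, 1))); // 1100-01-01
    }
}
```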

Member Author

`HiveResult` is used in hive-thriftserver too. If I set `spark.sql.datetime.java8API.enabled` to `true` in `HiveResult.hiveResultString`, this might affect the Thrift server. Though I can correct the expected results in the tests, the Thrift server may require broader changes.

Contributor

Can we open a new PR to do it? I believe we need to do this after switching to the Proleptic Gregorian calendar. We should avoid using different calendars inconsistently.

Member Author

Sure, I will do that after this PR is merged, because I will most likely need these changes.

@SparkQA

SparkQA commented Feb 10, 2020

Test build #118152 has finished for PR 27494 at commit cb37fe3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Feb 10, 2020

jenkins, retest this, please

@SparkQA

SparkQA commented Feb 10, 2020

Test build #118166 has finished for PR 27494 at commit cb37fe3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

are there any known behavior changes (bug fixes) caused by this patch?

@MaxGekk
Member Author

MaxGekk commented Feb 12, 2020

are there any known behavior changes (bug fixes) caused by this patch?

  1. millisToDays(millisUtc: Long) is called by fromJavaDate() which is used in many places for converting java.sql.Date to the number of days since epoch. For example, HiveResult, JDBC, ORC deserializer, Parquet Filters, obviously literals, Hive HadoopTableReader
  2. millisToDays(millisUtc: Long, timeZone: TimeZone) is called from CurrentBatchTimestamp (I think we can avoid that), monthsBetween, truncTimestamp (we can avoid this by [SPARK-28596][SQL] Use Java 8 time API in the date_trunc() function #25329)
  3. daysToMillis(days: SQLDate) is called from toJavaDate() which is opposite to fromJavaDate() and is used in the same places.
  4. daysToMillis(days: SQLDate, timeZone: TimeZone) is called from monthsBetween and timestamp truncations (we can avoid that).

NOTE: This affects only old dates/timestamps before the year 1582.
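The drift listed above can be seen without Spark (a minimal Java sketch; Spark's real `millisToDays` additionally applies the session time zone): the same millisecond instant maps to different local dates under the legacy hybrid calendar and under `java.time`:

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class HybridVsProleptic {
    public static void main(String[] args) {
        // Midnight of 1100-01-01 according to the legacy hybrid
        // Julian/Gregorian calendar (what java.sql.Date uses internally)
        GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear();
        cal.set(1100, 0, 1); // month is 0-based
        long millis = cal.getTimeInMillis();

        // New behavior: the same instant via the Proleptic Gregorian calendar
        LocalDate proleptic = Instant.ofEpochMilli(millis)
                .atZone(ZoneOffset.UTC).toLocalDate();

        System.out.println(proleptic); // not 1100-01-01: the calendars disagree before 1582
        System.out.println(proleptic.equals(LocalDate.of(1100, 1, 1))); // false
    }
}
```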

@cloud-fan
Contributor

OK let's get this in 3.0 to stick with the Proleptic Gregorian calendar. Can we fix the conflicts?

…ys-java8-api

# Conflicts:
#	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala
@SparkQA

SparkQA commented Feb 12, 2020

Test build #118300 has finished for PR 27494 at commit e5d463f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan closed this in aa0d136 Feb 12, 2020
@cloud-fan
Contributor

thanks, merging to master/3.0!

@MaxGekk let's fix #27494 (comment) in a followup.

cloud-fan pushed a commit that referenced this pull request Feb 12, 2020
…ime API

### What changes were proposed in this pull request?
In the PR, I propose to rewrite the `millisToDays` and `daysToMillis` of `DateTimeUtils` using Java 8 time API.

I removed `getOffsetFromLocalMillis` from `DateTimeUtils` because it is a private method and is no longer used in Spark SQL.

### Why are the changes needed?
The new implementation is based on the Proleptic Gregorian calendar, which is already used by other date-time functions. This change makes `millisToDays` and `daysToMillis` consistent with the rest of the Spark SQL API related to date & time operations.

### Does this PR introduce any user-facing change?
Yes, this might affect behavior for old dates before the year 1582.

### How was this patch tested?
By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`, `SQLQuerySuite` and `HiveResultSuite`.

Closes #27494 from MaxGekk/millis-2-days-java8-api.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit aa0d136)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
@MaxGekk MaxGekk deleted the millis-2-days-java8-api branch June 5, 2020 19:43
cloud-fan pushed a commit that referenced this pull request Jun 19, 2020
### What changes were proposed in this pull request?

1. Move TimeZoneUTC and TimeZoneGMT to DateTimeTestUtils
2. Remove TimeZoneGMT
3. Use ZoneId.systemDefault() instead of defaultTimeZone().toZoneId
4. Alias SQLDate & SQLTimestamp to internal types of DateType and TimestampType
5. Avoid one `*` in `DateTimeUtils`.`fromJulianDay()`
6. Use toTotalMonths in `DateTimeUtils`.`subtractDates()`
7. Remove `julianCommonEraStart`, `timestampToString()`, `microsToEpochDays()`, `epochDaysToMicros()`, `instantToDays()` from `DateTimeUtils`.
8. Make splitDate() private.
9. Remove `def daysToMicros(days: Int): Long` and `def microsToDays(micros: Long): Int`.

### Why are the changes needed?

This simplifies the common code related to date-time operations, and should improve maintainability. In particular:

1. TimeZoneUTC and TimeZoneGMT are moved to DateTimeTestUtils because they are used only in tests
2. TimeZoneGMT can be removed because it is equal to TimeZoneUTC
3. After the PR #27494, Spark expressions and DateTimeUtils functions switched to ZoneId instead of TimeZone completely. `defaultTimeZone()` with `TimeZone` as return type is not needed anymore.
4. SQLDate and SQLTimestamp types can be explicitly aliased to the internal types of DateType and TimestampType instead of declaring this in a comment.
5. Avoid one `*` in `DateTimeUtils`.`fromJulianDay()`.
6. Use toTotalMonths in `DateTimeUtils`.`subtractDates()`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing test suites

Closes #27617 from MaxGekk/move-time-zone-consts.

Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>