[SPARK-26243][SQL] Use java.time API for parsing timestamps and dates from JSON #23196
Conversation
Test build #99555 has finished for PR 23196 at commit
Test build #99558 has finished for PR 23196 at commit
# Conflicts:
#	docs/sql-migration-guide-upgrade.md
…in second lost because DateType contains only days since epoch
Test build #99571 has finished for PR 23196 at commit
docs/sql-migration-guide-upgrade.md
Outdated
@@ -33,6 +33,8 @@ displayTitle: Spark SQL Upgrading Guide

  - Spark applications which are built with Spark version 2.4 and prior, and call methods of `UserDefinedFunction`, need to be re-compiled with Spark 3.0, as they are not binary compatible with Spark 3.0.

+ - Since Spark 3.0, the JSON datasource uses the java.time API for parsing and generating JSON content. The new formatting implementation supports date/timestamp patterns conforming to ISO 8601. To switch back to the implementation used in Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`.
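For reference, opting back into the Spark 2.4 behavior is a one-line config change; a minimal sketch, assuming a running `SparkSession` named `spark`:

```scala
// Use the legacy SimpleDateFormat/FastDateFormat-based parser again.
spark.conf.set("spark.sql.legacy.timeParser.enabled", "true")
```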
The impact is not clearly documented.
What are the behavior changes?
The new implementation and the old one have slightly different pattern formats. See https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html and https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html. The two Java APIs can also behave differently. Besides that, the new one can parse timestamps with microsecond precision as a consequence of using the Java 8 java.time API.
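As a quick illustration of the precision point (a sketch, not from the PR): the java.time formatter can parse a six-digit second fraction, which `SimpleDateFormat` cannot represent (it only handles milliseconds):

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Parse a timestamp carrying a microsecond-precision fraction.
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS")
val ts = LocalDateTime.parse("2018-12-02 11:22:33.123456", fmt)
println(ts.getNano) // 123456000: the full microseconds survive
```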
@gatorsmile What would you recommend to improve the text? I can add the links above, so a user can figure out what the difference is in their particular case. Our tests don't show any difference with our default timestamp/date patterns, but a user might use something more specific and face a behaviour change.
I think we can add an example that shows the diff. IIRC there is a difference around exact vs. non-exact matching.
I added an example where there is a difference, and updated the migration guide.
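For readers of this thread, a hedged sketch (not the PR's actual example) of the exact vs. non-exact matching difference: `SimpleDateFormat.parse` accepts a matching prefix of the input, while a `java.time` parse requires the whole input to match:

```scala
import java.text.SimpleDateFormat
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

val input = "2018-12-02 11:22:33.123"

// Legacy parser: matches a prefix and silently ignores the trailing ".123".
val legacy = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
println(legacy.parse(input)) // succeeds; the fraction is dropped

// java.time parser: trailing unparsed text is an error.
val modern = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
LocalDateTime.parse(input, modern) // throws DateTimeParseException
```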
@@ -49,8 +49,8 @@ class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter {
   override def beforeAll() {
     super.beforeAll()
     TestHive.setCacheTables(true)
-    // Timezone is fixed to America/Los_Angeles for those timezone sensitive tests (timestamp_*)
-    TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
+    // Timezone is fixed to GMT for those timezone sensitive tests (timestamp_*)
@MaxGekk, BTW, why does this have to be GMT?
While porting to the new parser/formatter, I faced at least two problems:
- The time zone from the SQL config is not taken into account during parsing at all. The functions used take the default time zone from the JVM settings. It could be fixed by `TimeZone.setDefault` or by using absolute values.
- A round trip of parsing a date to `DateType` and formatting it back to a string can give a different string, because `DateType` stores only days since the epoch (as an `Int`, in `UTC`), and such a representation loses the time zone offset (see the sketch after this list). So exact matching is impossible due to the lack of information. (The round-trip conversion for `TimestampType`, by contrast, works perfectly.) This is the reason for the changes: previously it worked because the specified time zone was not used at all (it did not impact the number of days while converting a string to `DateType`). With the new parser/formatter it starts to matter, and I had to change the time zone to `GMT` to eliminate the problem of losing time zone offsets (the offset is zero for `GMT`).
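A minimal sketch with plain java.time (not Spark code) of the offset loss: the same instant falls on different calendar days in different zones, and an epoch-day `Int` carries no offset to disambiguate:

```scala
import java.time.{Instant, ZoneId}

val instant = Instant.parse("2018-12-03T05:00:00Z")
// In GMT the instant falls on Dec 3; in Los Angeles it is still Dec 2.
println(instant.atZone(ZoneId.of("GMT")).toLocalDate)                 // 2018-12-03
println(instant.atZone(ZoneId.of("America/Los_Angeles")).toLocalDate) // 2018-12-02
// Once only the day count since the epoch is stored, the zone is gone,
// so formatting back cannot always reproduce the original string.
```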
Our current approach for converting dates is inconsistent in a few places, for example:

- `UTF8String` -> `num days` uses hardcoded `GMT` and ignores the SQL config (sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala, line 493 in f982ca0):
  `val c = threadLocalGmtCalendar.get()`
- `String` -> `java.util.Date` ignores Spark's time zone settings and uses the system time zone; see the sketch after this list (sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala, line 186 in f982ca0):
  `Date.valueOf(s)`
- In many places, even when a function accepts a timeZone parameter, it is not passed, and the default time zone is used (not from the config but from `TimeZone.getDefault()`). For example (sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala, line 187 in 36edbac):
  `DateTimeUtils.dateToString(DateTimeUtils.fromJavaDate(d))`
- Casting to the date type depends on the type of the argument: if it is `TimestampType`, the expression-wise time zone is used, otherwise `GMT` (sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala, lines 403 to 410 in d03e0af):

```scala
private[this] def castToDate(from: DataType): Any => Any = from match {
  case StringType =>
    buildCast[UTF8String](_, s => DateTimeUtils.stringToDate(s).orNull)
  case TimestampType =>
    // throw valid precision more than seconds, according to Hive.
    // Timestamp.nanos is in 0 to 999,999,999, no more than a second.
    buildCast[Long](_, t => DateTimeUtils.millisToDays(t / 1000L, timeZone))
}
```

I really think we should disable the new parser/formatter outside of the CSV/JSON datasources, because it is hard to guarantee consistent behavior in combination with other date/timestamp functions. @srowen @gatorsmile @HyukjinKwon WDYT?
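To make the second bullet concrete, here is a hedged JDK-only sketch (not Spark code) of how `Date.valueOf` couples the result to the JVM default time zone:

```scala
import java.sql.Date
import java.util.TimeZone

// Parse the same date string under two different JVM default zones.
def epochMillisOf(s: String): Long = Date.valueOf(s).getTime

TimeZone.setDefault(TimeZone.getTimeZone("GMT"))
val inGmt = epochMillisOf("2018-12-02")
TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
val inLa = epochMillisOf("2018-12-02")
println(inLa - inGmt) // 28800000 ms (8 hours): same string, different instants
```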
I think consistency is indeed a problem, but why disable the new parser, rather than make this consistent? I haven't looked into whether there's a good reason they behave differently but suspect not.
Tests passed with the new parser. I reverted all the settings for HiveCompatibilitySuite.
Test build #99685 has finished for PR 23196 at commit
Test build #99714 has finished for PR 23196 at commit
jenkins, retest this, please
Test build #99734 has finished for PR 23196 at commit
Test build #99847 has finished for PR 23196 at commit
Test build #99848 has finished for PR 23196 at commit
new Random(System.nanoTime())
).getOrElse {
  fail(s"Failed to create data generator for schema $dataType")
withSQLConf(SQLConf.SESSION_LOCAL_TIMEZONE.key -> "UTC") {
I'm a little worried here. This test is a round-trip test: do you mean that if we write out a date/timestamp to JSON and read it back, the values will be different if the session timezone is not UTC?
It should be the same, if the session local timezone doesn't change between the write and the read back.
I'm a little worried here. This test is a round-trip test ...
It should be the same, if the session local timezone doesn't change between the write and the read back.
It's not only the JSON parser/formatter that is involved in the loop, but also the conversion of milliseconds to Java's `Timestamp` and to other types.
the conversion of milliseconds to Java's `Timestamp` and to other types.
These don't matter once the dataframe is created.
The problem is: if we have a dataframe (no matter how it is generated) and we write it out and read it back, and the result is different, we have a bug.
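For instance, a minimal sketch of the invariant (assuming a `SparkSession` named `spark` with its implicits imported, and a scratch path; this is not the suite's actual code):

```scala
import java.sql.Timestamp
import spark.implicits._

val df = Seq(Timestamp.valueOf("2018-12-02 11:22:33.123456")).toDF("ts")
df.write.mode("overwrite").json("/tmp/roundtrip")
val readBack = spark.read.schema(df.schema).json("/tmp/roundtrip")
// Whatever the session timezone, write-then-read must preserve the values.
assert(df.collect().sameElements(readBack.collect()))
```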
Let me look at it more deeply.
Can we remove the timezone setting here? Then we can look at the Jenkins report, see which seed reproduces the bug, and debug it locally.
see which seed reproduces the bug and debug it locally.
I ran it locally many times. It is almost 100% reproducible for any seed.
What about putting the test under the flag `spark.sql.legacy.timeParser.enabled` and creating a separate JIRA ticket? I believe the bug is somewhere in Spark's home-made date/time functions rather than in the Java 8 implementation of timestamp parsing.
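Concretely, the gating could look like this sketch (reusing the `withSQLConf` helper seen in the diff above; the placement is hypothetical, not the PR's actual change):

```scala
// Run the round-trip assertions only against the legacy parser for now;
// a follow-up JIRA ticket (created below in the thread) tracks the new-parser bug.
withSQLConf("spark.sql.legacy.timeParser.enabled" -> "true") {
  // ... existing round-trip test body ...
}
```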
SGTM. Can you create the ticket? And put a TODO here which refers to the ticket.
Here is the ticket: https://issues.apache.org/jira/browse/SPARK-26374, and I added a TODO.
Test build #100091 has finished for PR 23196 at commit
Test build #100103 has started for PR 23196 at commit
Test build #100146 has finished for PR 23196 at commit
Test build #100152 has finished for PR 23196 at commit
Test build #100159 has finished for PR 23196 at commit
Test build #100185 has finished for PR 23196 at commit
thanks, merging to master!
@@ -28,31 +28,44 @@ import org.apache.commons.lang3.time.FastDateFormat

 import org.apache.spark.sql.internal.SQLConf

-sealed trait DateTimeFormatter {
+sealed trait TimestampFormatter {
Why did we name it `TimestampFormatter`? It has `DateFormatter` as well.
we have another trait: DateFormatter
Eh, sorry, I meant the file name, @cloud-fan.
will fix it in #23329
…by DateTimeFormatter in comments

## What changes were proposed in this pull request?

The PRs #23150 and #23196 switched the JSON and CSV datasources to the new formatter for dates/timestamps, which is based on `DateTimeFormatter`. In this PR, I replaced `SimpleDateFormat` by `DateTimeFormatter` to reflect the changes.

Closes #23374 from MaxGekk/java-time-docs.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
… from JSON

## What changes were proposed in this pull request?

In the PR, I propose to switch to the java.time API for parsing timestamps and dates from JSON inputs with microsecond precision. The SQL config `spark.sql.legacy.timeParser.enabled` allows switching back to the previous behavior of using `java.text.SimpleDateFormat`/`FastDateFormat` for parsing/generating timestamps/dates.

## How was this patch tested?

It was tested by `JsonExpressionsSuite`, `JsonFunctionsSuite` and `JsonSuite`.

Closes apache#23196 from MaxGekk/json-time-parser.

Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?

In the PR, I propose to switch to the java.time API for parsing timestamps and dates from JSON inputs with microsecond precision. The SQL config `spark.sql.legacy.timeParser.enabled` allows switching back to the previous behavior of using `java.text.SimpleDateFormat`/`FastDateFormat` for parsing/generating timestamps/dates.

How was this patch tested?

It was tested by `JsonExpressionsSuite`, `JsonFunctionsSuite` and `JsonSuite`.