
[SPARK-26243][SQL] Use java.time API for parsing timestamps and dates from JSON #23196


Closed
wants to merge 32 commits

Conversation

Member

@MaxGekk MaxGekk commented Dec 1, 2018

What changes were proposed in this pull request?

In this PR, I propose to switch to the java.time API for parsing timestamps and dates from JSON inputs with microsecond precision. The SQL config spark.sql.legacy.timeParser.enabled allows switching back to the previous behavior of using java.text.SimpleDateFormat/FastDateFormat for parsing/generating timestamps/dates.
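As a minimal sketch of the precision gain (illustrative, not code from this PR): in java.time patterns the letter `S` means fraction-of-second, so a six-`S` pattern preserves all microsecond digits.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoField;

public class MicrosecondParsing {
    public static void main(String[] args) {
        // In java.time patterns, 'S' is fraction-of-second, so SSSSSS keeps
        // all six microsecond digits instead of truncating to milliseconds.
        DateTimeFormatter f = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS");
        LocalDateTime ts = LocalDateTime.parse("2018-12-02 11:22:33.123456", f);
        System.out.println(ts.get(ChronoField.MICRO_OF_SECOND)); // 123456
    }
}
```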

How was this patch tested?

It was tested by JsonExpressionsSuite, JsonFunctionsSuite and JsonSuite.

@SparkQA

SparkQA commented Dec 1, 2018

Test build #99555 has finished for PR 23196 at commit 4646ded.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 1, 2018

Test build #99558 has finished for PR 23196 at commit f326042.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 2, 2018

Test build #99571 has finished for PR 23196 at commit 55f2eac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -33,6 +33,8 @@ displayTitle: Spark SQL Upgrading Guide

- Spark applications which are built with Spark version 2.4 and prior, and call methods of `UserDefinedFunction`, need to be re-compiled with Spark 3.0, as they are not binary compatible with Spark 3.0.

- Since Spark 3.0, the JSON datasource uses the java.time API for parsing and generating JSON content. The new formatting implementation supports date/timestamp patterns conforming to ISO 8601. To switch back to the implementation used in Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`.
Member

The impact is not clearly documented.

Member

What are the behavior changes?

Member Author

The new implementation and the old one have slightly different pattern formats. See https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html and https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html. The two Java APIs can also behave differently. Besides that, the new one can parse timestamps with microsecond precision as a consequence of using the Java 8 java.time API.
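A hedged illustration of one such behavioural difference (not code from this PR): in `java.text.SimpleDateFormat` the letter `S` is a millisecond count, so a six-digit fraction silently shifts the timestamp, while `java.time.format.DateTimeFormatter` treats it as a fraction-of-second.

```java
import java.text.SimpleDateFormat;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Date;
import java.util.TimeZone;

public class PatternSemantics {
    public static void main(String[] args) throws Exception {
        String input = "2018-12-02 11:22:33.123456";
        // Legacy java.text: 'S' is a millisecond COUNT, so 123456 ms = 2 min 3.456 s
        // are added on top of 11:22:33, silently shifting the timestamp.
        SimpleDateFormat legacy = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSSSSS");
        legacy.setTimeZone(TimeZone.getTimeZone("UTC"));
        Date shifted = legacy.parse(input);
        SimpleDateFormat render = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
        render.setTimeZone(TimeZone.getTimeZone("UTC"));
        System.out.println(render.format(shifted));   // 2018-12-02 11:24:36.456
        // java.time: 'S' is fraction-of-second, so the wall-clock time is kept.
        DateTimeFormatter modern = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS");
        System.out.println(LocalDateTime.parse(input, modern)); // 2018-12-02T11:22:33.123456
    }
}
```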

Member Author

@gatorsmile What would you recommend to improve the text? I can add the links above, so a user can figure out the difference in their particular case. Our tests don't show any difference with our default timestamp/date patterns, but a user may use something more specific and face a behavior change.

Member

I think we can add an example that shows the diff. IIRC there is a difference between exact match and non-exact match.
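One sketch of the exact-match difference (illustrative, assuming a hypothetical input with trailing text): `SimpleDateFormat.parse` may consume only a prefix of the input and succeed, while `java.time` parsing requires the whole text to match.

```java
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class ExactMatch {
    public static void main(String[] args) throws Exception {
        String input = "2018-12-02 garbage";
        // Legacy parser consumes the valid prefix and ignores the trailing text.
        SimpleDateFormat legacy = new SimpleDateFormat("yyyy-MM-dd");
        System.out.println(legacy.parse(input)); // succeeds: prefix parsed
        // java.time requires the whole input to match and throws instead.
        try {
            LocalDate.parse(input, DateTimeFormatter.ofPattern("yyyy-MM-dd"));
        } catch (DateTimeParseException e) {
            System.out.println("strict parser rejected: " + input);
        }
    }
}
```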

Member Author

I added an example of a case where there is a difference, and updated the migration guide.

@@ -49,8 +49,8 @@ class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter {
override def beforeAll() {
super.beforeAll()
TestHive.setCacheTables(true)
// Timezone is fixed to America/Los_Angeles for those timezone sensitive tests (timestamp_*)
TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
// Timezone is fixed to GMT for those timezone sensitive tests (timestamp_*)
Member

@MaxGekk, BTW, why does this have to be GMT?

Member Author

While porting to the new parser/formatter, I faced at least two problems:

  • The time zone from the SQL config is not taken into account during parsing at all. The underlying functions take the default time zone from the JVM settings. It could be fixed by TimeZone.setDefault or by using absolute values.
  • A round trip that parses a date string to DateType and formats it back can produce a different string, because DateType stores only days since the epoch (as an Int, in UTC), and that representation loses the time zone offset. Exact matching is therefore impossible due to the lack of information. (The same round trip for TimestampType works perfectly.) This is what these changes hit: previously it worked because the specified time zone was not used at all (it did not affect the number of days when converting a string to DateType). With the new parser/formatter it matters, and I had to change the time zone to GMT to eliminate the problem of losing time zone offsets (the offset is zero for GMT).
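The second point can be sketched outside of Spark (a hypothetical input; the day count mimics what DateType would store):

```java
import java.time.LocalDate;
import java.time.ZonedDateTime;

public class EpochDays {
    public static void main(String[] args) {
        // Input with an explicit zone offset; a days-since-epoch representation
        // keeps only the day count, so the "+07:00" below is unrecoverable.
        ZonedDateTime parsed = ZonedDateTime.parse("2018-12-02T00:00:00+07:00[Asia/Bangkok]");
        long days = parsed.toLocalDate().toEpochDay();   // all that survives
        System.out.println(days);                        // 17867
        // Formatting back yields only the local calendar date, offset-free:
        System.out.println(LocalDate.ofEpochDay(days));  // 2018-12-02
    }
}
```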

Member Author

Our current approach for converting dates is inconsistent in a few places, for example:

I am seriously considering disabling the new parser/formatter outside of the CSV/JSON datasources, because it is hard to guarantee consistent behavior in combination with other date/timestamp functions. @srowen @gatorsmile @HyukjinKwon WDYT?

Member

I think consistency is indeed a problem, but why disable the new parser, rather than make this consistent? I haven't looked into whether there's a good reason they behave differently but suspect not.

Member Author

Tests passed with the new parser. I reverted all settings for HiveCompatibilitySuite.

@SparkQA

SparkQA commented Dec 5, 2018

Test build #99685 has finished for PR 23196 at commit 57600e2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • logError(s"Failed to load class $childMainClass.")
  • class CSVInferSchema(val options: CSVOptions) extends Serializable
  • class InterpretedSafeProjection(expressions: Seq[Expression]) extends Projection

@SparkQA

SparkQA commented Dec 5, 2018

Test build #99714 has finished for PR 23196 at commit 07fcf46.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Dec 5, 2018

jenkins, retest this, please

@SparkQA

SparkQA commented Dec 5, 2018

Test build #99734 has finished for PR 23196 at commit 07fcf46.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 8, 2018

Test build #99847 has finished for PR 23196 at commit 6b6ea8a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 8, 2018

Test build #99848 has finished for PR 23196 at commit 244654b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

new Random(System.nanoTime())
).getOrElse {
fail(s"Failed to create data generator for schema $dataType")
withSQLConf(SQLConf.SESSION_LOCAL_TIMEZONE.key -> "UTC") {
Contributor

I'm a little worried here. This test is a round-trip test; do you mean that if we write out a date/timestamp to JSON and read it back, the values will be different if the session timezone is not UTC?

Contributor

It should be the same, if the session local timezone doesn't change between writing and reading back.

Member Author

I'm a little worried here. This test is a round-trip test ...
It should be same, if the session local timezone doesn't change between write and read back.

Not only is the JSON parser/formatter involved in the loop, but also the conversion of milliseconds to Java's Timestamp and to other types.

Contributor

converting milliseconds to Java's Timestamp and to something else.

These don't matter once the dataframe is created.

The problem is: if we have a dataframe (no matter how it was generated) and we write it out and read it back, and the result is different, then we have a bug.
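The invariant under discussion can be sketched with plain `java.time` (an illustrative analogy, not Spark code — the formatter's zone stands in for the session local timezone):

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class TimestampRoundTrip {
    public static void main(String[] args) {
        // One "session" zone used for both write and read: the instant survives.
        DateTimeFormatter f = DateTimeFormatter
            .ofPattern("yyyy-MM-dd HH:mm:ss")
            .withZone(ZoneId.of("America/Los_Angeles"));
        Instant original = Instant.parse("2018-12-02T19:22:33Z");
        String written = f.format(original);
        Instant readBack = f.parse(written, Instant::from);
        System.out.println(original.equals(readBack));   // true
        // Reading the same text with a different zone shifts the instant.
        Instant skewed = f.withZone(ZoneId.of("UTC")).parse(written, Instant::from);
        System.out.println(original.equals(skewed));     // false
    }
}
```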

Member Author

Let me look at it deeper.

Contributor

Can we remove the timezone setting here? Then we can look at the Jenkins report, see which seed reproduces the bug, and debug it locally.

Member Author

see which seed can reproduce the bug and debug it locally.

I ran it locally many times. It is almost 100% reproducible for any seed.

Member Author

What about putting the test under the flag spark.sql.legacy.timeParser.enabled and creating a separate JIRA ticket? I believe the bug is somewhere in Spark's home-made date/time functions rather than in the Java 8 implementation of timestamp parsing.

Contributor

SGTM. Can you create the ticket, and put a TODO here that refers to it?

Member Author

Here is the ticket: https://issues.apache.org/jira/browse/SPARK-26374, and I added a TODO.

@SparkQA

SparkQA commented Dec 13, 2018

Test build #100091 has finished for PR 23196 at commit 0c7b96b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 13, 2018

Test build #100103 has started for PR 23196 at commit bbaff09.

@SparkQA

SparkQA commented Dec 14, 2018

Test build #100146 has finished for PR 23196 at commit 363482e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DateTimestampFormatterSuite extends SparkFunSuite with SQLHelper

@SparkQA

SparkQA commented Dec 14, 2018

Test build #100152 has finished for PR 23196 at commit 07e0bf8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 14, 2018

Test build #100159 has finished for PR 23196 at commit c12da1f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 15, 2018

Test build #100185 has finished for PR 23196 at commit 60ab5b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 8a27952 Dec 16, 2018
@@ -28,31 +28,44 @@ import org.apache.commons.lang3.time.FastDateFormat

import org.apache.spark.sql.internal.SQLConf

sealed trait DateTimeFormatter {
sealed trait TimestampFormatter {
Member

Why did we name it TimestampFormatter? It has DateFormatter as well.

Contributor

we have another trait: DateFormatter

Member

Eh, sorry, I meant the file name @cloud-fan.

Contributor

will fix it in #23329

asfgit pushed a commit that referenced this pull request Dec 24, 2018
…by DateTimeFormatter in comments

## What changes were proposed in this pull request?

The PRs #23150 and #23196 switched JSON and CSV datasources on new formatter for dates/timestamps which is based on `DateTimeFormatter`. In this PR, I replaced `SimpleDateFormat` by `DateTimeFormatter` to reflect the changes.

Closes #23374 from MaxGekk/java-time-docs.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
holdenk pushed a commit to holdenk/spark that referenced this pull request Jan 5, 2019
… from JSON

holdenk pushed a commit to holdenk/spark that referenced this pull request Jan 5, 2019
…by DateTimeFormatter in comments

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
… from JSON

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…by DateTimeFormatter in comments

@MaxGekk MaxGekk deleted the json-time-parser branch August 17, 2019 13:35