[SPARK-26243][SQL] Use java.time API for parsing timestamps and dates from JSON #23196
Conversation
Test build #99555 has finished for PR 23196 at commit
Test build #99558 has finished for PR 23196 at commit
# Conflicts:
#	docs/sql-migration-guide-upgrade.md
…in second lost because DateType contains only days since epoch
Test build #99571 has finished for PR 23196 at commit
docs/sql-migration-guide-upgrade.md
Outdated
@@ -33,6 +33,8 @@ displayTitle: Spark SQL Upgrading Guide

  - Spark applications which are built with Spark version 2.4 and prior, and call methods of `UserDefinedFunction`, need to be re-compiled with Spark 3.0, as they are not binary compatible with Spark 3.0.

+ - Since Spark 3.0, the JSON datasource uses the java.time API for parsing and generating JSON content. The new formatting implementation supports date/timestamp patterns conforming to ISO 8601. To switch back to the implementation used in Spark 2.4 and earlier, set `spark.sql.legacy.timeParser.enabled` to `true`.
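For reference, opting back into the Spark 2.4 behavior is a one-line config change; a minimal sketch, assuming a running `SparkSession` named `spark`:

```scala
// Use the legacy SimpleDateFormat/FastDateFormat-based parser again.
spark.conf.set("spark.sql.legacy.timeParser.enabled", "true")
```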
The impact is not clearly documented.
What are the behavior changes?
The new implementation and the old one have slightly different pattern formats. See https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html and https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html. The two Java APIs can also behave differently. Besides that, the new one can parse timestamps with microsecond precision as a consequence of using the Java 8 java.time API.
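As a quick illustration of the precision point (a sketch, not from the PR): the java.time formatter can parse a six-digit second fraction, which `SimpleDateFormat` cannot represent (it only handles milliseconds):

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Parse a timestamp carrying a microsecond-precision fraction.
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS")
val ts = LocalDateTime.parse("2018-12-02 11:22:33.123456", fmt)
println(ts.getNano) // 123456000: the full microseconds survive
```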
@gatorsmile What would you recommend to improve the text? I can add the links above, so a user can figure out what the difference is in their particular case. Our tests don't show any difference with our default timestamp/date patterns, but a user might use something more specific and face a behaviour change.
I think we can add an example that shows the diff. IIRC there is a difference around exact vs. non-exact matching.
I added an example where there is a difference, and updated the migration guide.
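For readers of this thread, a hedged sketch (not the PR's actual example) of the exact vs. non-exact matching difference: `SimpleDateFormat.parse` accepts a matching prefix of the input, while a `java.time` parse requires the whole input to match:

```scala
import java.text.SimpleDateFormat
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

val input = "2018-12-02 11:22:33.123"

// Legacy parser: matches a prefix and silently ignores the trailing ".123".
val legacy = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
println(legacy.parse(input)) // succeeds; the fraction is dropped

// java.time parser: trailing unparsed text is an error.
val modern = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
LocalDateTime.parse(input, modern) // throws DateTimeParseException
```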
@@ -49,8 +49,8 @@ class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter {
   override def beforeAll() {
     super.beforeAll()
     TestHive.setCacheTables(true)
-    // Timezone is fixed to America/Los_Angeles for those timezone sensitive tests (timestamp_*)
-    TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
+    // Timezone is fixed to GMT for those timezone sensitive tests (timestamp_*)
@MaxGekk, BTW, why does this have to be GMT?
While porting to the new parser/formatter, I faced at least two problems:
- The time zone from the SQL config is not taken into account during parsing at all. The functions used take the default time zone from the JVM settings. It could be fixed by `TimeZone.setDefault` or by using absolute values.
- A round trip of parsing a date to `DateType` and formatting it back to a string can give a different string, because `DateType` stores only days since the epoch (as an `Int`, in `UTC`), and such a representation loses the time zone offset (see the sketch after this list). So exact matching is impossible due to the lack of information. (The round-trip conversion for `TimestampType`, by contrast, works perfectly.) This is the reason for the changes: previously it worked because the specified time zone was not used at all (it did not impact the number of days while converting a string to `DateType`). With the new parser/formatter it starts to matter, and I had to change the time zone to `GMT` to eliminate the problem of losing time zone offsets (the offset is zero for `GMT`).
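A minimal sketch with plain java.time (not Spark code) of the offset loss: the same instant falls on different calendar days in different zones, and an epoch-day `Int` carries no offset to disambiguate:

```scala
import java.time.{Instant, ZoneId}

val instant = Instant.parse("2018-12-03T05:00:00Z")
// In GMT the instant falls on Dec 3; in Los Angeles it is still Dec 2.
println(instant.atZone(ZoneId.of("GMT")).toLocalDate)                 // 2018-12-03
println(instant.atZone(ZoneId.of("America/Los_Angeles")).toLocalDate) // 2018-12-02
// Once only the day count since the epoch is stored, the zone is gone,
// so formatting back cannot always reproduce the original string.
```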
Our current approach for converting dates is inconsistent in a few places, for example:

- `UTF8String` -> `num days` uses hardcoded `GMT` and ignores the SQL config (sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala, line 493 in f982ca0):
  `val c = threadLocalGmtCalendar.get()`
- `String` -> `java.util.Date` ignores Spark's time zone settings and uses the system time zone; see the sketch after this list (sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala, line 186 in f982ca0):
  `Date.valueOf(s)`
- In many places, even when a function accepts a timeZone parameter, it is not passed, and the default time zone is used (not from the config but from `TimeZone.getDefault()`). For example (sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala, line 187 in 36edbac):
  `DateTimeUtils.dateToString(DateTimeUtils.fromJavaDate(d))`
- Casting to the date type depends on the type of the argument: if it is `TimestampType`, the expression-wise time zone is used, otherwise `GMT` (sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala, lines 403 to 410 in d03e0af):

```scala
private[this] def castToDate(from: DataType): Any => Any = from match {
  case StringType =>
    buildCast[UTF8String](_, s => DateTimeUtils.stringToDate(s).orNull)
  case TimestampType =>
    // throw valid precision more than seconds, according to Hive.
    // Timestamp.nanos is in 0 to 999,999,999, no more than a second.
    buildCast[Long](_, t => DateTimeUtils.millisToDays(t / 1000L, timeZone))
}
```

I really think we should disable the new parser/formatter outside of the CSV/JSON datasources, because it is hard to guarantee consistent behavior in combination with other date/timestamp functions. @srowen @gatorsmile @HyukjinKwon WDYT?
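To make the second bullet concrete, here is a hedged JDK-only sketch (not Spark code) of how `Date.valueOf` couples the result to the JVM default time zone:

```scala
import java.sql.Date
import java.util.TimeZone

// Parse the same date string under two different JVM default zones.
def epochMillisOf(s: String): Long = Date.valueOf(s).getTime

TimeZone.setDefault(TimeZone.getTimeZone("GMT"))
val inGmt = epochMillisOf("2018-12-02")
TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
val inLa = epochMillisOf("2018-12-02")
println(inLa - inGmt) // 28800000 ms (8 hours): same string, different instants
```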
I think consistency is indeed a problem, but why disable the new parser, rather than make this consistent? I haven't looked into whether there's a good reason they behave differently but suspect not.
Tests passed with the new parser. I reverted all the settings for HiveCompatibilitySuite.
Test build #99685 has finished for PR 23196 at commit
Test build #99714 has finished for PR 23196 at commit
jenkins, retest this, please
Test build #99734 has finished for PR 23196 at commit
Test build #99847 has finished for PR 23196 at commit
Test build #99848 has finished for PR 23196 at commit
new Random(System.nanoTime())
).getOrElse {
  fail(s"Failed to create data generator for schema $dataType")
withSQLConf(SQLConf.SESSION_LOCAL_TIMEZONE.key -> "UTC") {
I'm a little worried here. This test is a round-trip test: do you mean that if we write out a date/timestamp to JSON and read it back, the values will be different if the session timezone is not UTC?
It should be the same, if the session local timezone doesn't change between the write and the read back.
I'm a little worried here. This test is a round-trip test ...
It should be the same, if the session local timezone doesn't change between the write and the read back.
It's not only the JSON parser/formatter that is involved in the loop, but also the conversion of milliseconds to Java's `Timestamp` and to other types.
the conversion of milliseconds to Java's `Timestamp` and to other types.
These don't matter once the dataframe is created.
The problem is: if we have a dataframe (no matter how it is generated) and we write it out and read it back, and the result is different, we have a bug.
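For instance, a minimal sketch of the invariant (assuming a `SparkSession` named `spark` with its implicits imported, and a scratch path; this is not the suite's actual code):

```scala
import java.sql.Timestamp
import spark.implicits._

val df = Seq(Timestamp.valueOf("2018-12-02 11:22:33.123456")).toDF("ts")
df.write.mode("overwrite").json("/tmp/roundtrip")
val readBack = spark.read.schema(df.schema).json("/tmp/roundtrip")
// Whatever the session timezone, write-then-read must preserve the values.
assert(df.collect().sameElements(readBack.collect()))
```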
Let me look at it more deeply.
Can we remove the timezone setting here? Then we can look at the Jenkins report, see which seed reproduces the bug, and debug it locally.
see which seed reproduces the bug and debug it locally.
I ran it locally many times. It is almost 100% reproducible for any seed.
What about putting the test under the flag `spark.sql.legacy.timeParser.enabled` and creating a separate JIRA ticket? I believe the bug is somewhere in Spark's home-made date/time functions rather than in the Java 8 implementation of timestamp parsing.
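Concretely, the gating could look like this sketch (reusing the `withSQLConf` helper seen in the diff above; the placement is hypothetical, not the PR's actual change):

```scala
// Run the round-trip assertions only against the legacy parser for now;
// a follow-up JIRA ticket (created below in the thread) tracks the new-parser bug.
withSQLConf("spark.sql.legacy.timeParser.enabled" -> "true") {
  // ... existing round-trip test body ...
}
```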
SGTM. Can you create the ticket? And put a TODO here which refers to the ticket.
Here is the ticket: https://issues.apache.org/jira/browse/SPARK-26374, and I added a TODO.
Test build #100091 has finished for PR 23196 at commit
Test build #100103 has started for PR 23196 at commit
Test build #100146 has finished for PR 23196 at commit
Test build #100152 has finished for PR 23196 at commit
Test build #100159 has finished for PR 23196 at commit
Test build #100185 has finished for PR 23196 at commit
thanks, merging to master!
@@ -28,31 +28,44 @@ import org.apache.commons.lang3.time.FastDateFormat

 import org.apache.spark.sql.internal.SQLConf

-sealed trait DateTimeFormatter {
+sealed trait TimestampFormatter {
Why did we name it `TimestampFormatter`? It has `DateFormatter` as well.
we have another trait: DateFormatter
Eh, sorry, I meant the file name, @cloud-fan.
will fix it in #23329
…by DateTimeFormatter in comments

## What changes were proposed in this pull request?

The PRs #23150 and #23196 switched the JSON and CSV datasources to the new formatter for dates/timestamps, which is based on `DateTimeFormatter`. In this PR, I replaced `SimpleDateFormat` by `DateTimeFormatter` to reflect the changes.

Closes #23374 from MaxGekk/java-time-docs.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
… from JSON

## What changes were proposed in this pull request?

In the PR, I propose to switch to the java.time API for parsing timestamps and dates from JSON inputs with microsecond precision. The SQL config `spark.sql.legacy.timeParser.enabled` allows switching back to the previous behavior of using `java.text.SimpleDateFormat`/`FastDateFormat` for parsing/generating timestamps/dates.

## How was this patch tested?

It was tested by `JsonExpressionsSuite`, `JsonFunctionsSuite` and `JsonSuite`.

Closes apache#23196 from MaxGekk/json-time-parser.

Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?

In the PR, I propose to switch to the java.time API for parsing timestamps and dates from JSON inputs with microsecond precision. The SQL config `spark.sql.legacy.timeParser.enabled` allows switching back to the previous behavior of using `java.text.SimpleDateFormat`/`FastDateFormat` for parsing/generating timestamps/dates.

How was this patch tested?

It was tested by `JsonExpressionsSuite`, `JsonFunctionsSuite` and `JsonSuite`.