[SPARK-30960][SQL] add back the legacy date/timestamp format support in CSV/JSON parser #27710
Conversation
@@ -318,4 +319,35 @@ class UnivocityParserSuite extends SparkFunSuite with SQLHelper {
    }.getMessage
    assert(errMsg2.contains("i does not exist"))
  }

  test("SPARK-30960: parse date/timestamp string with legacy format") {
Unfortunately there is no UT for the JSON parser. The testability of the JSON parser is not good...
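For reference, a minimal sketch of how an end-to-end check could look through `from_json` inside a suite extending `QueryTest with SharedSparkSession` (a hypothetical test, not the one added in this PR):

```scala
import org.apache.spark.sql.Row

// End-to-end check: the legacy "T" separator should parse after this change
// (assumes the default session time zone equals the JVM time zone).
checkAnswer(
  sql("""select from_json('{"t":"2000-12-12T12:12:12"}', 't timestamp')"""),
  Row(Row(java.sql.Timestamp.valueOf("2000-12-12 12:12:12"))))
```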
@@ -1447,6 +1449,107 @@ abstract class JsonSuite extends QueryTest with SharedSparkSession with TestJson
    })
  }

  test("backward compatibility") {
This test case was removed in 3.0. Now we add it back.
          // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
          // compatibility.
          val str = UTF8String.fromString(datum)
          DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e)
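For readers outside the full diff, a self-contained sketch of the fallback pattern this snippet belongs to (the `primary`/`legacy` parameters are hypothetical stand-ins for `TimestampFormatter.parse` and `DateTimeUtils.stringToTimestamp`):

```scala
import scala.util.control.NonFatal

// Try the new parser first; if it throws, try the legacy parser; if that also
// fails, rethrow the original exception so the error reports the new format.
def parseWithFallback(datum: String,
                      primary: String => Long,
                      legacy: String => Option[Long]): Long = {
  try {
    primary(datum)
  } catch {
    case NonFatal(e) => legacy(datum).getOrElse(throw e)
  }
}
```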
Looking at what was removed there (https://github.com/apache/spark/pull/23150/files#diff-c82e4b74d2a51fed29069745ce4f9e96L164), `DateTimeUtils.stringToTime` and `DateTimeUtils.stringToTimestamp` are 2 different functions, see https://github.com/MaxGekk/spark/blob/60c0974261c947c0838457c40f4fe0e64d17ca15/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L173-L191. `stringToTime` has a few problems: it doesn't respect Spark's session time zone, and it parses in the combined (hybrid Julian/Gregorian) calendar. If you use it as a fallback, you can get parsed values in different calendars.
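For illustration, a standalone JVM sketch (not Spark code) of the calendar mismatch being described: the legacy `java.sql` path interprets ancient dates in the hybrid Julian/Gregorian calendar, while `java.time` uses the proleptic Gregorian calendar, so the same string maps to different instants.

```scala
import java.time.{LocalDateTime, ZoneId}

// Same wall-clock string, two calendars: the epoch millis differ for ancient dates.
val legacyMillis = java.sql.Timestamp.valueOf("1123-02-22 02:22:22").getTime
val modernMillis = LocalDateTime.parse("1123-02-22T02:22:22")
  .atZone(ZoneId.systemDefault()).toInstant.toEpochMilli
println((modernMillis - legacyMillis) / 86400000.0) // roughly a week apart
```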
Yeah, I just want to support the legacy format, not to be exactly the same as 2.4. We switched the calendar, so we can't be exactly the same anyway.
If we can't be exactly the same, why do we add this back?
cc @marmbrus and @gatorsmile
We can't be exactly the same in 1% of cases, and this change can make us exactly the same in 99% of cases.
- What cases used to be supported that are no longer supported?
- What remains in the 1%?
- Why can we not get to 100%?

To me, Spark suddenly failing to parse timestamps that it used to parse falls into the "extremely annoying" category of regressions. Working with timestamps is already hard and confusing. If my job suddenly starts returning `null` when it used to work as expected, it's:

- Probably going to take a while for people to figure out (while the `null` partition of their table mysteriously fills up).
- Probably non-trivial for them to figure out how to fix. (I don't think constructing / testing format strings is a trivial activity.)

I really question why we changed these APIs rather than creating new, better ones. These changes broke my pipeline and it took me hours + access to a team of Spark committers to figure out why!!!!!
`stringToTimestamp` cannot parse a time zone id with the `GMT` prefix, see https://github.com/MaxGekk/spark/blob/60c0974261c947c0838457c40f4fe0e64d17ca15/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L176
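As a side note, `java.time.ZoneId` itself accepts GMT-prefixed ids, which is what makes a follow-up like #27753 feasible (plain JVM illustration, not Spark code):

```scala
import java.time.ZoneId

// ZoneId.of understands offset-style and region-based ids, including the GMT prefix.
println(ZoneId.of("GMT+01:00"))    // GMT+01:00
println(ZoneId.of("GMT+8"))        // GMT+08:00
println(ZoneId.of("Europe/Paris")) // Europe/Paris
```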
Looks like a corner case, but let's add it back as well.
@cloud-fan Here is a draft PR which supports any zone ids by `stringToTimestamp`: #27753
Thank you for keeping this situation moving (and improving)!
@MaxGekk that's a good complement to support more legacy formats!
Test build #118979 has finished for PR 27710 at commit
Test build #118988 has finished for PR 27710 at commit
Test build #119048 has finished for PR 27710 at commit
    if (indexOfGMT != -1) {
      // ISO8601 with a weird time zone specifier (2000-01-01T00:00GMT+01:00)
      val s0 = s.substring(0, indexOfGMT)
      val s1 = s.substring(indexOfGMT + 3)
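A minimal self-contained sketch (an assumed standalone helper, not the exact Spark method this snippet comes from) of what the "strip GMT" handling does with such strings:

```scala
// Drops the "GMT" substring wherever it appears, leaving the rest of the
// string for the normal parser, e.g.
// "2000-01-01T00:00GMT+01:00" -> "2000-01-01T00:00+01:00".
def stripGmt(s: String): String = {
  val indexOfGMT = s.indexOf("GMT")
  if (indexOfGMT != -1) {
    s.substring(0, indexOfGMT) + s.substring(indexOfGMT + 3)
  } else {
    s
  }
}

println(stripGmt("2000-01-01T00:00GMT+01:00")) // 2000-01-01T00:00+01:00
println(stripGmt("GMT1123-2-22 2:22:22.123"))  // 1123-2-22 2:22:22.123
```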
There are 3 cases for GMT formatting: `GMT+8`, `GMT+08:00` and `GMT`. Do we need to check all of them in UnivocityParserSuite?
This will be fixed by #27753
Oh yeah, got it thanks.
-      nullSafeDatum(d, name, nullable, options)(timestampFormatter.parse)
+      nullSafeDatum(d, name, nullable, options) { datum =>
+        try {
+          timestampFormatter.parse(datum)
Should we also guard this fallback behavior behind a legacy config? Or will JSON/CSV not keep the same behavior as the SQL side?
With the current approach, it seems to break the rule we want to achieve in #27537: throw an exception when the result changes between old and new Spark versions.
I don't think it should be protected by a config.
The fallback was there from the very beginning without any config, and I think it's reasonable to always support the legacy format, to make the parser more relaxed.
Copy, that makes sense.
So this fallback logic is a kind of guard logic for the parser, regardless of whether the parser is the new or the legacy one.
After this is merged, #27537 needs to address the logic conflict.
LGTM
…in CSV/JSON parser

### What changes were proposed in this pull request?

Before Spark 3.0, the JSON/CSV parser had a special behavior: when it failed to parse a timestamp/date, it fell back to another way of parsing it, to support some legacy formats. The fallback was removed by https://issues.apache.org/jira/browse/SPARK-26178 and https://issues.apache.org/jira/browse/SPARK-26243.

This PR adds back this legacy fallback. Since we switched the API used for datetime operations, we can't be exactly the same as before. Here we add back support for the common legacy formats (examples from Spark 2.4):

1. the fields can have one or two digits

```
scala> sql("""select from_json('{"time":"1123-2-22 2:22:22"}', 'time Timestamp')""").show(false)
+-------------------------------------------+
|jsontostructs({"time":"1123-2-22 2:22:22"})|
+-------------------------------------------+
|[1123-02-22 02:22:22] |
+-------------------------------------------+
```

2. the separator between date and time can be "T" as well

```
scala> sql("""select from_json('{"time":"2000-12-12T12:12:12"}', 'time Timestamp')""").show(false)
+---------------------------------------------+
|jsontostructs({"time":"2000-12-12T12:12:12"})|
+---------------------------------------------+
|[2000-12-12 12:12:12] |
+---------------------------------------------+
```

3. the second fraction can be of arbitrary length

```
scala> sql("""select from_json('{"time":"1123-02-22T02:22:22.123456789123"}', 'time Timestamp')""").show(false)
+----------------------------------------------------------+
|jsontostructs({"time":"1123-02-22T02:22:22.123456789123"})|
+----------------------------------------------------------+
|[1123-02-15 02:22:22.123] |
+----------------------------------------------------------+
```

4. a date string can end with any chars after "T" or space

```
scala> sql("""select from_json('{"time":"1123-02-22Tabc"}', 'time date')""").show(false)
+----------------------------------------+
|jsontostructs({"time":"1123-02-22Tabc"})|
+----------------------------------------+
|[1123-02-22] |
+----------------------------------------+
```

5. "GMT" is removed from the string before parsing

```
scala> sql("""select from_json('{"time":"GMT1123-2-22 2:22:22.123"}', 'time Timestamp')""").show(false)
+--------------------------------------------------+
|jsontostructs({"time":"GMT1123-2-22 2:22:22.123"})|
+--------------------------------------------------+
|[1123-02-22 02:22:22.123] |
+--------------------------------------------------+
```

### Why are the changes needed?

It doesn't hurt to keep this legacy support. It just makes the parsing more relaxed.

### Does this PR introduce any user-facing change?

Yes. 3.0 now supports parsing most of the CSV/JSON values that were supported before.

### How was this patch tested?

New tests.

Closes #27710 from cloud-fan/bug2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit e4c61e3)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Merged to master and branch-3.0.
Thank you all!
…amps

### What changes were proposed in this pull request?

In the PR, I propose to change `DateTimeUtils.stringToTimestamp` to support any valid time zone id at the end of the input string. After the changes, the function accepts zone ids in the formats:
- no zone id. In that case, the function uses the local session time zone from the SQL config `spark.sql.session.timeZone`
- -[h]h:[m]m
- +[h]h:[m]m
- Z
- Short zone id, see https://docs.oracle.com/javase/8/docs/api/java/time/ZoneId.html#SHORT_IDS
- Zone ID starting with 'UTC+', 'UTC-', 'GMT+', 'GMT-', 'UT+' or 'UT-'. The ID is split in two, with a two or three letter prefix and a suffix starting with the sign. The suffix must be in the formats:
  - +|-h[h]
  - +|-hh[:]mm
  - +|-hh:mm:ss
  - +|-hhmmss
- Region-based zone IDs in the form `{area}/{city}`, such as `Europe/Paris` or `America/New_York`. The default set of region ids is supplied by the IANA Time Zone Database (TZDB).

### Why are the changes needed?

- To use `stringToTimestamp` as a substitution for the removed `stringToTime`, see apache#27710 (comment)
- To improve the UX of Spark SQL by allowing flexible formats of zone ids. Currently, Spark accepts only `Z` and zone offsets, which can be inconvenient when a time zone offset is shifted due to daylight saving rules. For instance:
```sql
spark-sql> select cast('2015-03-18T12:03:17.123456 Europe/Moscow' as timestamp);
NULL
```

### Does this PR introduce any user-facing change?

Yes. After the changes, casting strings to timestamps allows a time zone id at the end of the string:
```sql
spark-sql> select cast('2015-03-18T12:03:17.123456 Europe/Moscow' as timestamp);
2015-03-18 12:03:17.123456
```

### How was this patch tested?

- Added new test cases to the `string to timestamp` test in `DateTimeUtilsSuite`.
- Ran `CastSuite` and `AnsiCastSuite`.

Closes apache#27753 from MaxGekk/stringToTimestamp-uni-zoneId.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
        // In Spark 1.5.0, we store the data as number of days since epoch in string.
        // So, we just convert it to Int.
        try {
          parser.getText.toInt
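For context, a quick illustration (plain `java.time`, not Spark code) of what a date stored as days since the epoch maps back to:

```scala
import java.time.LocalDate

// 17000 days after 1970-01-01:
println(LocalDate.ofEpochDay(17000L)) // 2016-07-18
```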
Should we rebase this?
good catch! I think we should.
Here is the PR #28453
What changes were proposed in this pull request?
Before Spark 3.0, the JSON/CSV parser had a special behavior: when it failed to parse a timestamp/date, it fell back to another way of parsing it, to support some legacy formats. The fallback was removed by https://issues.apache.org/jira/browse/SPARK-26178 and https://issues.apache.org/jira/browse/SPARK-26243.
This PR adds back this legacy fallback. Since we switched the API used for datetime operations, we can't be exactly the same as before. Here we add back support for the common legacy formats (examples from Spark 2.4):
- the fields can have one or two digits, e.g. `1123-2-22 2:22:22`
- the separator between date and time can be "T" as well, e.g. `2000-12-12T12:12:12`
- the second fraction can be of arbitrary length, e.g. `1123-02-22T02:22:22.123456789123`
- a date string can end with any chars after "T" or space, e.g. `1123-02-22Tabc` parses as the date `1123-02-22`
- "GMT" is removed from the string before parsing, e.g. `GMT1123-2-22 2:22:22.123`
Why are the changes needed?
It doesn't hurt to keep this legacy support. It just makes the parsing more relaxed.
Does this PR introduce any user-facing change?
Yes. 3.0 now supports parsing most of the CSV/JSON values that were supported before.
How was this patch tested?
New tests.
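A hedged usage sketch (assumed file path, schema, and data; not from the PR) of what the restored fallback means when reading CSV:

```scala
// With the fallback restored, a legacy-formatted value such as
// "2000-12-12T12:12:12" is parsed as a timestamp instead of silently becoming NULL.
val df = spark.read
  .schema("time TIMESTAMP")
  .csv("/tmp/legacy_times.csv")
df.show(false)
```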