[SPARK-30960][SQL] add back the legacy date/timestamp format support in CSV/JSON parser #27710
Conversation
@@ -318,4 +319,35 @@ class UnivocityParserSuite extends SparkFunSuite with SQLHelper {
    }.getMessage
    assert(errMsg2.contains("i does not exist"))
  }

  test("SPARK-30960: parse date/timestamp string with legacy format") {
Unfortunately there is no UT for the JSON parser. The testability of the JSON parser is not good...
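For reference, a minimal sketch of how an end-to-end check could look through `from_json` inside a suite extending `QueryTest with SharedSparkSession` (a hypothetical test, not the one added in this PR):

```scala
import org.apache.spark.sql.Row

// End-to-end check: the legacy "T" separator should parse after this change
// (assumes the default session time zone equals the JVM time zone).
checkAnswer(
  sql("""select from_json('{"t":"2000-12-12T12:12:12"}', 't timestamp')"""),
  Row(Row(java.sql.Timestamp.valueOf("2000-12-12 12:12:12"))))
```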
@@ -1447,6 +1449,107 @@ abstract class JsonSuite extends QueryTest with SharedSparkSession with TestJson
    })
  }

  test("backward compatibility") {
This test case was removed in 3.0. Now we add it back.
          // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
          // compatibility.
          val str = UTF8String.fromString(datum)
          DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e)
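For readers outside the full diff, a self-contained sketch of the fallback pattern this snippet belongs to (the `primary`/`legacy` parameters are hypothetical stand-ins for `TimestampFormatter.parse` and `DateTimeUtils.stringToTimestamp`):

```scala
import scala.util.control.NonFatal

// Try the new parser first; if it throws, try the legacy parser; if that also
// fails, rethrow the original exception so the error reports the new format.
def parseWithFallback(datum: String,
                      primary: String => Long,
                      legacy: String => Option[Long]): Long = {
  try {
    primary(datum)
  } catch {
    case NonFatal(e) => legacy(datum).getOrElse(throw e)
  }
}
```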
Looking at what was removed there (https://github.com/apache/spark/pull/23150/files#diff-c82e4b74d2a51fed29069745ce4f9e96L164), `DateTimeUtils.stringToTime` and `DateTimeUtils.stringToTimestamp` are 2 different functions, see https://github.com/MaxGekk/spark/blob/60c0974261c947c0838457c40f4fe0e64d17ca15/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L173-L191. `stringToTime` has a few problems: it doesn't respect Spark's session time zone, and it parses in the combined (hybrid Julian/Gregorian) calendar. If you use it as a fallback, you can get parsed values in different calendars.
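For illustration, a standalone JVM sketch (not Spark code) of the calendar mismatch being described: the legacy `java.sql` path interprets ancient dates in the hybrid Julian/Gregorian calendar, while `java.time` uses the proleptic Gregorian calendar, so the same string maps to different instants.

```scala
import java.time.{LocalDateTime, ZoneId}

// Same wall-clock string, two calendars: the epoch millis differ for ancient dates.
val legacyMillis = java.sql.Timestamp.valueOf("1123-02-22 02:22:22").getTime
val modernMillis = LocalDateTime.parse("1123-02-22T02:22:22")
  .atZone(ZoneId.systemDefault()).toInstant.toEpochMilli
println((modernMillis - legacyMillis) / 86400000.0) // roughly a week apart
```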
Yeah, I just want to support the legacy format, not to be exactly the same as 2.4. We switched the calendar, so we can't be exactly the same anyway.
If we can't be exactly the same, why do we add this back?
cc @marmbrus and @gatorsmile
We can't be exactly the same in 1% of cases, and this change can make us exactly the same in 99% of cases.
- What cases used to be supported that are no longer supported?
- What remains in the 1%?
- Why can we not get to 100%?

To me, Spark suddenly failing to parse timestamps that it used to parse falls into the "extremely annoying" category of regressions. Working with timestamps is already hard and confusing. If my job suddenly starts returning `null` when it used to work as expected, it's:

- Probably going to take a while for people to figure out (while the `null` partition of their table mysteriously fills up).
- Probably non-trivial for them to figure out how to fix. (I don't think constructing / testing format strings is a trivial activity.)

I really question why we changed these APIs rather than creating new, better ones. These changes broke my pipeline and it took me hours + access to a team of Spark committers to figure out why!!!!!
`stringToTimestamp` cannot parse a time zone id with the `GMT` prefix, see https://github.com/MaxGekk/spark/blob/60c0974261c947c0838457c40f4fe0e64d17ca15/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L176
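As a side note, `java.time.ZoneId` itself accepts GMT-prefixed ids, which is what makes a follow-up like #27753 feasible (plain JVM illustration, not Spark code):

```scala
import java.time.ZoneId

// ZoneId.of understands offset-style and region-based ids, including the GMT prefix.
println(ZoneId.of("GMT+01:00"))    // GMT+01:00
println(ZoneId.of("GMT+8"))        // GMT+08:00
println(ZoneId.of("Europe/Paris")) // Europe/Paris
```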
Looks like a corner case, but let's add it back as well.
@cloud-fan Here is a draft PR which supports any zone ids by `stringToTimestamp`: #27753
Thank you for keeping this situation moving (and improving)!
@MaxGekk that's a good complement to support more legacy formats!
Test build #118979 has finished for PR 27710 at commit
Test build #118988 has finished for PR 27710 at commit
Test build #119048 has finished for PR 27710 at commit
    if (indexOfGMT != -1) {
      // ISO8601 with a weird time zone specifier (2000-01-01T00:00GMT+01:00)
      val s0 = s.substring(0, indexOfGMT)
      val s1 = s.substring(indexOfGMT + 3)
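A minimal self-contained sketch (an assumed standalone helper, not the exact Spark method this snippet comes from) of what the "strip GMT" handling does with such strings:

```scala
// Drops the "GMT" substring wherever it appears, leaving the rest of the
// string for the normal parser, e.g.
// "2000-01-01T00:00GMT+01:00" -> "2000-01-01T00:00+01:00".
def stripGmt(s: String): String = {
  val indexOfGMT = s.indexOf("GMT")
  if (indexOfGMT != -1) {
    s.substring(0, indexOfGMT) + s.substring(indexOfGMT + 3)
  } else {
    s
  }
}

println(stripGmt("2000-01-01T00:00GMT+01:00")) // 2000-01-01T00:00+01:00
println(stripGmt("GMT1123-2-22 2:22:22.123"))  // 1123-2-22 2:22:22.123
```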
There are 3 cases for GMT formatting: `GMT+8`, `GMT+08:00` and `GMT`. Do we need to check all of them in UnivocityParserSuite?
This will be fixed by #27753
Oh yeah, got it thanks.
-      nullSafeDatum(d, name, nullable, options)(timestampFormatter.parse)
+      nullSafeDatum(d, name, nullable, options) { datum =>
+        try {
+          timestampFormatter.parse(datum)
Should we also guard this fallback behavior behind a legacy config? Or will JSON/CSV not keep the same behavior as the SQL side?
With the current approach, it seems to break the rule we want to achieve in #27537: throw an exception when the result changes between old and new Spark versions.
I don't think it should be protected by a config.
The fallback was there from the very beginning without any config, and I think it's reasonable to always support the legacy format, to make the parser more relaxed.
Copy, that makes sense.
So this fallback logic is a kind of guard logic for the parser, regardless of whether the parser is the new or the legacy one.
After this is merged, #27537 needs to address the logic conflict.
LGTM
…in CSV/JSON parser

### What changes were proposed in this pull request?

Before Spark 3.0, the JSON/CSV parser had a special behavior: when it failed to parse a timestamp/date, it fell back to another way of parsing it, to support some legacy formats. The fallback was removed by https://issues.apache.org/jira/browse/SPARK-26178 and https://issues.apache.org/jira/browse/SPARK-26243.

This PR adds back this legacy fallback. Since we switched the API used for datetime operations, we can't be exactly the same as before. Here we add back support for the common legacy formats (examples from Spark 2.4):

1. the fields can have one or two digits

```
scala> sql("""select from_json('{"time":"1123-2-22 2:22:22"}', 'time Timestamp')""").show(false)
+-------------------------------------------+
|jsontostructs({"time":"1123-2-22 2:22:22"})|
+-------------------------------------------+
|[1123-02-22 02:22:22] |
+-------------------------------------------+
```

2. the separator between date and time can be "T" as well

```
scala> sql("""select from_json('{"time":"2000-12-12T12:12:12"}', 'time Timestamp')""").show(false)
+---------------------------------------------+
|jsontostructs({"time":"2000-12-12T12:12:12"})|
+---------------------------------------------+
|[2000-12-12 12:12:12] |
+---------------------------------------------+
```

3. the second fraction can be of arbitrary length

```
scala> sql("""select from_json('{"time":"1123-02-22T02:22:22.123456789123"}', 'time Timestamp')""").show(false)
+----------------------------------------------------------+
|jsontostructs({"time":"1123-02-22T02:22:22.123456789123"})|
+----------------------------------------------------------+
|[1123-02-15 02:22:22.123] |
+----------------------------------------------------------+
```

4. a date string can end with any chars after "T" or space

```
scala> sql("""select from_json('{"time":"1123-02-22Tabc"}', 'time date')""").show(false)
+----------------------------------------+
|jsontostructs({"time":"1123-02-22Tabc"})|
+----------------------------------------+
|[1123-02-22] |
+----------------------------------------+
```

5. "GMT" is removed from the string before parsing

```
scala> sql("""select from_json('{"time":"GMT1123-2-22 2:22:22.123"}', 'time Timestamp')""").show(false)
+--------------------------------------------------+
|jsontostructs({"time":"GMT1123-2-22 2:22:22.123"})|
+--------------------------------------------------+
|[1123-02-22 02:22:22.123] |
+--------------------------------------------------+
```

### Why are the changes needed?

It doesn't hurt to keep this legacy support. It just makes the parsing more relaxed.

### Does this PR introduce any user-facing change?

Yes. 3.0 now supports parsing most of the CSV/JSON values that were supported before.

### How was this patch tested?

New tests.

Closes #27710 from cloud-fan/bug2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit e4c61e3)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Merged to master and branch-3.0.
Thank you all!
…amps

### What changes were proposed in this pull request?

In the PR, I propose to change `DateTimeUtils.stringToTimestamp` to support any valid time zone id at the end of the input string. After the changes, the function accepts zone ids in the formats:
- no zone id. In that case, the function uses the local session time zone from the SQL config `spark.sql.session.timeZone`
- -[h]h:[m]m
- +[h]h:[m]m
- Z
- Short zone id, see https://docs.oracle.com/javase/8/docs/api/java/time/ZoneId.html#SHORT_IDS
- Zone ID starting with 'UTC+', 'UTC-', 'GMT+', 'GMT-', 'UT+' or 'UT-'. The ID is split in two, with a two or three letter prefix and a suffix starting with the sign. The suffix must be in the formats:
  - +|-h[h]
  - +|-hh[:]mm
  - +|-hh:mm:ss
  - +|-hhmmss
- Region-based zone IDs in the form `{area}/{city}`, such as `Europe/Paris` or `America/New_York`. The default set of region ids is supplied by the IANA Time Zone Database (TZDB).

### Why are the changes needed?

- To use `stringToTimestamp` as a substitution for the removed `stringToTime`, see apache#27710 (comment)
- To improve the UX of Spark SQL by allowing flexible formats of zone ids. Currently, Spark accepts only `Z` and zone offsets, which can be inconvenient when a time zone offset is shifted due to daylight saving rules. For instance:
```sql
spark-sql> select cast('2015-03-18T12:03:17.123456 Europe/Moscow' as timestamp);
NULL
```

### Does this PR introduce any user-facing change?

Yes. After the changes, casting strings to timestamps allows a time zone id at the end of the string:
```sql
spark-sql> select cast('2015-03-18T12:03:17.123456 Europe/Moscow' as timestamp);
2015-03-18 12:03:17.123456
```

### How was this patch tested?

- Added new test cases to the `string to timestamp` test in `DateTimeUtilsSuite`.
- Ran `CastSuite` and `AnsiCastSuite`.

Closes apache#27753 from MaxGekk/stringToTimestamp-uni-zoneId.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
        // In Spark 1.5.0, we store the data as number of days since epoch in string.
        // So, we just convert it to Int.
        try {
          parser.getText.toInt
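For context, a quick illustration (plain `java.time`, not Spark code) of what a date stored as days since the epoch maps back to:

```scala
import java.time.LocalDate

// 17000 days after 1970-01-01:
println(LocalDate.ofEpochDay(17000L)) // 2016-07-18
```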
Should we rebase this?
good catch! I think we should.
Here is the PR #28453
What changes were proposed in this pull request?
Before Spark 3.0, the JSON/CSV parser had a special behavior: when it failed to parse a timestamp/date, it fell back to another way of parsing it, to support some legacy formats. The fallback was removed by https://issues.apache.org/jira/browse/SPARK-26178 and https://issues.apache.org/jira/browse/SPARK-26243.
This PR adds back this legacy fallback. Since we switched the API used for datetime operations, we can't be exactly the same as before. Here we add back support for the common legacy formats (examples from Spark 2.4):
- the fields can have one or two digits, e.g. `1123-2-22 2:22:22`
- the separator between date and time can be "T" as well, e.g. `2000-12-12T12:12:12`
- the second fraction can be of arbitrary length, e.g. `1123-02-22T02:22:22.123456789123`
- a date string can end with any chars after "T" or space, e.g. `1123-02-22Tabc` parses as the date `1123-02-22`
- "GMT" is removed from the string before parsing, e.g. `GMT1123-2-22 2:22:22.123`
Why are the changes needed?
It doesn't hurt to keep this legacy support. It just makes the parsing more relaxed.
Does this PR introduce any user-facing change?
Yes. 3.0 now supports parsing most of the CSV/JSON values that were supported before.
How was this patch tested?
New tests.
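A hedged usage sketch (assumed file path, schema, and data; not from the PR) of what the restored fallback means when reading CSV:

```scala
// With the fallback restored, a legacy-formatted value such as
// "2000-12-12T12:12:12" is parsed as a timestamp instead of silently becoming NULL.
val df = spark.read
  .schema("time TIMESTAMP")
  .csv("/tmp/legacy_times.csv")
df.show(false)
```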