
[SPARK-23715][SQL] the input of to/from_utc_timestamp can not have timezone #21169


Closed
wants to merge 4 commits into from

Conversation

cloud-fan
Contributor

@cloud-fan cloud-fan commented Apr 26, 2018

What changes were proposed in this pull request?

from_utc_timestamp assumes its input is in the UTC timezone and shifts it to the specified timezone. When the timestamp contains a timezone (e.g. 2018-03-13T06:18:23+00:00), Spark breaks this semantic and respects the timezone in the string. This is not what users expect, and the result differs from Hive/Impala. to_utc_timestamp has the same problem.

For more details, please refer to the JIRA ticket.

This PR fixes this by returning null if the input timestamp contains a timezone.
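
To make the intended semantics concrete, here is a small illustrative sketch (assuming a SparkSession named `spark`; the expected values mirror the migration-guide entry discussed later in this thread):

```scala
// Illustrative sketch, not test code from this PR.

// No timezone in the string: the input is interpreted as a UTC time and shifted to GMT+1.
spark.sql("SELECT from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')").show()
// expected: 2000-10-10 01:00:00

// Explicit timezone offset in the string: rejected after this change, so the result is null.
spark.sql("SELECT from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')").show()
// expected: null
```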

How was this patch tested?

new tests

@cloud-fan
Contributor Author

cc @HyukjinKwon @gatorsmile

@SparkQA

SparkQA commented Apr 26, 2018

Test build #89893 has finished for PR 21169 at commit 7c1dcc3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class StringToTimestampWithoutTimezone(child: Expression, timeZoneId: Option[String] = None)

def stringToTimestamp(
    s: UTF8String,
    timeZone: TimeZone,
    forceTimezone: Boolean): Option[SQLTimestamp] = {
Contributor

It seems more like rejectTzInString or rejectStringTZ.

@@ -27,3 +27,8 @@ select current_date = current_date(), current_timestamp = current_timestamp(), a
select a, b from ttf2 order by a, current_date;

select weekday('2007-02-03'), weekday('2009-07-30'), weekday('2017-05-27'), weekday(null), weekday('1582-10-15 13:10:15');

Contributor

Does it matter if there are no success cases for from_utc_timestamp and to_utc_timestamp in here (that is, cases that don't return null)?

// in UTC timezone, and if input is string, it should not contain timezone.
// TODO: We should move the type coercion logic to expressions instead of a central
// place to put all the rules.
case e: FromUTCTimestamp if e.left.dataType == StringType =>
Contributor

Should these checks go in their own rule that runs before ImplicitTypeCasts?

Contributor Author

Catalyst assumes rules in the same batch are order insensitive. Since this rule must run before the implicit type cast, we'd better put them in the same rule to guarantee the order.
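
For readers following along, a rough sketch of what such a coercion case might look like, using the `StringToTimestampWithoutTimezone` wrapper this PR introduces (the rule body below, and the `plan` it transforms, are assumptions for illustration, not the PR's code):

```scala
// Hypothetical sketch: wrap string-typed inputs of from_utc_timestamp / to_utc_timestamp
// so that the string-to-timestamp conversion rejects values carrying an explicit timezone
// (producing null) instead of letting the generic implicit cast respect the offset.
plan transformAllExpressions {
  case e: FromUTCTimestamp if e.left.dataType == StringType =>
    e.copy(left = StringToTimestampWithoutTimezone(e.left))
  case e: ToUTCTimestamp if e.left.dataType == StringType =>
    e.copy(left = StringToTimestampWithoutTimezone(e.left))
}
```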

- Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after promotes both sides to TIMESTAMP. To set `false` to `spark.sql.hive.compareDateTimestampInTimestamp` restores the previous behavior. This option will be removed in Spark 3.0.
- Since Spark 2.4, creating a managed table with nonempty location is not allowed. An exception is thrown when attempting to create a managed table with nonempty location. To set `true` to `spark.sql.allowCreatingManagedTableUsingNonemptyLocation` restores the previous behavior. This option will be removed in Spark 3.0.
- Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, no matter how the input arguments order. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception.
- Since Spark 2.4, writing an empty dataframe to a directory launches at least one write task, even if physically the dataframe has no partition. This introduces a small behavior change that for self-describing file formats like Parquet and Orc, Spark creates a metadata-only file in the target directory when writing a 0-partition dataframe, so that schema inference can still work if users read that directory later. The new behavior is more reasonable and more consistent regarding writing empty dataframe.
Contributor Author

fix indentation

- Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, no matter how the input arguments order. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception.
- Since Spark 2.4, writing an empty dataframe to a directory launches at least one write task, even if physically the dataframe has no partition. This introduces a small behavior change that for self-describing file formats like Parquet and Orc, Spark creates a metadata-only file in the target directory when writing a 0-partition dataframe, so that schema inference can still work if users read that directory later. The new behavior is more reasonable and more consistent regarding writing empty dataframe.
- Since Spark 2.4, expression IDs in UDF arguments do not appear in column names. For example, an column name in Spark 2.4 is not `UDF:f(col0 AS colA#28)` but ``UDF:f(col0 AS `colA`)``.
- Since Spark 2.4, writing a dataframe with an empty or nested empty schema using any file formats (parquet, orc, json, text, csv etc.) is not allowed. An exception is thrown when attempting to write dataframes with empty schema.
Contributor

what's a nested empty schema?

Contributor Author

Like `new StructType("empty", new StructType())`: the table has a column, and the column is a struct type with 0 fields. This schema is invalid to write out.

Anyway this is an existing comment and I just fixed its indentation.
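
For illustration, one way to construct such a schema with the public API (an editorial sketch, not code from this PR):

```scala
import org.apache.spark.sql.types.{StructField, StructType}

// A schema with a single column "empty" whose type is a struct containing zero fields.
// Writing a DataFrame with this schema is rejected since Spark 2.4.
val nestedEmptySchema = StructType(Seq(
  StructField("empty", StructType(Nil))
))
```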

Member

Let us create a dedicated ticket and let the community improve them?

@SparkQA

SparkQA commented Apr 27, 2018

Test build #89915 has finished for PR 21169 at commit d248d4c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ImplicitTypeCasts(conf: SQLConf) extends TypeCoercionRule

@SparkQA

SparkQA commented Apr 27, 2018

Test build #89914 has finished for PR 21169 at commit cbe37f2.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ImplicitTypeCasts(conf: SQLConf) extends TypeCoercionRule

"contains a timezone part, e.g. `2000-10-10 00:00:00+00:00`.")
.booleanConf
.createWithDefault(true)

Contributor

Why would we need this? Currently, if a user passes '2000-10-10 00:00:00+00:00' to <to/from>_utc_timestamp, they get the wrong answer. If they switch off this setting, they will continue to get the wrong answer rather than null. Are we accommodating the users who experienced this bug and are manually shifting the result?

Contributor Author

existing workloads may depend on the previous behavior and think that's corrected. It's safer to provide an internal conf and tell users about it when they complain about the behavior change. Since it's an internal conf, it is invisible to end users.

Contributor

existing workloads may depend on the previous behavior and think that's corrected

Then I think we need test cases (maybe in DateFunctionsSuite, since those tests would also exercise the type coercion), where REJECT_TIMEZONE_IN_STRING is false.
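
For what it's worth, a sketch of how such a test could toggle the flag, assuming the conf key `spark.sql.function.rejectTimezoneInString` mentioned in the migration note below and a suite that mixes in Spark's `SQLTestUtils` (the test body is illustrative, not from this PR):

```scala
// Hypothetical test sketch: with the legacy flag off, a timezone-bearing string
// falls back to the pre-2.4 behavior instead of returning null.
withSQLConf("spark.sql.function.rejectTimezoneInString" -> "false") {
  val df = spark.sql("SELECT from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')")
  // Pre-2.4 semantics: the offset in the string is respected, so the result is a
  // shifted timestamp rather than null (the exact value depends on the session timezone).
  assert(df.head().get(0) != null)
}
```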

-- !query 15 schema
struct<from_utc_timestamp(CAST(0 AS TIMESTAMP), PST):timestamp>
-- !query 15 output
1969-12-31 08:00:00
Contributor

Since we're adding new SQLConf settings anyway, we could have a "timestamp.hive.compatibility" (or something like that), that is true by default and allows select from_utc_timestamp(cast(0 as timestamp), 'PST') to continue to produce the above answer. However, when false, it would treat 0 as 1970-01-01T00:00:00 UTC, so the above would instead produce the answer '1969-12-31 16:00:00' (which we both agree in the Jira is probably the more correct answer). Just a thought.

Contributor Author

I think it's doable, but we should make from_utc_timestamp accept integral types directly.
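
To spell out the two results under discussion, assuming a session timezone of America/Los_Angeles (UTC-8) as in the golden file above; the arithmetic below is an editorial sketch of the discussion, not output from this PR:

```scala
// CAST(0 AS TIMESTAMP) is the Unix epoch: 1970-01-01 00:00:00 UTC.
//
// Hive-compatible behavior (the current default, matching the golden file):
//   the epoch renders in the local session timezone as 1969-12-31 16:00:00;
//   that wall-clock value is then treated as UTC and shifted to PST (-8h),
//   giving 1969-12-31 08:00:00.
//
// Treating the instant directly as UTC (the suggested non-Hive mode):
//   1970-01-01 00:00:00 UTC shifted to PST (-8h) gives 1969-12-31 16:00:00.
spark.sql("SELECT from_utc_timestamp(CAST(0 AS TIMESTAMP), 'PST')").show()
```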

@SparkQA

SparkQA commented Apr 27, 2018

Test build #89931 has finished for PR 21169 at commit e9d1f24.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

/**
* Converts a timestamp string to microseconds from the unix epoch, w.r.t. the given timezone.
Member

BTW, I usually avoid abbreviation in doc tho (w.r.t.).

@HyukjinKwon
Member

HyukjinKwon commented Apr 28, 2018

I think we should fix the doc here too in the APIs to clarify that we reject a timezone set in the input and return null.

Member

@HyukjinKwon HyukjinKwon left a comment

LGTM

"contains a timezone part, e.g. `2000-10-10 00:00:00+00:00`.")
.booleanConf
.createWithDefault(true)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

existing workloads may depend on the previous behavior and think that's corrected

Then I think we need test cases (maybe in DateFunctionsSuite, since those tests would also exercise the type coercion), where REJECT_TIMEZONE_IN_STRING is false.

- Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after promotes both sides to TIMESTAMP. To set `false` to `spark.sql.hive.compareDateTimestampInTimestamp` restores the previous behavior. This option will be removed in Spark 3.0.
- Since Spark 2.4, creating a managed table with nonempty location is not allowed. An exception is thrown when attempting to create a managed table with nonempty location. To set `true` to `spark.sql.allowCreatingManagedTableUsingNonemptyLocation` restores the previous behavior. This option will be removed in Spark 3.0.
- Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, no matter how the input arguments order. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception.
- In version 2.3 and earlier, `to_utc_timestamp` and `from_utc_timestamp` respect the timezone in the input timestamp string, which breaks the assumption that the input timestamp is in a specific timezone, and returns weird result. In version 2.4 and later, this problem has been fixed. `to_utc_timestamp` and `from_utc_timestamp` will return null if the input timestamp string contains timezone. As an example, `from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')` should return `2000-10-10 01:00:00`. If the input timestamp string contains timezone, e.g. `from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')`. It returns `2000-10-10 09:00:00` in Spark 2.3(local timezone is GMT+8), and returns null in Spark 2.4. For people who don't care about this problem and want to retain the previous behavior to keep their query unchanged, you can set `spark.sql.function.rejectTimezoneInString` to false. This option will be removed in Spark 3.0 and should only be used as a temporary workaround.
Contributor

In version 2.3 and earlier, to_utc_timestamp and from_utc_timestamp respect the timezone in the input timestamp string, which breaks the assumption that the input timestamp is in a specific timezone, and returns weird result.

"and returns weird result" is a bit awkward, but I am not sure my suggestion is much better:

In version 2.3 and earlier, to_utc_timestamp and from_utc_timestamp respect the timezone in the input timestamp string, which breaks the assumption that the input timestamp is in a specific timezone. Therefore, these functions can return unexpected results.

to_utc_timestamp and from_utc_timestamp will return null if the input timestamp string contains timezone. As an example, from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1') should return 2000-10-10 01:00:00. If the input timestamp string contains timezone, e.g. from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1'). It returns 2000-10-10 09:00:00 in Spark 2.3(local timezone is GMT+8), and returns null in Spark 2.4.

Suggestion:

to_utc_timestamp and from_utc_timestamp will return null if the input timestamp string contains a timezone. As an example, from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1') will return 2000-10-10 01:00:00 in both 2.3 and 2.4. However, from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1'), assuming a local timezone of GMT+8, will return 2000-10-10 09:00:00 in 2.3 but null in 2.4.

@bersprockets
Contributor

Except for my two comments on the documentation and test cases, this looks fine to me (for what that's worth, considering my limited background in Spark).

@SparkQA

SparkQA commented May 2, 2018

Test build #90063 has finished for PR 21169 at commit b6d91db.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@bersprockets
Contributor

Addresses all of my comments, thanks.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented May 3, 2018

Test build #90085 has finished for PR 21169 at commit b6d91db.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

The CRAN issue seems to be fixed, given the response from the CRAN sysadmin to @viirya's request. Let me restart this.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented May 3, 2018

Test build #90099 has finished for PR 21169 at commit b6d91db.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

@viirya
Member

viirya commented May 3, 2018

retest this please.

@SparkQA

SparkQA commented May 3, 2018

Test build #90105 has finished for PR 21169 at commit b6d91db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 3, 2018

Test build #90108 has finished for PR 21169 at commit b6d91db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

@asfgit asfgit closed this in 417ad92 May 3, 2018
@gatorsmile
Member

ping @michal-databricks

@michal-databricks
Contributor

@gatorsmile I was trying to understand the underlying issue with to/from_utc_timestamp. It took more time than I expected. Here is the outcome.
These functions are not standard and seem to exist in Spark, Hive, Impala, and DB2 products originating from Netezza. I did not find the evidence, but it seems likely that Impala is the originator of the functions. The problem is that Impala implements TIMESTAMP differently than the other systems on the list (Impala's way is the standard SQL way). In Spark and Hive, a TIMESTAMP always represents a specific UTC-bound moment in time, so from/to_utc_timestamp have to resort to tricks to achieve Impala-like behavior. This specific issue (SPARK-23715) is the result of that. Similar problems existed in Hive (HIVE-12706).
As for the actual change proposed, I don't mind it, but I also don't think it will be very helpful. The previous behavior was in some way correct (given what the functions actually do, not going by the user documentation). Returning null does not indicate how to solve the problem (though at least it is documented in the release notes).
Also, if I understand correctly, this solution only works for the specific situation where the input to the function is of string type, and the following expressions will still return the 'incorrect' result:
from_utc_timestamp('1970-01-01 00:00:00+00:00' + interval '5' minute, 'GMT')
from_utc_timestamp(cast('1970-01-01 00:00:00+00:00' as timestamp), 'GMT')
from_utc_timestamp(timestamp_type_column, 'GMT')

@cloud-fan
Contributor Author

cloud-fan commented May 4, 2018

@michal-databricks My understanding aligns with yours. I think to/from_utc_timestamp doesn't make sense in Spark because of the way Spark defines timestamp. Since Spark already has these 2 functions and documents them as Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in UTC ..., we should try our best to simulate this semantic.

If the input is a timestamp value, we are not able to simulate this semantic, because a Spark timestamp doesn't carry a timezone. If the input is a string, we have a chance to simulate it by shifting the timezone to the local timezone when casting the string to timestamp. This PR improves this simulation; returning null is expected behavior here.
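
A short illustration of that distinction, assuming a SparkSession named `spark` (the outcomes are inferred from the discussion above, not taken from this PR's tests):

```scala
// String input: the new check applies, so a string carrying an explicit
// timezone offset yields null instead of a silently shifted value.
spark.sql("SELECT from_utc_timestamp('1970-01-01 00:00:00+00:00', 'GMT')").show()
// Spark 2.4: null

// Timestamp-typed input: the check cannot apply, because the offset has already
// been folded into the value when the string was cast to TIMESTAMP.
spark.sql("SELECT from_utc_timestamp(CAST('1970-01-01 00:00:00+00:00' AS TIMESTAMP), 'GMT')").show()
// Not affected by this PR; still returns the result described above as 'incorrect'.
```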

@rxin
Contributor

rxin commented Sep 19, 2018

I'm actually not sure if we should do this, given Impala treats timestamp as timestamp without timezone, whereas Spark treats it as a UTC timestamp (with timezone). These functions are super confusing anyway, and changing them to just match Impala in some cases is still very confusing.

@cloud-fan
Contributor Author

cloud-fan commented Sep 20, 2018

If we don't care about backward compatibility, I'd like to remove these 2 functions, as they don't make sense given how Spark defines timestamp. This PR is a best effort to make these 2 functions behave as we documented: Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in UTC ....


8 participants