
[SPARK-23715][SQL] the input of to/from_utc_timestamp can not have timezone #21169


Closed
wants to merge 4 commits into from

Conversation

cloud-fan
Contributor

@cloud-fan cloud-fan commented Apr 26, 2018

What changes were proposed in this pull request?

from_utc_timestamp assumes its input is in the UTC timezone and shifts it to the specified timezone. When the timestamp contains a timezone (e.g. 2018-03-13T06:18:23+00:00), Spark breaks this semantic and respects the timezone in the string. This is not what users expect, and the result differs from Hive/Impala. to_utc_timestamp has the same problem.

For more details, please refer to the JIRA ticket.

This PR fixes this by returning null if the input timestamp contains a timezone.
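
To make the intended semantics concrete, here is a small illustrative sketch (assuming a SparkSession named `spark`; the expected values mirror the migration-guide entry discussed later in this thread):

```scala
// Illustrative sketch, not test code from this PR.

// No timezone in the string: the input is interpreted as a UTC time and shifted to GMT+1.
spark.sql("SELECT from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')").show()
// expected: 2000-10-10 01:00:00

// Explicit timezone offset in the string: rejected after this change, so the result is null.
spark.sql("SELECT from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')").show()
// expected: null
```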

How was this patch tested?

new tests

@cloud-fan
Contributor Author

cc @HyukjinKwon @gatorsmile

@SparkQA

SparkQA commented Apr 26, 2018

Test build #89893 has finished for PR 21169 at commit 7c1dcc3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class StringToTimestampWithoutTimezone(child: Expression, timeZoneId: Option[String] = None)

def stringToTimestamp(
    s: UTF8String,
    timeZone: TimeZone,
    forceTimezone: Boolean): Option[SQLTimestamp] = {
Contributor

It seems more like rejectTzInString or rejectStringTZ.

@@ -27,3 +27,8 @@ select current_date = current_date(), current_timestamp = current_timestamp(), a
select a, b from ttf2 order by a, current_date;

select weekday('2007-02-03'), weekday('2009-07-30'), weekday('2017-05-27'), weekday(null), weekday('1582-10-15 13:10:15');

Contributor

Does it matter if there are no success cases for from_utc_timestamp and to_utc_timestamp in here (that is, cases that don't return null)?

// in UTC timezone, and if input is string, it should not contain timezone.
// TODO: We should move the type coercion logic to expressions instead of a central
// place to put all the rules.
case e: FromUTCTimestamp if e.left.dataType == StringType =>
Contributor

Should these checks go in their own rule that runs before ImplicitTypeCasts?

Contributor Author

Catalyst assumes rules in the same batch are order insensitive. Since this rule must run before the implicit type cast, we'd better put them in the same rule to guarantee the order.
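
For readers following along, a rough sketch of what such a coercion case might look like, using the `StringToTimestampWithoutTimezone` wrapper this PR introduces (the rule body below, and the `plan` it transforms, are assumptions for illustration, not the PR's code):

```scala
// Hypothetical sketch: wrap string-typed inputs of from_utc_timestamp / to_utc_timestamp
// so that the string-to-timestamp conversion rejects values carrying an explicit timezone
// (producing null) instead of letting the generic implicit cast respect the offset.
plan transformAllExpressions {
  case e: FromUTCTimestamp if e.left.dataType == StringType =>
    e.copy(left = StringToTimestampWithoutTimezone(e.left))
  case e: ToUTCTimestamp if e.left.dataType == StringType =>
    e.copy(left = StringToTimestampWithoutTimezone(e.left))
}
```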

- Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after promotes both sides to TIMESTAMP. To set `false` to `spark.sql.hive.compareDateTimestampInTimestamp` restores the previous behavior. This option will be removed in Spark 3.0.
- Since Spark 2.4, creating a managed table with nonempty location is not allowed. An exception is thrown when attempting to create a managed table with nonempty location. To set `true` to `spark.sql.allowCreatingManagedTableUsingNonemptyLocation` restores the previous behavior. This option will be removed in Spark 3.0.
- Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, no matter how the input arguments order. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception.
- Since Spark 2.4, writing an empty dataframe to a directory launches at least one write task, even if physically the dataframe has no partition. This introduces a small behavior change that for self-describing file formats like Parquet and Orc, Spark creates a metadata-only file in the target directory when writing a 0-partition dataframe, so that schema inference can still work if users read that directory later. The new behavior is more reasonable and more consistent regarding writing empty dataframe.
Contributor Author

fix indentation

- Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, no matter how the input arguments order. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception.
- Since Spark 2.4, writing an empty dataframe to a directory launches at least one write task, even if physically the dataframe has no partition. This introduces a small behavior change that for self-describing file formats like Parquet and Orc, Spark creates a metadata-only file in the target directory when writing a 0-partition dataframe, so that schema inference can still work if users read that directory later. The new behavior is more reasonable and more consistent regarding writing empty dataframe.
- Since Spark 2.4, expression IDs in UDF arguments do not appear in column names. For example, an column name in Spark 2.4 is not `UDF:f(col0 AS colA#28)` but ``UDF:f(col0 AS `colA`)``.
- Since Spark 2.4, writing a dataframe with an empty or nested empty schema using any file formats (parquet, orc, json, text, csv etc.) is not allowed. An exception is thrown when attempting to write dataframes with empty schema.
Contributor

what's a nested empty schema?

Contributor Author

Like `new StructType("empty", new StructType())`: the table has a column, and the column is a struct type with 0 fields. This schema is invalid to write out.

Anyway this is an existing comment and I just fixed its indentation.
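
For illustration, one way to construct such a schema with the public API (an editorial sketch, not code from this PR):

```scala
import org.apache.spark.sql.types.{StructField, StructType}

// A schema with a single column "empty" whose type is a struct containing zero fields.
// Writing a DataFrame with this schema is rejected since Spark 2.4.
val nestedEmptySchema = StructType(Seq(
  StructField("empty", StructType(Nil))
))
```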

Member

Let us create a dedicated ticket and let the community improve them?

@SparkQA

SparkQA commented Apr 27, 2018

Test build #89915 has finished for PR 21169 at commit d248d4c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ImplicitTypeCasts(conf: SQLConf) extends TypeCoercionRule

@SparkQA

SparkQA commented Apr 27, 2018

Test build #89914 has finished for PR 21169 at commit cbe37f2.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ImplicitTypeCasts(conf: SQLConf) extends TypeCoercionRule

"contains a timezone part, e.g. `2000-10-10 00:00:00+00:00`.")
.booleanConf
.createWithDefault(true)

Contributor

Why would we need this? Currently, if a user passes '2000-10-10 00:00:00+00:00' to <to/from>_utc_timestamp, they get the wrong answer. If they switch off this setting, they will continue to get the wrong answer rather than null. Are we accommodating the users who experienced this bug and are manually shifting the result?

Contributor Author

existing workloads may depend on the previous behavior and think that's corrected. It's safer to provide an internal conf and tell users about it when they complain about the behavior change. Since it's an internal conf, it is invisible to end users.

Contributor

existing workloads may depend on the previous behavior and think that's corrected

Then I think we need test cases (maybe in DateFunctionsSuite, since those tests would also exercise the type coercion), where REJECT_TIMEZONE_IN_STRING is false.
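
For what it's worth, a sketch of how such a test could toggle the flag, assuming the conf key `spark.sql.function.rejectTimezoneInString` mentioned in the migration note below and a suite that mixes in Spark's `SQLTestUtils` (the test body is illustrative, not from this PR):

```scala
// Hypothetical test sketch: with the legacy flag off, a timezone-bearing string
// falls back to the pre-2.4 behavior instead of returning null.
withSQLConf("spark.sql.function.rejectTimezoneInString" -> "false") {
  val df = spark.sql("SELECT from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')")
  // Pre-2.4 semantics: the offset in the string is respected, so the result is a
  // shifted timestamp rather than null (the exact value depends on the session timezone).
  assert(df.head().get(0) != null)
}
```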

-- !query 15 schema
struct<from_utc_timestamp(CAST(0 AS TIMESTAMP), PST):timestamp>
-- !query 15 output
1969-12-31 08:00:00
Contributor

Since we're adding new SQLConf settings anyway, we could have a "timestamp.hive.compatibility" (or something like that), that is true by default and allows select from_utc_timestamp(cast(0 as timestamp), 'PST') to continue to produce the above answer. However, when false, it would treat 0 as 1970-01-01T00:00:00 UTC, so the above would instead produce the answer '1969-12-31 16:00:00' (which we both agree in the Jira is probably the more correct answer). Just a thought.

Contributor Author

I think it's doable, but we should make from_utc_timestamp accept integral types directly.
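
To spell out the two results under discussion, assuming a session timezone of America/Los_Angeles (UTC-8) as in the golden file above; the arithmetic below is an editorial sketch of the discussion, not output from this PR:

```scala
// CAST(0 AS TIMESTAMP) is the Unix epoch: 1970-01-01 00:00:00 UTC.
//
// Hive-compatible behavior (the current default, matching the golden file):
//   the epoch renders in the local session timezone as 1969-12-31 16:00:00;
//   that wall-clock value is then treated as UTC and shifted to PST (-8h),
//   giving 1969-12-31 08:00:00.
//
// Treating the instant directly as UTC (the suggested non-Hive mode):
//   1970-01-01 00:00:00 UTC shifted to PST (-8h) gives 1969-12-31 16:00:00.
spark.sql("SELECT from_utc_timestamp(CAST(0 AS TIMESTAMP), 'PST')").show()
```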

@SparkQA

SparkQA commented Apr 27, 2018

Test build #89931 has finished for PR 21169 at commit e9d1f24.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

/**
* Converts a timestamp string to microseconds from the unix epoch, w.r.t. the given timezone.
Member

BTW, I usually avoid abbreviation in doc tho (w.r.t.).

@HyukjinKwon
Member

HyukjinKwon commented Apr 28, 2018

I think we should fix the doc here too in the APIs to clarify that we reject a timezone set in the input and return null.

Member

@HyukjinKwon HyukjinKwon left a comment

LGTM

"contains a timezone part, e.g. `2000-10-10 00:00:00+00:00`.")
.booleanConf
.createWithDefault(true)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

existing workloads may depend on the previous behavior and think that's corrected

Then I think we need test cases (maybe in DateFunctionsSuite, since those tests would also exercise the type coercion), where REJECT_TIMEZONE_IN_STRING is false.

- Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after promotes both sides to TIMESTAMP. To set `false` to `spark.sql.hive.compareDateTimestampInTimestamp` restores the previous behavior. This option will be removed in Spark 3.0.
- Since Spark 2.4, creating a managed table with nonempty location is not allowed. An exception is thrown when attempting to create a managed table with nonempty location. To set `true` to `spark.sql.allowCreatingManagedTableUsingNonemptyLocation` restores the previous behavior. This option will be removed in Spark 3.0.
- Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, no matter how the input arguments order. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception.
- In version 2.3 and earlier, `to_utc_timestamp` and `from_utc_timestamp` respect the timezone in the input timestamp string, which breaks the assumption that the input timestamp is in a specific timezone, and returns weird result. In version 2.4 and later, this problem has been fixed. `to_utc_timestamp` and `from_utc_timestamp` will return null if the input timestamp string contains timezone. As an example, `from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')` should return `2000-10-10 01:00:00`. If the input timestamp string contains timezone, e.g. `from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')`. It returns `2000-10-10 09:00:00` in Spark 2.3(local timezone is GMT+8), and returns null in Spark 2.4. For people who don't care about this problem and want to retain the previous behavior to keep their query unchanged, you can set `spark.sql.function.rejectTimezoneInString` to false. This option will be removed in Spark 3.0 and should only be used as a temporary workaround.
Contributor

In version 2.3 and earlier, to_utc_timestamp and from_utc_timestamp respect the timezone in the input timestamp string, which breaks the assumption that the input timestamp is in a specific timezone, and returns weird result.

"and returns weird result" is a bit awkward, but I am not sure my suggestion is much better:

In version 2.3 and earlier, to_utc_timestamp and from_utc_timestamp respect the timezone in the input timestamp string, which breaks the assumption that the input timestamp is in a specific timezone. Therefore, these functions can return unexpected results.

to_utc_timestamp and from_utc_timestamp will return null if the input timestamp string contains timezone. As an example, from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1') should return 2000-10-10 01:00:00. If the input timestamp string contains timezone, e.g. from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1'). It returns 2000-10-10 09:00:00 in Spark 2.3(local timezone is GMT+8), and returns null in Spark 2.4.

Suggestion:

to_utc_timestamp and from_utc_timestamp will return null if the input timestamp string contains a timezone. As an example, from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1') will return 2000-10-10 01:00:00 in both 2.3 and 2.4. However, from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1'), assuming a local timezone of GMT+8, will return 2000-10-10 09:00:00 in 2.3 but null in 2.4.

@bersprockets
Contributor

Except for my two comments on the documentation and test cases, this looks fine to me (for what that's worth, considering my limited background in Spark).

@SparkQA

SparkQA commented May 2, 2018

Test build #90063 has finished for PR 21169 at commit b6d91db.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@bersprockets
Contributor

Addresses all of my comments, thanks.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented May 3, 2018

Test build #90085 has finished for PR 21169 at commit b6d91db.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

The CRAN issue seems to be fixed, given the response from the CRAN sysadmin to @viirya's request. Let me restart this.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented May 3, 2018

Test build #90099 has finished for PR 21169 at commit b6d91db.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

@viirya
Member

viirya commented May 3, 2018

retest this please.

@SparkQA

SparkQA commented May 3, 2018

Test build #90105 has finished for PR 21169 at commit b6d91db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 3, 2018

Test build #90108 has finished for PR 21169 at commit b6d91db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

@asfgit asfgit closed this in 417ad92 May 3, 2018
@gatorsmile
Member

ping @michal-databricks

@michal-databricks
Contributor

@gatorsmile I was trying to understand the underlying issue with to/from_utc_timestamp. It took more time than I expected. Here is the outcome.
These functions are not standard and seem to exist in Spark, Hive, Impala, and DB2 products originating from Netezza. I did not find the evidence, but it seems likely that Impala is the originator of the functions. The problem is that Impala implements TIMESTAMP differently than the other systems on the list (Impala's way is the standard SQL way). In Spark and Hive, a TIMESTAMP always represents a specific UTC-bound moment in time, so from/to_utc_timestamp have to resort to tricks to achieve Impala-like behavior. This specific issue (SPARK-23715) is the result of that. Similar problems existed in Hive (HIVE-12706).
As for the actual change proposed, I don't mind it, but I also don't think it will be very helpful. The previous behavior was in some way correct (given what the functions actually do, not going by the user documentation). Returning null does not indicate how to solve the problem (though at least it is documented in the release notes).
Also, if I understand correctly, this solution only works for the specific situation where the input to the function is of string type, and the following expressions will still return the 'incorrect' result:
from_utc_timestamp('1970-01-01 00:00:00+00:00' + interval '5' minute, 'GMT')
from_utc_timestamp(cast('1970-01-01 00:00:00+00:00' as timestamp), 'GMT')
from_utc_timestamp(timestamp_type_column, 'GMT')

@cloud-fan
Contributor Author

cloud-fan commented May 4, 2018

@michal-databricks My understanding aligns with yours. I think to/from_utc_timestamp doesn't make sense in Spark because of the way Spark defines timestamp. Since Spark already has these 2 functions and documents them as Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in UTC ..., we should try our best to simulate this semantic.

If the input is a timestamp value, we are not able to simulate this semantic, because a Spark timestamp doesn't carry a timezone. If the input is a string, we have a chance to simulate it by shifting the timezone to the local timezone when casting the string to timestamp. This PR improves this simulation; returning null is expected behavior here.
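
A short illustration of that distinction, assuming a SparkSession named `spark` (the outcomes are inferred from the discussion above, not taken from this PR's tests):

```scala
// String input: the new check applies, so a string carrying an explicit
// timezone offset yields null instead of a silently shifted value.
spark.sql("SELECT from_utc_timestamp('1970-01-01 00:00:00+00:00', 'GMT')").show()
// Spark 2.4: null

// Timestamp-typed input: the check cannot apply, because the offset has already
// been folded into the value when the string was cast to TIMESTAMP.
spark.sql("SELECT from_utc_timestamp(CAST('1970-01-01 00:00:00+00:00' AS TIMESTAMP), 'GMT')").show()
// Not affected by this PR; still returns the result described above as 'incorrect'.
```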

@rxin
Contributor

rxin commented Sep 19, 2018

I'm actually not sure if we should do this, given Impala treats timestamp as timestamp without timezone, whereas Spark treats it as a UTC timestamp (with timezone). These functions are super confusing anyway, and changing them to just match Impala in some cases is still very confusing.

@cloud-fan
Contributor Author

cloud-fan commented Sep 20, 2018

If we don't care about backward compatibility, I'd like to remove these 2 functions, as they don't make sense given how Spark defines timestamp. This PR is a best effort to make these 2 functions behave as we documented: Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in UTC ....


8 participants