[SPARK-23715][SQL] the input of to/from_utc_timestamp can not have timezone #21169
Conversation
Test build #89893 has finished for PR 21169 at commit
def stringToTimestamp(
    s: UTF8String,
    timeZone: TimeZone,
    forceTimezone: Boolean): Option[SQLTimestamp] = {
It seems more like `rejectTzInString` or `rejectStringTZ`.
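As a rough illustration of what such a flag means in practice, here is a simplified, hypothetical parser sketch (this is not the actual Spark `stringToTimestamp` code; the flag name follows the suggestion above, and the regex-based offset detection is an assumption made for the example):

```scala
import java.sql.Timestamp
import java.time.OffsetDateTime

// Hypothetical helper, not Spark code: reject strings carrying an explicit offset/Z when the flag is set.
def stringToTimestampSketch(s: String, rejectTzInString: Boolean): Option[Timestamp] = {
  val hasExplicitTz = s.matches(""".*([+-]\d{2}:?\d{2}|Z)\s*$""")
  if (hasExplicitTz && rejectTzInString) {
    None // e.g. "2000-10-10 00:00:00+00:00" is rejected when the flag is set
  } else if (hasExplicitTz) {
    // flag off: honor the embedded offset (roughly the pre-2.4 behavior discussed in this PR)
    Some(Timestamp.from(OffsetDateTime.parse(s.replace(' ', 'T')).toInstant))
  } else {
    Some(Timestamp.valueOf(s)) // plain timestamp string, parsed in the JVM-local timezone
  }
}

// stringToTimestampSketch("2000-10-10 00:00:00+00:00", rejectTzInString = true)  // None
// stringToTimestampSketch("2000-10-10 00:00:00", rejectTzInString = true)        // Some(2000-10-10 00:00:00.0)
```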
@@ -27,3 +27,8 @@ select current_date = current_date(), current_timestamp = current_timestamp(), a
select a, b from ttf2 order by a, current_date;

select weekday('2007-02-03'), weekday('2009-07-30'), weekday('2017-05-27'), weekday(null), weekday('1582-10-15 13:10:15');
Does it matter if there are no success cases for from_utc_timestamp and to_utc_timestamp in here (that is, cases that don't return null)?
// in UTC timezone, and if input is string, it should not contain timezone.
// TODO: We should move the type coercion logic to expressions instead of a central
// place to put all the rules.
case e: FromUTCTimestamp if e.left.dataType == StringType =>
Should these checks go in their own rule that runs before ImplicitTypeCasts?
Catalyst assumes rules in the same batch are order-insensitive. Since this rule must run before the implicit type cast, we'd better put them in the same rule to guarantee the order.
- Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after promotes both sides to TIMESTAMP. To set `false` to `spark.sql.hive.compareDateTimestampInTimestamp` restores the previous behavior. This option will be removed in Spark 3.0.
- Since Spark 2.4, creating a managed table with nonempty location is not allowed. An exception is thrown when attempting to create a managed table with nonempty location. To set `true` to `spark.sql.allowCreatingManagedTableUsingNonemptyLocation` restores the previous behavior. This option will be removed in Spark 3.0.
- Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, no matter how the input arguments order. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception.
- Since Spark 2.4, writing an empty dataframe to a directory launches at least one write task, even if physically the dataframe has no partition. This introduces a small behavior change that for self-describing file formats like Parquet and Orc, Spark creates a metadata-only file in the target directory when writing a 0-partition dataframe, so that schema inference can still work if users read that directory later. The new behavior is more reasonable and more consistent regarding writing empty dataframe.
fix indentation
cbe37f2 to d248d4c
- Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, no matter how the input arguments order. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception.
- Since Spark 2.4, writing an empty dataframe to a directory launches at least one write task, even if physically the dataframe has no partition. This introduces a small behavior change that for self-describing file formats like Parquet and Orc, Spark creates a metadata-only file in the target directory when writing a 0-partition dataframe, so that schema inference can still work if users read that directory later. The new behavior is more reasonable and more consistent regarding writing empty dataframe.
- Since Spark 2.4, expression IDs in UDF arguments do not appear in column names. For example, a column name in Spark 2.4 is not `UDF:f(col0 AS colA#28)` but ``UDF:f(col0 AS `colA`)``.
- Since Spark 2.4, writing a dataframe with an empty or nested empty schema using any file formats (parquet, orc, json, text, csv etc.) is not allowed. An exception is thrown when attempting to write dataframes with empty schema.
what's a nested empty schema?
Like `new StructType("empty", new StructType())`: the table has a column, the column is a struct type but has 0 fields. This schema is invalid to write out.
Anyway this is an existing comment and I just fixed its indentation.
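For reference, a minimal sketch of such a "nested empty schema" using the public `StructType`/`StructField` API:

```scala
import org.apache.spark.sql.types._

// A top-level column whose type is a struct with zero fields.
// Per the note above, writing a DataFrame with this schema is rejected since Spark 2.4.
val nestedEmptySchema: StructType = StructType(Seq(
  StructField("empty", StructType(Nil))
))
```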
Let us create a dedicated ticket and let the community improve them?
Test build #89915 has finished for PR 21169 at commit
Test build #89914 has finished for PR 21169 at commit
"contains a timezone part, e.g. `2000-10-10 00:00:00+00:00`.") | ||
.booleanConf | ||
.createWithDefault(true) | ||
|
Why would we need this? Currently, if a user passes '2000-10-10 00:00:00+00:00' to <to/from>_utc_timestamp, they get the wrong answer. If they switch off this setting, they will continue to get the wrong answer rather than null. Are we accommodating the users who experienced this bug and are manually shifting the result?
existing workloads may depend on the previous behavior and think that's corrected. It's safer to provide an internal conf and tell users about it when they complain about behavior change. It's an internal conf and is invisible to end users.
existing workloads may depend on the previous behavior and think that's corrected
Then I think we need test cases (maybe in DateFunctionsSuite, since those tests would also exercise the type coercion), where REJECT_TIMEZONE_IN_STRING is false.
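A minimal sketch of such a test, assuming the internal conf key `spark.sql.function.rejectTimezoneInString` (quoted from the migration-guide text later in this thread) is settable at runtime in a local session:

```scala
import org.apache.spark.sql.SparkSession

// Assumptions: the conf key above exists and is runtime-settable, and a local session is available.
val spark = SparkSession.builder().master("local[1]").appName("reject-tz-conf-sketch").getOrCreate()
spark.conf.set("spark.sql.function.rejectTimezoneInString", "false")

// With the flag off, a timezone-bearing input should keep the pre-2.4 (non-null) behavior
// instead of returning null.
spark.sql("SELECT from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1') AS ts").show(false)
```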
-- !query 15 schema
struct<from_utc_timestamp(CAST(0 AS TIMESTAMP), PST):timestamp>
-- !query 15 output
1969-12-31 08:00:00
Since we're adding new SQLConf settings anyway, we could have a "timestamp.hive.compatibility" (or something like that), that is true by default and allows select from_utc_timestamp(cast(0 as timestamp), 'PST') to continue to produce the above answer. However, when false, it would treat 0 as 1970-01-01T00:00:00 UTC, so the above would instead produce the answer '1969-12-31 16:00:00' (which we both agree in the Jira is probably the more correct answer). Just a thought.
I think it's doable, but we should make `from_utc_timestamp` accept integral types directly.
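For the arithmetic behind the alternative reading discussed above, a small `java.time` sketch (the zone id `America/Los_Angeles` is used here as a stand-in for PST):

```scala
import java.time.{Instant, ZoneId}

// Treating the integral 0 as 1970-01-01T00:00:00 UTC and shifting it into PST (UTC-8 at that
// date) yields 1969-12-31 16:00:00, versus the 1969-12-31 08:00:00 shown in the query output.
val epoch = Instant.ofEpochSecond(0L)
println(epoch.atZone(ZoneId.of("America/Los_Angeles")).toLocalDateTime) // 1969-12-31T16:00
```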
Test build #89931 has finished for PR 21169 at commit
}

/**
 * Converts a timestamp string to microseconds from the unix epoch, w.r.t. the given timezone.
BTW, I usually avoid abbreviation in doc tho (w.r.t.).
I think we should fix the doc here too in the APIs to clarify that we reject a timezone set in the input, and return null in that case.
LGTM
"contains a timezone part, e.g. `2000-10-10 00:00:00+00:00`.") | ||
.booleanConf | ||
.createWithDefault(true) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
existing workloads may depend on the previous behavior and think that's corrected
Then I think we need test cases (maybe in DateFunctionsSuite, since those tests would also exercise the type coercion), where REJECT_TIMEZONE_IN_STRING is false.
docs/sql-programming-guide.md (Outdated)
- Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after promotes both sides to TIMESTAMP. To set `false` to `spark.sql.hive.compareDateTimestampInTimestamp` restores the previous behavior. This option will be removed in Spark 3.0.
- Since Spark 2.4, creating a managed table with nonempty location is not allowed. An exception is thrown when attempting to create a managed table with nonempty location. To set `true` to `spark.sql.allowCreatingManagedTableUsingNonemptyLocation` restores the previous behavior. This option will be removed in Spark 3.0.
- Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, no matter how the input arguments order. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception.
- In version 2.3 and earlier, `to_utc_timestamp` and `from_utc_timestamp` respect the timezone in the input timestamp string, which breaks the assumption that the input timestamp is in a specific timezone, and returns weird result. In version 2.4 and later, this problem has been fixed. `to_utc_timestamp` and `from_utc_timestamp` will return null if the input timestamp string contains timezone. As an example, `from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')` should return `2000-10-10 01:00:00`. If the input timestamp string contains timezone, e.g. `from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')`. It returns `2000-10-10 09:00:00` in Spark 2.3 (local timezone is GMT+8), and returns null in Spark 2.4. For people who don't care about this problem and want to retain the previous behavior to keep their query unchanged, you can set `spark.sql.function.rejectTimezoneInString` to false. This option will be removed in Spark 3.0 and should only be used as a temporary workaround.
In version 2.3 and earlier, `to_utc_timestamp` and `from_utc_timestamp` respect the timezone in the input timestamp string, which breaks the assumption that the input timestamp is in a specific timezone, and returns weird result.
"and returns weird result" is a bit awkward, but I am not sure my suggestion is much better:
In version 2.3 and earlier, `to_utc_timestamp` and `from_utc_timestamp` respect the timezone in the input timestamp string, which breaks the assumption that the input timestamp is in a specific timezone. Therefore, these functions can return unexpected results.
`to_utc_timestamp` and `from_utc_timestamp` will return null if the input timestamp string contains timezone. As an example, `from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')` should return `2000-10-10 01:00:00`. If the input timestamp string contains timezone, e.g. `from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')`. It returns `2000-10-10 09:00:00` in Spark 2.3 (local timezone is GMT+8), and returns null in Spark 2.4.
Suggestion:
`to_utc_timestamp` and `from_utc_timestamp` will return null if the input timestamp string contains a timezone. As an example, `from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')` will return `2000-10-10 01:00:00` in both 2.3 and 2.4. However, `from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')`, assuming a local timezone of GMT+8, will return `2000-10-10 09:00:00` in 2.3 but `null` in 2.4.
Except for my two comments on the documentation and test cases, this looks fine to me (for what that's worth, considering my limited background in Spark).
Test build #90063 has finished for PR 21169 at commit
Addresses all of my comments, thanks.
retest this please
Test build #90085 has finished for PR 21169 at commit
The CRAN issue seems fixed given the response from the CRAN sysadmin to @viirya's request. Let me restart this.
retest this please
Test build #90099 has finished for PR 21169 at commit
retest this please
retest this please.
Test build #90105 has finished for PR 21169 at commit
Test build #90108 has finished for PR 21169 at commit
Merged to master.
ping @michal-databricks
@gatorsmile I was trying to understand the underlying issue with to/from_utc_timestamp. It took more time than I expected. Here is the outcome.
@michal-databricks My understanding aligns with yours. I think if the input is a timestamp value, we are not able to simulate this semantic because Spark timestamp doesn't have timezone. If the input is a string, we have a chance to simulate it, by shifting the timezone to the local timezone when casting string to timestamp. This PR improves this simulation, and returning null is an expected behavior here.
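A small `java.time` sketch of the "shift to the local timezone" simulation described above, assuming a GMT+8 session timezone as in the migration-guide example earlier in this thread:

```scala
import java.time.{OffsetDateTime, ZoneId}

// A timezone-bearing string is shifted into the (assumed GMT+8) local timezone while being
// cast to a timestamp, which is what breaks the "input is in UTC" assumption of from_utc_timestamp.
val parsed = OffsetDateTime.parse("2018-03-13T06:18:23+00:00")
println(parsed.atZoneSameInstant(ZoneId.of("GMT+8")).toLocalDateTime) // 2018-03-13T14:18:23
```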
I'm actually not sure if we should do this, given Impala treats timestamp as timestamp without timezone, whereas Spark treats it as a UTC timestamp (with timezone). These functions are super confusing anyway, and changing them to just match Impala in some cases is still very confusing.
If we don't care about backward compatibility, I'd like to remove these 2 functions as they don't make sense according to how Spark defines timestamp. This PR is a best effort to make these 2 functions behave like what we documented:
What changes were proposed in this pull request?
`from_utc_timestamp` assumes its input is in UTC timezone and shifts it to the specified timezone. When the timestamp contains a timezone (e.g. `2018-03-13T06:18:23+00:00`), Spark breaks the semantic and respects the timezone in the string. This is not what users expect, and the result is different from Hive/Impala. `to_utc_timestamp` has the same problem. For more details, please refer to the JIRA ticket.
This PR fixes this by returning null if the input timestamp contains timezone.
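A minimal sketch of the resulting behavior, assuming a local Spark session (the expected results follow the migration-guide text above):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("from-utc-ts-sketch").getOrCreate()

// Plain timestamp string: interpreted as UTC and shifted into GMT+1 -> 2000-10-10 01:00:00
spark.sql("SELECT from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')").show(false)

// Timezone-bearing string: rejected, so the result is null after this change
spark.sql("SELECT from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')").show(false)
```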
How was this patch tested?
new tests