[SPARK-25496][SQL] Deprecate from_utc_timestamp and to_utc_timestamp #24195
Conversation
@cloud-fan Please take a look at the PR.
@@ -1025,6 +1027,11 @@ case class TimeAdd(start: Expression, interval: Expression, timeZoneId: Option[S
case class FromUTCTimestamp(left: Expression, right: Expression)
  extends BinaryExpression with ImplicitCastInputTypes {

    if (!SQLConf.get.utcTimestampFuncEnabled) {
      throw new AnalysisException(s"The $prettyName function has been disabled since Spark 3.0." +
Ur, @MaxGekk. Usually, deprecation means showing warnings instead of throwing an Exception. This looks like a ban to me.
Yea, deprecation is deprecation. I think we can simply deprecate them in functions.scala, and that's all.
Wait. Aren't these functions from Hive? We can't drop them if so. I know the semantics are weird, but they aren't incoherent, just poorly named.
Test build #103876 has finished for PR 24195 at commit
AFAIK Hive's timestamp has different semantics from Spark's timestamp, so I think it's more than a naming issue. It's arguable whether we should deprecate these two functions, but we definitely should not ban them.
retest this please
some more thoughts: I think there is really no way to deprecate functions in SQL, because users don't see warning messages. The current approach looks fine to me as a way to deprecate SQL functions, and a flag is needed for sure. LGTM
@cloud-fan @MaxGekk Sorry if my question is dumb, as I don't know the inner workings of the timestamp type (just want to learn). I know that these functions are used frequently. I just looked at this page (link). For all three examples on that page, our result seems to match the expected result?
@cloud-fan this does more than deprecate them, it makes them fail (though with an undocumented flag to re-enable). This still feels a bit strong to me. Do we have evidence users are a) not using this at all or b) using it incorrectly? The function seems to work as intended w.r.t. the Hive function it's copying. Why do we really have to remove this? You can say, leave them enabled by default, but then why bother with a flag now?
@dilipbiswal Let's look at the first example, and assume that we are in the UTC-11:00 session time zone:

scala> spark.conf.set("spark.sql.session.timeZone", "UTC-11:00")

and apply a function which assumes its input is of TimestampType:

scala> spark.sql("values HOUR(FROM_UTC_TIMESTAMP(TIMESTAMP '2011-12-25 09:00:00.123456', 'Asia/Tokyo'))").show(false)
+----+
|col1|
+----+
|5   |
+----+

The name of the function tells us we converted a timestamp from UTC, so, in UTC the hour is … In general, problems occur when you apply functions or expressions to the output of from_utc_timestamp.
I think this is what is happening: …

Not 100% sure, but that explains the result. It's weird and confusing, but is it the from_utc_timestamp part? Step 2 seems to be the problem. I 'give in' on disabling the confusing function in step 1; its semantics are something even I have to think about three times, and I even tried to fix the docs once. But is my analysis right, and are your other changes helping improve or fix the rest of it?
Test build #103986 has finished for PR 24195 at commit
I have traced what is happening in the example: …

It seems we have another issue here, in folding with the wrong time zone, besides from_utc_timestamp itself.
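For what it's worth, a hedged reconstruction of the arithmetic in plain java.time (my reading; it assumes the literal is parsed in the session zone UTC-11:00 and that HOUR ends up being evaluated in UTC because of constant folding):

import java.time.{LocalDateTime, ZoneId, ZoneOffset}

// 1. The literal '2011-12-25 09:00:00' is parsed in the session zone UTC-11:00,
//    which gives the internal instant 2011-12-25T20:00:00Z.
val parsed = LocalDateTime.parse("2011-12-25T09:00:00")
  .atOffset(ZoneOffset.ofHours(-11))
  .toInstant

// 2. FROM_UTC_TIMESTAMP adds the Asia/Tokyo offset (+09:00) to the internal value,
//    so the instant becomes 2011-12-26T05:00:00Z.
val shifted = parsed.plusSeconds(9 * 3600)

// 3. If HOUR is then evaluated in UTC rather than in the session zone, the result is 5,
//    matching the observed output.
val hour = shifted.atZone(ZoneId.of("UTC")).getHour  // 5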
A long time ago I tried to improve these 2 functions in #21169, but now I can't remember all the details. These 2 functions are confusing, as Spark's timestamp is a UTC timestamp and users can't associate a time zone with it. IIUC these two functions look correct if the input is a string and we cast the result to string; otherwise these two functions do not make any sense.
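A minimal spark-shell sketch of that point (assuming a SparkSession named spark with the session time zone set to UTC, and the functions usable, e.g. on 2.4 or with the legacy flag set): the call changes the underlying instant, so the "converted" value no longer denotes the same point in time as its input.

// Session time zone set to UTC for clarity.
spark.conf.set("spark.sql.session.timeZone", "UTC")

spark.sql(
  "SELECT ts, from_utc_timestamp(ts, 'Asia/Tokyo') AS shifted, " +
  "ts = from_utc_timestamp(ts, 'Asia/Tokyo') AS same_instant " +
  "FROM VALUES (timestamp '2019-01-01 00:00:00') AS t(ts)"
).show(false)

// Expected: shifted displays as 2019-01-01 09:00:00 and same_instant is false --
// the shifted value is a different instant, i.e. different epoch microseconds.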
Heh, OK, so the steps I guessed were 'right' but in the wrong order, sort of. Yeah, there might be something else confusing here. I'm not against removing these methods, as they are also confusing at best, but there is probably more going on here indeed. I'm more inclined to support changes now that remove these ambiguities for Spark 3. It's really hard to understand what these do. Note to self for a very future version of Spark: disallow all time zones but UTC internally; force all time zone issues to input and output as much as possible. Time zones are such a source of errors.
    if (!SQLConf.get.utcTimestampFuncEnabled) {
      throw new AnalysisException(s"The $prettyName function has been disabled since Spark 3.0. " +
        s"Set ${SQLConf.UTC_TIMESTAMP_FUNC_ENABLED.key} to true to enable this function.")
    }
@cloud-fan, I don't think this is the right way to deprecate. We should at least throw a warning instead, or better, have a proper mechanism to deprecate things on the SQL side. Do we really want to add a configuration for every deprecation?
As I said, I don't think a warning log is the right way to deprecate a SQL function, as users won't see it. This is not good either, but this is the best I can think of.
Then we should establish the right way first, before we start to do it differently from what I have done so far.
Once these functions are deprecated/removed, will there be a way to shift a timestamp into another time zone? I do understand that internally timestamps are in UTC and one cannot really shift them to another time zone. However, what if I want to correct a timestamp column? For example: …

This use case arises often when dealing with ETL from a system where timestamps are not handled cleanly. How should it be addressed once these functions are gone?
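For concreteness, one hedged illustration of such a correction (my own example; the column name ts_local and the zone Europe/Berlin are assumptions): suppose the values were written as Europe/Berlin wall-clock time but ingested as if they were UTC.

import org.apache.spark.sql.functions.{col, to_utc_timestamp}

// df is assumed to have a timestamp column `ts_local` whose values are Berlin wall-clock
// times that were stored as if they were UTC. to_utc_timestamp re-interprets each value
// as a wall-clock time in the given zone and shifts it to the corresponding UTC instant.
val corrected = df.withColumn("ts_utc", to_utc_timestamp(col("ts_local"), "Europe/Berlin"))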
@adaitche You can shift timestamps by an interval:

spark-sql> select timestamp '2019-08-07 10:11:12' + INTERVAL '15' MINUTE;
2019-08-07 10:26:12
@MaxGekk Thanks for the quick response. But what if the source time zone is one with daylight saving (i.e. the offset changes twice a year), for example …
It seems you need a function which takes timestamp and zone id columns and returns the zone offset as a day-second interval. Like the code: …
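For illustration, a hedged sketch of such a helper (hypothetical, not part of Spark; the column names ts and tz, and the DataFrame df, are assumptions):

import java.time.ZoneId
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical helper: for each row, look up the zone's UTC offset at the given instant,
// in seconds. DST is handled because the offset is resolved per instant via java.time's
// zone rules.
val zoneOffsetSeconds = udf((ts: java.sql.Timestamp, zone: String) =>
  ZoneId.of(zone).getRules.getOffset(ts.toInstant).getTotalSeconds)

// Usage sketch: df has a timestamp column `ts` and a string zone-id column `tz`.
// Shifting via epoch seconds keeps the arithmetic explicit (sub-second precision is
// dropped by the long cast; fine for a sketch).
val shifted = df
  .withColumn("offset_sec", zoneOffsetSeconds(col("ts"), col("tz")))
  .withColumn("ts_shifted", (col("ts").cast("long") + col("offset_sec")).cast("timestamp"))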
You can turn on spark.sql.legacy.utcTimestampFunc.enabled for now.
@adaitche Please open a JIRA ticket for the feature.
Hey everyone, I'm actually building a data pipeline right now and I just hit this. I think the solution we went with here is pretty actively hostile to users of Spark. I'm worried that failing this function (which has been available since Spark 1.5) behind a flag, just because it has confusing semantics, is really going to hinder the adoption of Spark 3.0. I would consider two different types of users here: …

Given the above, I think this PR should be reverted until you can, at a minimum, point users to something better.
Yes, let us revert this PR. cc @MaxGekk
@marmbrus What is the use case where you need to shift UTC timestamps by zone offsets? At the moment, I see only one: when you have somehow placed non-UTC timestamps into TimestampType columns.
From the JIRA REST API I get a string that is a timestamp encoded in PST. My local clock is set to UTC. I would like to store the data in Delta as a UTC timestamp. This function seems to do what I want. Is there a better way to do this in Spark? I do not want to hard-code the offset and do interval subtraction.

Regardless of the answer to the above, I think we are being hostile to our users with the current approach. I think we should back this out, along with any other similar deprecations that are happening in Spark 3.0.

The current error messages:

The from_utc_timestamp function has been disabled since Spark 3.0. Set spark.sql.legacy.utcTimestampFunc.enabled to true to enable this function.

or

The to_utc_timestamp function has been disabled since Spark 3.0. Set spark.sql.legacy.utcTimestampFunc.enabled to true to enable this function.

These messages tell the user nothing, other than the fact that Spark developers would like to make it hard for them to upgrade their cluster. They don't even point them to the JIRA discussion (which IMO is still putting too much cognitive burden on them). Do you think any of our users are going to do anything other than blindly set the flag when they see that message? (That is what I did.) I worry that this accomplishes nothing other than frustration.
TL;DR: these functions are super confusing, and so I am sad we had to emulate them from Hive. I think I also was not in favor of actually failing them, though deprecation sounds OK. That is, I don't think these functions are particularly good; they're error-prone; but I wouldn't object to not failing them, of course.
I think a good suggestion (Michael already hinted at it) would be to either not fail, or fail with a helpful message telling the user what to do.
The code freeze of 3.0 is close. So far, we do not have a user-friendly way to replace these deprecated functions. Let us first revert it and then discuss what the next step is: #27351
I guess if the timestamp strings are in PST, the strings contain time zone info. Looking at https://developer.atlassian.com/cloud/jira/platform/jira-expressions/#date-and-time, they do.
Spark's TimestampFormatter takes time zone info into account; see spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala, lines 59 to 62 at commit c2734ab.
If input strings don't contain time zone info, you can either: …

For me, it is still not clear why you decided to even look at these functions.
Regarding the first approach: with a zone id string, you have something like …

Regarding the second approach: in my scenario this is not possible, unfortunately. The time zone of the source datetime varies per row. The only working option for this scenario I am aware of is using from_utc_timestamp.

I definitely agree that the semantics of these functions are difficult to understand, which can lead to mistakes. It took me some time to understand what they are actually doing. But unfortunately, I am not aware of an alternative.
You can say to to_timestamp that the input string contains a time zone:

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val df = Seq("2019-01-24 11:30:00.123 PST").toDF("ts_str")
df: org.apache.spark.sql.DataFrame = [ts_str: string]

scala> df.select(to_timestamp($"ts_str", "uuuu-MM-dd HH:mm:ss.SSS z")).show(false)
+---------------------------------------------------+
|to_timestamp(`ts_str`, 'uuuu-MM-dd HH:mm:ss.SSS z')|
+---------------------------------------------------+
|2019-01-24 22:30:00.123                            |
+---------------------------------------------------+
You can append time zone names per row:

scala> val df = Seq(
  ("2019-01-24 11:30:00.123", "America/Los_Angeles"),
  ("2020-01-01 01:30:00.123", "PST")).toDF("ts_str", "tz_name")
df: org.apache.spark.sql.DataFrame = [ts_str: string, tz_name: string]

scala> df.select(to_timestamp(concat_ws(" ", $"ts_str", $"tz_name"), "uuuu-MM-dd HH:mm:ss.SSS z").as("timestamp")).show(false)
+-----------------------+
|timestamp              |
+-----------------------+
|2019-01-24 22:30:00.123|
|2020-01-01 12:30:00.123|
+-----------------------+
@MaxGekk, so are all cases of from_utc_timestamp covered by this? I think that satisfies what @marmbrus pointed out at #24195 (comment), if I didn't misread.
I just tried it in PySpark (Spark version 2.4.4). Weirdly, it doesn't work for … Here is my code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp, concat_ws

spark = (SparkSession
    .builder
    .config("spark.sql.session.timeZone", "UTC")
    .getOrCreate()
)

df = spark.createDataFrame(
    [
        ("2019-01-24 11:30:00.123", "America/Los_Angeles"),
        ("2020-01-01 01:30:00.123", "PST"),
    ],
    ["ts_str", "tz_name"]
)

df = df.withColumn("ts_str_tz", concat_ws(" ", "ts_str", "tz_name"))
df = df.withColumn("dt", to_timestamp("ts_str_tz", "yyyy-MM-dd HH:mm:ss.SSS z"))

df.show(10, False)

This prints: …
Maybe 2.4 has a bug; I ran the example on recent master. Please file a JIRA ticket for 2.4.

@adaitche I ran your code on master in PySpark. It works fine: …
@MaxGekk Thanks for pointing out how to parse the datetime strings properly. I had tried it before by appending the time zone with a '+', which worked for offsets but not for time zone names. Once this bug is fixed, this approach will be an alternative to from_utc_timestamp.
@MaxGekk thanks for showing the examples on parsing timestamps with time zones! I think I will change my code to use this mechanism instead.

That said, I remain strongly -1 on the idea that we should require users to set a flag to allow their code that was working to continue to work. I do not think that this is how the Spark project operates. Look at the RDD API. There is a lot of cruft in there. That, however, is better than the alternative: breaking users' programs or forcing them to stay on old versions.

I'm less strongly -1 on the deprecation, but I do think that it must include pointers to the correct way to accomplish what the user is trying to do. IMO, this string concatenation suggestion is more of a hack than from_utc_timestamp itself.

Given this, and the closeness of Spark 3.0, I still think the correct course of action is to back out this change and come back with a holistic plan for better timestamp handling in Spark.

Separately but relatedly, I'm not sure how you figured out that magic incantation for to_timestamp.
Hi, All. According to the above discussion, I'll merge the reverting PR (#27351).
+1 for reverting for now.
I think the to_timestamp conversion function should really have a per-function override of the default time zone. Appending strings is a hack, and it doesn't work if the input format changes or sometimes has a time zone. Setting a config doesn't work if you combine multiple input sources that have different default time zones (e.g. ETL data from multiple silos across the world), or if later parts of your pipeline rely on other functions that are also affected by this config.
I suppose deprecating functions without a clear alternative is not good. I hope the option "spark.sql.legacy.utcTimestampFunc.enabled" will keep working forever.
@MaxGekk How are we on the rebase function ("get me the same date + hands of the clock in a different timezone") that we talked about at some point?
What changes were proposed in this pull request?
In the PR, I propose to deprecate the from_utc_timestamp() and to_utc_timestamp() functions and disable them by default. The functions can be enabled back via the SQL config spark.sql.legacy.utcTimestampFunc.enabled. By default, any call of the functions throws an analysis exception.

One of the reasons for deprecation is that the functions violate the semantics of TimestampType, which is the number of microseconds since the epoch in the UTC time zone. Shifting microseconds since the epoch by a time zone offset doesn't make sense, because the result no longer represents microseconds since the epoch in the UTC time zone and cannot be considered a TimestampType value.
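For reference, a minimal sketch of re-enabling the functions under the proposed flag (assuming a SparkSession named spark and that the config can be set at runtime):

// Re-enable the deprecated functions for an existing workload:
spark.conf.set("spark.sql.legacy.utcTimestampFunc.enabled", "true")

// With the flag on, calls no longer throw AnalysisException:
spark.sql("SELECT from_utc_timestamp(timestamp '2019-01-01 00:00:00', 'Asia/Tokyo')").show(false)
// Expected with a UTC session time zone: 2019-01-01 09:00:00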
How was this patch tested?
The changes were tested by DateExpressionsSuite and DateFunctionsSuite.