-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-31469][SQL] Make extract interval field ANSI compliance #28242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
cc @cloud-fan @dongjoon-hyun @MaxGekk @HyukjinKwon thanks |
We should not assume that 1 month = 30 days, so agree to not support For millennium/century/decade, looks not useful and they were added for pgsql compatibility. We can add them later if there are user requests. LGTM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the standard says what's second
is? Is it just seconds or seconds + its fraction?
According to the standard, for |
Test build #121407 has finished for PR 28242 at commit
|
thanks, merging to master/3.0! |
Test build #121406 has finished for PR 28242 at commit
|
### What changes were proposed in this pull request? Currently, we can extract `millennium/century/decade/year/quarter/month/week/day/hour/minute/second(with fractions)//millisecond/microseconds` and `epoch` from interval values While getting the `millennium/century/decade/year`, it means how many the interval `months` part can be converted to that unit-value. The content of `millennium/century/decade` will overlap `year` and each other. While getting `month/day` and so on, it means the integral remainder of the previous unit. Here all the units including `year` are individual. So while extracting `year`, `month`, `day`, `hour`, `minute`, `second`, which are ANSI primary datetime units, the semantic is `extracting`, but others might refer to `transforming`. While getting epoch we have treat month as 30 days which varies the natural Calendar rules we use. To avoid ambiguity, I suggest we should only support those extract field defined ANSI with their abbreviations. ### Why are the changes needed? Extracting `millennium`, `century` etc does not obey the meaning of extracting, and they are not so useful and worth maintaining. The `extract` is ANSI standard expression and `date_part` is its pg-specific alias function. The current support extract-fields are fully bought from PostgreSQL. With a look at other systems like Presto/Hive, they don't support those ambiguous fields too. e.g. Hive 2.2.x also take it from PostgreSQL but without introducing those ambiguous fields https://issues.apache.org/jira/secure/attachment/12828349/HIVE-14579 e.g. presto ```sql presto> select extract(quater from interval '10-0' year to month); Query 20200417_094723_00020_m8xq4 failed: line 1:8: Invalid EXTRACT field: quater select extract(quater from interval '10-0' year to month) presto> select extract(decade from interval '10-0' year to month); Query 20200417_094737_00021_m8xq4 failed: line 1:8: Invalid EXTRACT field: decade select extract(decade from interval '10-0' year to month) ``` ### Does this PR introduce any user-facing change? Yes, as we already have previews versions, this PR will remove support for extracting `millennium/century/decade/quarter/week/millisecond/microseconds` and `epoch` from intervals with `date_part` function ### How was this patch tested? rm some used tests Closes #28242 from yaooqinn/SPARK-31469. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 697083c) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
Currently, we can extract
millennium/century/decade/year/quarter/month/week/day/hour/minute/second(with fractions)//millisecond/microseconds
andepoch
from interval valuesWhile getting the
millennium/century/decade/year
, it means how many the intervalmonths
part can be converted to that unit-value. The content ofmillennium/century/decade
will overlapyear
and each other.While getting
month/day
and so on, it means the integral remainder of the previous unit. Here all the units includingyear
are individual.So while extracting
year
,month
,day
,hour
,minute
,second
, which are ANSI primary datetime units, the semantic isextracting
, but others might refer totransforming
.While getting epoch we have treat month as 30 days which varies the natural Calendar rules we use.
To avoid ambiguity, I suggest we should only support those extract field defined ANSI with their abbreviations.
Why are the changes needed?
Extracting
millennium
,century
etc does not obey the meaning of extracting, and they are not so useful and worth maintaining.The
extract
is ANSI standard expression anddate_part
is its pg-specific alias function. The current support extract-fields are fully bought from PostgreSQL.With a look at other systems like Presto/Hive, they don't support those ambiguous fields too.
e.g. Hive 2.2.x also take it from PostgreSQL but without introducing those ambiguous fields https://issues.apache.org/jira/secure/attachment/12828349/HIVE-14579
e.g. presto
Does this PR introduce any user-facing change?
Yes, as we already have previews versions, this PR will remove support for extracting
millennium/century/decade/quarter/week/millisecond/microseconds
andepoch
from intervals withdate_part
functionHow was this patch tested?
rm some used tests