Skip to content

[SPARK-31469][SQL] Make extract interval field ANSI compliance #28242

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed

[SPARK-31469][SQL] Make extract interval field ANSI compliance #28242

wants to merge 2 commits into from

Conversation

yaooqinn
Copy link
Member

@yaooqinn yaooqinn commented Apr 17, 2020

What changes were proposed in this pull request?

Currently, we can extract millennium/century/decade/year/quarter/month/week/day/hour/minute/second(with fractions)//millisecond/microseconds and epoch from interval values

While getting the millennium/century/decade/year, it means how many the interval months part can be converted to that unit-value. The content of millennium/century/decade will overlap year and each other.

While getting month/day and so on, it means the integral remainder of the previous unit. Here all the units including year are individual.

So while extracting year, month, day, hour, minute, second, which are ANSI primary datetime units, the semantic is extracting, but others might refer to transforming.

While getting epoch we have treat month as 30 days which varies the natural Calendar rules we use.

To avoid ambiguity, I suggest we should only support those extract field defined ANSI with their abbreviations.

Why are the changes needed?

Extracting millennium, century etc does not obey the meaning of extracting, and they are not so useful and worth maintaining.

The extract is ANSI standard expression and date_part is its pg-specific alias function. The current support extract-fields are fully bought from PostgreSQL.

With a look at other systems like Presto/Hive, they don't support those ambiguous fields too.

e.g. Hive 2.2.x also take it from PostgreSQL but without introducing those ambiguous fields https://issues.apache.org/jira/secure/attachment/12828349/HIVE-14579

e.g. presto

presto> select extract(quater from interval '10-0' year to month);
Query 20200417_094723_00020_m8xq4 failed: line 1:8: Invalid EXTRACT field: quater
select extract(quater from interval '10-0' year to month)

presto> select extract(decade from interval '10-0' year to month);
Query 20200417_094737_00021_m8xq4 failed: line 1:8: Invalid EXTRACT field: decade
select extract(decade from interval '10-0' year to month)

Does this PR introduce any user-facing change?

Yes, as we already have previews versions, this PR will remove support for extracting millennium/century/decade/quarter/week/millisecond/microseconds and epoch from intervals with date_part function

How was this patch tested?

rm some used tests

@yaooqinn
Copy link
Member Author

@cloud-fan
Copy link
Contributor

We should not assume that 1 month = 30 days, so agree to not support epoch.

For millennium/century/decade, looks not useful and they were added for pgsql compatibility. We can add them later if there are user requests.

LGTM

Copy link
Member

@MaxGekk MaxGekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the standard says what's second is? Is it just seconds or seconds + its fraction?

@yaooqinn
Copy link
Member Author

According to the standard, for SECOND, the declared type of the result is an implementation-defined exact numeric type with scale not less than the specified or implied interval fractional seconds precision

@SparkQA
Copy link

SparkQA commented Apr 17, 2020

Test build #121407 has finished for PR 28242 at commit 8af2def.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master/3.0!

@SparkQA
Copy link

SparkQA commented Apr 17, 2020

Test build #121406 has finished for PR 28242 at commit a88dedf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan closed this in 697083c Apr 17, 2020
cloud-fan pushed a commit that referenced this pull request Apr 17, 2020
### What changes were proposed in this pull request?

Currently, we can extract `millennium/century/decade/year/quarter/month/week/day/hour/minute/second(with fractions)//millisecond/microseconds` and `epoch` from interval values

While getting the `millennium/century/decade/year`, it means how many the interval `months` part can be converted to that unit-value. The content of `millennium/century/decade` will overlap `year` and each other.

While getting `month/day` and so on, it means the integral remainder of the previous unit. Here all the units including `year` are individual.

So while extracting `year`, `month`, `day`, `hour`, `minute`, `second`, which are ANSI primary datetime units, the semantic is `extracting`, but others might refer to `transforming`.

While getting epoch we have treat month as 30 days which varies the natural Calendar rules we use.

To avoid ambiguity, I suggest we should only support those extract field defined ANSI with their abbreviations.

### Why are the changes needed?

Extracting `millennium`, `century` etc does not obey the meaning of extracting, and they are not so useful and worth maintaining.

The `extract` is ANSI standard expression and `date_part` is its pg-specific alias function. The current support extract-fields are fully bought from PostgreSQL.

With a look at other systems like Presto/Hive, they don't support those ambiguous fields too.

e.g. Hive 2.2.x also take it from PostgreSQL but without introducing those ambiguous fields https://issues.apache.org/jira/secure/attachment/12828349/HIVE-14579

e.g. presto

```sql
presto> select extract(quater from interval '10-0' year to month);
Query 20200417_094723_00020_m8xq4 failed: line 1:8: Invalid EXTRACT field: quater
select extract(quater from interval '10-0' year to month)

presto> select extract(decade from interval '10-0' year to month);
Query 20200417_094737_00021_m8xq4 failed: line 1:8: Invalid EXTRACT field: decade
select extract(decade from interval '10-0' year to month)

```

### Does this PR introduce any user-facing change?

Yes, as we already have previews versions, this PR will remove support for extracting `millennium/century/decade/quarter/week/millisecond/microseconds` and `epoch` from intervals with `date_part` function

### How was this patch tested?

rm some used tests

Closes #28242 from yaooqinn/SPARK-31469.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 697083c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants