Skip to content

[SPARK-28395][SQL] Division operator support integral division #25158

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

[SPARK-28395][SQL] Division operator support integral division #25158

wants to merge 1 commit into from

Conversation

wangyum
Copy link
Member

@wangyum wangyum commented Jul 15, 2019

What changes were proposed in this pull request?

PostgreSQL, Teradata, SQL Server, DB2 and Presto perform integral division with the / operator.
But Oracle, Vertica, Hive, MySQL and MariaDB perform fractional division with the / operator.

This pr add a flag(spark.sql.function.preferIntegralDivision) to control whether to use integral division with the / operator.

Examples:

PostgreSQL:

postgres=# select substr(version(), 0, 16), cast(10 as int) / cast(3 as int), cast(10.1 as float8) / cast(3 as int), cast(10 as int) / cast(3.1 as float8), cast(10.1 as float8)/cast(3.1 as float8);
     substr      | ?column? |     ?column?     |    ?column?     |     ?column?
-----------------+----------+------------------+-----------------+------------------
 PostgreSQL 11.3 |        3 | 3.36666666666667 | 3.2258064516129 | 3.25806451612903
(1 row)

SQL Server:

1> select cast(10 as int) / cast(3 as int), cast(10.1 as float) / cast(3 as int), cast(10 as int) / cast(3.1 as float), cast(10.1 as float)/cast(3.1 as float);
2> go

----------- ------------------------ ------------------------ ------------------------
          3       3.3666666666666667        3.225806451612903        3.258064516129032

(1 rows affected)

DB2:

[db2inst1@2f3c821d36b7 ~]$ db2 "select cast(10 as int) / cast(3 as int), cast(10.1 as double) / cast(3 as int), cast(10 as int) / cast(3.1 as double), cast(10.1 as double)/cast(3.1 as double) from table (sysproc.env_get_inst_info())"

1           2                        3                        4
----------- ------------------------ ------------------------ ------------------------
          3   +3.36666666666667E+000   +3.22580645161290E+000   +3.25806451612903E+000

  1 record(s) selected.

Presto:

presto> select cast(10 as int) / cast(3 as int), cast(10.1 as double) / cast(3 as int), cast(10 as int) / cast(3.1 as double), cast(10.1 as double)/cast(3.1 as double);
 _col0 |       _col1        |       _col2       |       _col3
-------+--------------------+-------------------+-------------------
     3 | 3.3666666666666667 | 3.225806451612903 | 3.258064516129032
(1 row)

Teradata:
image

Oracle:

SQL> select 10 / 3 from dual;

      10/3
----------
3.33333333

Vertica

dbadmin=> select version(), cast(10 as int) / cast(3 as int), cast(10.1 as float8) / cast(3 as int), cast(10 as int) / cast(3.1 as float8), cast(10.1 as float8)/cast(3.1 as float8);
              version               |       ?column?       |     ?column?     |    ?column?     |     ?column?
------------------------------------+----------------------+------------------+-----------------+------------------
 Vertica Analytic Database v9.1.1-0 | 3.333333333333333333 | 3.36666666666667 | 3.2258064516129 | 3.25806451612903
(1 row)

Hive:

hive> select cast(10 as int) / cast(3 as int), cast(10.1 as double) / cast(3 as int), cast(10 as int) / cast(3.1 as double), cast(10.1 as double)/cast(3.1 as double);
OK
3.3333333333333335	3.3666666666666667	3.225806451612903	3.258064516129032
Time taken: 0.143 seconds, Fetched: 1 row(s)

MariaDB:

MariaDB [(none)]> select version(), cast(10 as int) / cast(3 as int), cast(10.1 as double) / cast(3 as int), cast(10 as int) / cast(3.1 as double), cast(10.1 as double)/cast(3.1 as double);
+--------------------------------------+----------------------------------+---------------------------------------+---------------------------------------+------------------------------------------+
| version()                            | cast(10 as int) / cast(3 as int) | cast(10.1 as double) / cast(3 as int) | cast(10 as int) / cast(3.1 as double) | cast(10.1 as double)/cast(3.1 as double) |
+--------------------------------------+----------------------------------+---------------------------------------+---------------------------------------+------------------------------------------+
| 10.4.6-MariaDB-1:10.4.6+maria~bionic |                           3.3333 |                    3.3666666666666667 |                     3.225806451612903 |                        3.258064516129032 |
+--------------------------------------+----------------------------------+---------------------------------------+---------------------------------------+------------------------------------------+
1 row in set (0.000 sec)

MySQL:

mysql>  select version(), 10 / 3, 10 / 3.1, 10.1 / 3, 10.1 / 3.1;
+-----------+--------+----------+----------+------------+
| version() | 10 / 3 | 10 / 3.1 | 10.1 / 3 | 10.1 / 3.1 |
+-----------+--------+----------+----------+------------+
| 8.0.16    | 3.3333 |   3.2258 |  3.36667 |    3.25806 |
+-----------+--------+----------+----------+------------+
1 row in set (0.00 sec)

How was this patch tested?

unit tests

@SparkQA
Copy link

SparkQA commented Jul 15, 2019

Test build #107673 has finished for PR 25158 at commit c1990b5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Division(conf: SQLConf) extends TypeCoercionRule

@wangyum
Copy link
Member Author

wangyum commented Jul 15, 2019

retest this please

@SparkQA
Copy link

SparkQA commented Jul 15, 2019

Test build #107677 has finished for PR 25158 at commit c1990b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Division(conf: SQLConf) extends TypeCoercionRule

@wangyum
Copy link
Member Author

wangyum commented Jul 16, 2019

Benchmark and benchmark result:

cat <<EOF > sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/SPARK_28395_Benchmark.scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution.benchmark

import org.apache.spark.benchmark.Benchmark
import org.apache.spark.sql.internal.SQLConf

/**
 * To run this benchmark:
 *   build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.SPARK_28395_Benchmark"
 */
object SPARK_28395_Benchmark extends SqlBasedBenchmark {

  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {

    val title = "Benchmark SPARK-28395"
    runBenchmark(title) {
      withTempPath { dir =>
        val N = 6000000
        val df = spark.range(N)

        df.selectExpr("id as id1", "cast(id % 999999 as bigint) as id2")
          .write.mode("overwrite").parquet(dir.getCanonicalPath)

        val benchmark = new Benchmark(title, N, minNumIters = 5, output = output)
        Seq(false, true).foreach { integralDivision =>
          val name = if (integralDivision) "Integral division" else "Fractional division"
          benchmark.addCase(name) { _ =>
            withSQLConf(SQLConf.PREFER_INTEGRAL_DIVISION.key -> integralDivision.toString) {
              spark.read.parquet(dir.getCanonicalPath).selectExpr("id1 / id2").collect()
            }
          }
        }
        benchmark.run()
      }
    }
  }
}

EOF
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_211-b12 on Linux 3.10.0-957.1.3.el7.x86_64
[info] Intel Core Processor (Broadwell)
[info] Benchmark SPARK-28395:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] Fractional division                                3103           5975         877          1.9         517.2       1.0X
[info] Integral division                                  2084           2293         195          2.9         347.3       1.5X

@gatorsmile
Copy link
Member

cc @cloud-fan @gengliangwang

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 6926849 Jul 16, 2019
@wangyum wangyum deleted the SPARK-28395 branch July 16, 2019 07:56
vinodkc pushed a commit to vinodkc/spark that referenced this pull request Jul 18, 2019
## What changes were proposed in this pull request?

PostgreSQL, Teradata, SQL Server, DB2 and Presto perform integral division with the `/` operator.
But Oracle, Vertica, Hive, MySQL and MariaDB perform fractional division with the `/` operator.

This pr add a flag(`spark.sql.function.preferIntegralDivision`) to control whether to use integral division with the `/` operator.

Examples:

**PostgreSQL**:
```sql
postgres=# select substr(version(), 0, 16), cast(10 as int) / cast(3 as int), cast(10.1 as float8) / cast(3 as int), cast(10 as int) / cast(3.1 as float8), cast(10.1 as float8)/cast(3.1 as float8);
     substr      | ?column? |     ?column?     |    ?column?     |     ?column?
-----------------+----------+------------------+-----------------+------------------
 PostgreSQL 11.3 |        3 | 3.36666666666667 | 3.2258064516129 | 3.25806451612903
(1 row)
```
**SQL Server**:
```sql
1> select cast(10 as int) / cast(3 as int), cast(10.1 as float) / cast(3 as int), cast(10 as int) / cast(3.1 as float), cast(10.1 as float)/cast(3.1 as float);
2> go

----------- ------------------------ ------------------------ ------------------------
          3       3.3666666666666667        3.225806451612903        3.258064516129032

(1 rows affected)
```
**DB2**:
```sql
[db2inst12f3c821d36b7 ~]$ db2 "select cast(10 as int) / cast(3 as int), cast(10.1 as double) / cast(3 as int), cast(10 as int) / cast(3.1 as double), cast(10.1 as double)/cast(3.1 as double) from table (sysproc.env_get_inst_info())"

1           2                        3                        4
----------- ------------------------ ------------------------ ------------------------
          3   +3.36666666666667E+000   +3.22580645161290E+000   +3.25806451612903E+000

  1 record(s) selected.
```
**Presto**:
```sql
presto> select cast(10 as int) / cast(3 as int), cast(10.1 as double) / cast(3 as int), cast(10 as int) / cast(3.1 as double), cast(10.1 as double)/cast(3.1 as double);
 _col0 |       _col1        |       _col2       |       _col3
-------+--------------------+-------------------+-------------------
     3 | 3.3666666666666667 | 3.225806451612903 | 3.258064516129032
(1 row)
```
**Teradata**:
![image](https://user-images.githubusercontent.com/5399861/61200701-e97d5380-a714-11e9-9a1d-57fd99d38c8d.png)

**Oracle**:
```sql
SQL> select 10 / 3 from dual;

      10/3
----------
3.33333333
```
**Vertica**
```sql
dbadmin=> select version(), cast(10 as int) / cast(3 as int), cast(10.1 as float8) / cast(3 as int), cast(10 as int) / cast(3.1 as float8), cast(10.1 as float8)/cast(3.1 as float8);
              version               |       ?column?       |     ?column?     |    ?column?     |     ?column?
------------------------------------+----------------------+------------------+-----------------+------------------
 Vertica Analytic Database v9.1.1-0 | 3.333333333333333333 | 3.36666666666667 | 3.2258064516129 | 3.25806451612903
(1 row)
```
**Hive**:
```sql
hive> select cast(10 as int) / cast(3 as int), cast(10.1 as double) / cast(3 as int), cast(10 as int) / cast(3.1 as double), cast(10.1 as double)/cast(3.1 as double);
OK
3.3333333333333335	3.3666666666666667	3.225806451612903	3.258064516129032
Time taken: 0.143 seconds, Fetched: 1 row(s)
```
**MariaDB**:
```sql
MariaDB [(none)]> select version(), cast(10 as int) / cast(3 as int), cast(10.1 as double) / cast(3 as int), cast(10 as int) / cast(3.1 as double), cast(10.1 as double)/cast(3.1 as double);
+--------------------------------------+----------------------------------+---------------------------------------+---------------------------------------+------------------------------------------+
| version()                            | cast(10 as int) / cast(3 as int) | cast(10.1 as double) / cast(3 as int) | cast(10 as int) / cast(3.1 as double) | cast(10.1 as double)/cast(3.1 as double) |
+--------------------------------------+----------------------------------+---------------------------------------+---------------------------------------+------------------------------------------+
| 10.4.6-MariaDB-1:10.4.6+maria~bionic |                           3.3333 |                    3.3666666666666667 |                     3.225806451612903 |                        3.258064516129032 |
+--------------------------------------+----------------------------------+---------------------------------------+---------------------------------------+------------------------------------------+
1 row in set (0.000 sec)
```
**MySQL**:
```sql
mysql>  select version(), 10 / 3, 10 / 3.1, 10.1 / 3, 10.1 / 3.1;
+-----------+--------+----------+----------+------------+
| version() | 10 / 3 | 10 / 3.1 | 10.1 / 3 | 10.1 / 3.1 |
+-----------+--------+----------+----------+------------+
| 8.0.16    | 3.3333 |   3.2258 |  3.36667 |    3.25806 |
+-----------+--------+----------+----------+------------+
1 row in set (0.00 sec)
```
## How was this patch tested?

unit tests

Closes apache#25158 from wangyum/SPARK-28395.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@@ -1524,6 +1524,12 @@ object SQLConf {
.booleanConf
.createWithDefault(false)

val PREFER_INTEGRAL_DIVISION = buildConf("spark.sql.function.preferIntegralDivision")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a second thought, I think we should not add a new behavior that is not SQL standard. If we only need it in tests, shall we make it clear in the config name? and make it an internal config.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that SQL standard does not explain this case(integral / integral ), different databases have different implementations.
image

Copy link
Contributor

@cloud-fan cloud-fan Aug 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK so I'd like to treat it as an internal config that is only used in the ported pgsql test cases. @wangyum can you send a follow-up PR? thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

zzcclp pushed a commit to zzcclp/spark that referenced this pull request Aug 8, 2019
…ivision internal

## What changes were proposed in this pull request?

This PR makes `spark.sql.function.preferIntegralDivision` to internal configuration because it is only used for PostgreSQL test cases.

More details:
apache#25158 (comment)

## How was this patch tested?

N/A

Closes apache#25376 from wangyum/SPARK-28395-2.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
cloud-fan pushed a commit that referenced this pull request Sep 26, 2019
### What changes were proposed in this pull request?

After #25158 and #25458, SQL features of PostgreSQL are introduced into Spark. AFAIK, both features are implementation-defined behaviors, which are not specified in ANSI SQL.
In such a case, this proposal is to add a configuration `spark.sql.dialect` for choosing a database dialect.
After this PR, Spark supports two database dialects, `Spark` and `PostgreSQL`. With `PostgreSQL` dialect, Spark will:
1. perform integral division with the / operator if both sides are integral types;
2. accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type.

### Why are the changes needed?

Unify the external database dialect with one configuration, instead of small flags.

### Does this PR introduce any user-facing change?

A new configuration `spark.sql.dialect` for choosing a database dialect.

### How was this patch tested?

Existing tests.

Closes #25697 from gengliangwang/dialect.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Dec 10, 2019
### What changes were proposed in this pull request?
Reprocess all PostgreSQL dialect related PRs, listing in order:

- #25158: PostgreSQL integral division support [revert]
- #25170: UT changes for the integral division support [revert]
- #25458: Accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type. [revert]
- #25697: Combine below 2 feature tags into "spark.sql.dialect" [revert]
- #26112: Date substraction support [keep the ANSI-compliant part]
- #26444: Rename config "spark.sql.ansi.enabled" to "spark.sql.dialect.spark.ansi.enabled" [revert]
- #26463: Cast to boolean support for PostgreSQL dialect [revert]
- #26584: Make the behavior of Postgre dialect independent of ansi mode config [keep the ANSI-compliant part]

### Why are the changes needed?
As the discussion in http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-PostgreSQL-dialect-td28417.html, we need to remove PostgreSQL dialect form code base for several reasons:
1. The current approach makes the codebase complicated and hard to maintain.
2. Fully migrating PostgreSQL workloads to Spark SQL is not our focus for now.

### Does this PR introduce any user-facing change?
Yes, the config `spark.sql.dialect` will be removed.

### How was this patch tested?
Existing UT.

Closes #26763 from xuanyuanking/SPARK-30125.

Lead-authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants