[SPARK-28395][SQL] Division operator support integral division #25158
Conversation
Test build #107673 has finished for PR 25158 at commit
retest this please
Test build #107677 has finished for PR 25158 at commit
Benchmark and benchmark result:

```shell
cat <<EOF > sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/SPARK_28395_Benchmark.scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution.benchmark

import org.apache.spark.benchmark.Benchmark
import org.apache.spark.sql.internal.SQLConf

/**
 * To run this benchmark:
 *   build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.SPARK_28395_Benchmark"
 */
object SPARK_28395_Benchmark extends SqlBasedBenchmark {
  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    val title = "Benchmark SPARK-28395"
    runBenchmark(title) {
      withTempPath { dir =>
        val N = 6000000
        val df = spark.range(N)
        df.selectExpr("id as id1", "cast(id % 999999 as bigint) as id2")
          .write.mode("overwrite").parquet(dir.getCanonicalPath)
        val benchmark = new Benchmark(title, N, minNumIters = 5, output = output)
        Seq(false, true).foreach { integralDivision =>
          val name = if (integralDivision) "Integral division" else "Fractional division"
          benchmark.addCase(name) { _ =>
            withSQLConf(SQLConf.PREFER_INTEGRAL_DIVISION.key -> integralDivision.toString) {
              spark.read.parquet(dir.getCanonicalPath).selectExpr("id1 / id2").collect()
            }
          }
        }
        benchmark.run()
      }
    }
  }
}
EOF
```
thanks, merging to master!
## What changes were proposed in this pull request?

PostgreSQL, Teradata, SQL Server, DB2 and Presto perform integral division with the `/` operator, but Oracle, Vertica, Hive, MySQL and MariaDB perform fractional division with the `/` operator. This PR adds a flag (`spark.sql.function.preferIntegralDivision`) to control whether to use integral division with the `/` operator.

Examples:

**PostgreSQL**:
```sql
postgres=# select substr(version(), 0, 16), cast(10 as int) / cast(3 as int), cast(10.1 as float8) / cast(3 as int), cast(10 as int) / cast(3.1 as float8), cast(10.1 as float8)/cast(3.1 as float8);
     substr      | ?column? |     ?column?     |    ?column?     |     ?column?
-----------------+----------+------------------+-----------------+------------------
 PostgreSQL 11.3 |        3 | 3.36666666666667 | 3.2258064516129 | 3.25806451612903
(1 row)
```

**SQL Server**:
```sql
1> select cast(10 as int) / cast(3 as int), cast(10.1 as float) / cast(3 as int), cast(10 as int) / cast(3.1 as float), cast(10.1 as float)/cast(3.1 as float);
2> go

----------- ------------------------ ------------------------ ------------------------
          3       3.3666666666666667        3.225806451612903        3.258064516129032

(1 rows affected)
```

**DB2**:
```sql
[db2inst1@2f3c821d36b7 ~]$ db2 "select cast(10 as int) / cast(3 as int), cast(10.1 as double) / cast(3 as int), cast(10 as int) / cast(3.1 as double), cast(10.1 as double)/cast(3.1 as double) from table (sysproc.env_get_inst_info())"

1           2                        3                        4
----------- ------------------------ ------------------------ ------------------------
          3   +3.36666666666667E+000   +3.22580645161290E+000   +3.25806451612903E+000

  1 record(s) selected.
```

**Presto**:
```sql
presto> select cast(10 as int) / cast(3 as int), cast(10.1 as double) / cast(3 as int), cast(10 as int) / cast(3.1 as double), cast(10.1 as double)/cast(3.1 as double);
 _col0 |       _col1        |       _col2       |       _col3
-------+--------------------+-------------------+-------------------
     3 | 3.3666666666666667 | 3.225806451612903 | 3.258064516129032
(1 row)
```

**Teradata**: (screenshot)

**Oracle**:
```sql
SQL> select 10 / 3 from dual;

      10/3
----------
3.33333333
```

**Vertica**:
```sql
dbadmin=> select version(), cast(10 as int) / cast(3 as int), cast(10.1 as float8) / cast(3 as int), cast(10 as int) / cast(3.1 as float8), cast(10.1 as float8)/cast(3.1 as float8);
              version               |       ?column?       |     ?column?     |    ?column?     |     ?column?
------------------------------------+----------------------+------------------+-----------------+------------------
 Vertica Analytic Database v9.1.1-0 | 3.333333333333333333 | 3.36666666666667 | 3.2258064516129 | 3.25806451612903
(1 row)
```

**Hive**:
```sql
hive> select cast(10 as int) / cast(3 as int), cast(10.1 as double) / cast(3 as int), cast(10 as int) / cast(3.1 as double), cast(10.1 as double)/cast(3.1 as double);
OK
3.3333333333333335	3.3666666666666667	3.225806451612903	3.258064516129032
Time taken: 0.143 seconds, Fetched: 1 row(s)
```

**MariaDB**:
```sql
MariaDB [(none)]> select version(), cast(10 as int) / cast(3 as int), cast(10.1 as double) / cast(3 as int), cast(10 as int) / cast(3.1 as double), cast(10.1 as double)/cast(3.1 as double);
+--------------------------------------+----------------------------------+---------------------------------------+---------------------------------------+------------------------------------------+
| version()                            | cast(10 as int) / cast(3 as int) | cast(10.1 as double) / cast(3 as int) | cast(10 as int) / cast(3.1 as double) | cast(10.1 as double)/cast(3.1 as double) |
+--------------------------------------+----------------------------------+---------------------------------------+---------------------------------------+------------------------------------------+
| 10.4.6-MariaDB-1:10.4.6+maria~bionic |                           3.3333 |                    3.3666666666666667 |                     3.225806451612903 |                        3.258064516129032 |
+--------------------------------------+----------------------------------+---------------------------------------+---------------------------------------+------------------------------------------+
1 row in set (0.000 sec)
```

**MySQL**:
```sql
mysql> select version(), 10 / 3, 10 / 3.1, 10.1 / 3, 10.1 / 3.1;
+-----------+--------+----------+----------+------------+
| version() | 10 / 3 | 10 / 3.1 | 10.1 / 3 | 10.1 / 3.1 |
+-----------+--------+----------+----------+------------+
| 8.0.16    | 3.3333 |   3.2258 |  3.36667 |    3.25806 |
+-----------+--------+----------+----------+------------+
1 row in set (0.00 sec)
```

## How was this patch tested?

unit tests

Closes apache#25158 from wangyum/SPARK-28395.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
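For illustration, a minimal sketch (not from the PR) of how the flag changes the result of `/` on integral operands in a Spark shell; the printed results shown in the comments are assumptions based on the description above:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Default: fractional division, the result is a double.
spark.conf.set("spark.sql.function.preferIntegralDivision", "false")
spark.sql("SELECT 10 / 3").collect()  // Array([3.3333333333333335])

// Flag enabled: `/` performs integral division when both sides are integral.
spark.conf.set("spark.sql.function.preferIntegralDivision", "true")
spark.sql("SELECT 10 / 3").collect()  // Array([3])
```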
```diff
@@ -1524,6 +1524,12 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)

+  val PREFER_INTEGRAL_DIVISION = buildConf("spark.sql.function.preferIntegralDivision")
```
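For context, the surrounding definition in `SQLConf` plausibly looks like the sketch below; the `.doc()` text is illustrative, not quoted from the source:

```scala
val PREFER_INTEGRAL_DIVISION = buildConf("spark.sql.function.preferIntegralDivision")
  .doc("When true, the `/` operator performs integral division when both operands are " +
    "integral types, following PostgreSQL, Teradata, SQL Server, DB2 and Presto.")
  .booleanConf
  .createWithDefault(false)
```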
On second thought, I think we should not add new behavior that is not SQL standard. If we only need it in tests, shall we make that clear in the config name and make it an internal config?
OK so I'd like to treat it as an internal config that is only used in the ported pgsql test cases. @wangyum can you send a follow-up PR? thanks!
OK
…ivision internal

## What changes were proposed in this pull request?

This PR makes `spark.sql.function.preferIntegralDivision` an internal configuration because it is only used for PostgreSQL test cases. More details: apache#25158 (comment)

## How was this patch tested?

N/A

Closes apache#25376 from wangyum/SPARK-28395-2.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
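In Spark's `ConfigBuilder` API, an internal config is marked with `.internal()`, so the follow-up plausibly amounts to the sketch below:

```scala
val PREFER_INTEGRAL_DIVISION = buildConf("spark.sql.function.preferIntegralDivision")
  .internal()  // hidden from user-facing docs; only used by the ported PostgreSQL tests
  .booleanConf
  .createWithDefault(false)
```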
### What changes were proposed in this pull request?

After #25158 and #25458, SQL features of PostgreSQL are introduced into Spark. AFAIK, both features are implementation-defined behaviors, which are not specified in ANSI SQL. In such a case, this proposal is to add a configuration `spark.sql.dialect` for choosing a database dialect. After this PR, Spark supports two database dialects, `Spark` and `PostgreSQL`. With the `PostgreSQL` dialect, Spark will:

1. perform integral division with the `/` operator if both sides are integral types;
2. accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input, and trim input, for the boolean data type.

### Why are the changes needed?

Unify the external database dialect behavior under one configuration, instead of many small flags.

### Does this PR introduce any user-facing change?

A new configuration `spark.sql.dialect` for choosing a database dialect.

### How was this patch tested?

Existing tests.

Closes #25697 from gengliangwang/dialect.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
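A hedged sketch of how the dialect config would be used; the values `Spark` and `PostgreSQL` come from the description above, and the printed results in the comments are assumptions (the config was later removed in SPARK-30125, described next):

```scala
// PostgreSQL dialect: `/` on two integral values performs integral division.
spark.conf.set("spark.sql.dialect", "PostgreSQL")
spark.sql("SELECT 10 / 3").collect()  // Array([3])

// Default Spark dialect: `/` stays fractional.
spark.conf.set("spark.sql.dialect", "Spark")
spark.sql("SELECT 10 / 3").collect()  // Array([3.3333333333333335])
```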
### What changes were proposed in this pull request?

Reprocess all PostgreSQL dialect related PRs, listed in order:

- #25158: PostgreSQL integral division support [revert]
- #25170: UT changes for the integral division support [revert]
- #25458: Accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type. [revert]
- #25697: Combine the above two feature flags into "spark.sql.dialect" [revert]
- #26112: Date subtraction support [keep the ANSI-compliant part]
- #26444: Rename config "spark.sql.ansi.enabled" to "spark.sql.dialect.spark.ansi.enabled" [revert]
- #26463: Cast to boolean support for PostgreSQL dialect [revert]
- #26584: Make the behavior of the PostgreSQL dialect independent of the ansi mode config [keep the ANSI-compliant part]

### Why are the changes needed?

As discussed in http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-PostgreSQL-dialect-td28417.html, we need to remove the PostgreSQL dialect from the code base for several reasons:

1. The current approach makes the codebase complicated and hard to maintain.
2. Fully migrating PostgreSQL workloads to Spark SQL is not our focus for now.

### Does this PR introduce any user-facing change?

Yes, the config `spark.sql.dialect` will be removed.

### How was this patch tested?

Existing UT.

Closes #26763 from xuanyuanking/SPARK-30125.

Lead-authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>