[SPARK-45071][SQL] Optimize the processing speed of BinaryArithmetic#dataType when processing multi-column data #42804


Closed
wants to merge 2 commits

Conversation

zzzzming95 (Contributor) commented Sep 4, 2023

What changes were proposed in this pull request?

Because `BinaryArithmetic#dataType` recursively recomputes the data type of every child node, the driver becomes very slow when an expression references many columns.

For example, the following code:

import spark.implicits._
import scala.util.Random
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{expr, sum}
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}

val N = 30
val M = 100

val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString)
val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5))

val schema = StructType(columns.map(StructField(_, IntegerType)))
val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_)))
val df = spark.createDataFrame(rdd, schema)
val colExprs = columns.map(sum(_))

// generate a new column that adds up the other 30 columns
df.withColumn("new_col_sum", expr(columns.mkString(" + ")))

With Spark 3.4 the driver takes a few minutes to execute this code, whereas with Spark 3.2 it takes only a few seconds. Related issue: SPARK-39316
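
The slowdown comes from the shape of the parsed expression: `expr(columns.mkString(" + "))` produces a left-deep chain of `Add` nodes, and an uncached, recursive `dataType` re-walks that chain on every call. The sketch below is a self-contained model of that pattern, not Spark's actual `BinaryArithmetic` code; it assumes an implementation that consults a child's type more than once per call (for example, once for a decimal-specific check and once for the default branch), which makes the cost grow roughly as 2^depth, while memoizing the result with a `lazy val` keeps it linear in the number of nodes.

```scala
// Self-contained model of the recursion pattern; these are NOT Spark's classes.
sealed trait Expr { def dataType: String }

final case class Leaf(name: String) extends Expr {
  def dataType: String = "int"
}

// Uncached: every call recomputes the children's types, and the left child is
// consulted twice (a hypothetical decimal check plus the default branch), so a
// left-deep chain of depth d costs on the order of 2^d calls.
final case class AddUncached(left: Expr, right: Expr) extends Expr {
  def dataType: String = {
    val l = left.dataType
    val r = right.dataType
    if (l == "decimal" && r == "decimal") "decimal"
    else left.dataType // second traversal of the left subtree
  }
}

// Cached: the same computation memoized with a lazy val, one visit per node.
final case class AddCached(left: Expr, right: Expr) extends Expr {
  lazy val dataType: String = {
    val l = left.dataType
    val r = right.dataType
    if (l == "decimal" && r == "decimal") "decimal" else l
  }
}

object DataTypeCostDemo extends App {
  val n = 30 // number of columns, as in the reproduction above
  val leaves = (1 to n).map(i => Leaf(s"c$i"))

  val cached = leaves.reduceLeft[Expr](AddCached(_, _))
  println(cached.dataType)   // fast: each node is computed once

  val uncached = leaves.reduceLeft[Expr](AddUncached(_, _))
  println(uncached.dataType) // roughly 2^(n-1) recursive calls; noticeably slow
}
```

Whether the actual fix caches the value, restructures the checks, or both is a detail of the PR diff; the sketch only illustrates why avoiding repeated recursion over the same subtree turns minutes back into seconds.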

Why are the changes needed?

Optimize the processing speed of BinaryArithmetic#dataType when processing multi-column data

Does this PR introduce any user-facing change?

No

How was this patch tested?

manual testing
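
As a rough guide, the snippet below sketches one way such a manual check can be timed from the Spark shell; it reuses `df` and `columns` from the reproduction above and is an illustration, not necessarily the exact procedure the author followed. Forcing `queryExecution.analyzed` makes the driver resolve the expression tree (the step where `BinaryArithmetic#dataType` is exercised) without running the job, so the same measurement can be compared on Spark 3.2, Spark 3.4, and a build with this change.

```scala
import org.apache.spark.sql.functions.expr

// Simple wall-clock timer for driver-side work.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label took ${(System.nanoTime() - start) / 1e9}%.2f s")
  result
}

// Resolve the 30-column sum expression on the driver only.
time("analyze 30-column sum expression") {
  df.withColumn("new_col_sum", expr(columns.mkString(" + ")))
    .queryExecution.analyzed
}
```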

Was this patch authored or co-authored using generative AI tooling?

no

@github-actions github-actions bot added the SQL label Sep 4, 2023
zzzzming95 (Contributor, Author)

@ulysses-you cc

wangyum (Member) commented Sep 5, 2023

cc @cloud-fan

zzzzming95 (Contributor, Author)

@cloud-fan @wangyum Please merge it to master, thanks.

@wangyum wangyum closed this in 16e813c Sep 6, 2023
wangyum pushed a commit that referenced this pull request Sep 6, 2023
[SPARK-45071][SQL] Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data

Closes #42804 from zzzzming95/SPARK-45071.

Authored-by: zzzzming95 <505306252@qq.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
(cherry picked from commit 16e813c)
Signed-off-by: Yuming Wang <yumwang@ebay.com>
wangyum pushed a commit that referenced this pull request Sep 6, 2023
[SPARK-45071][SQL] Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data

Closes #42804 from zzzzming95/SPARK-45071.

Authored-by: zzzzming95 <505306252@qq.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
(cherry picked from commit 16e813c)
Signed-off-by: Yuming Wang <yumwang@ebay.com>
wangyum (Member) commented Sep 6, 2023

Merged to master, branch-3.5 and branch-3.4.

viirya pushed a commit to viirya/spark-1 that referenced this pull request Oct 19, 2023
[SPARK-45071][SQL] Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data

Closes apache#42804 from zzzzming95/SPARK-45071.

Authored-by: zzzzming95 <505306252@qq.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
(cherry picked from commit 16e813c)
Signed-off-by: Yuming Wang <yumwang@ebay.com>
(cherry picked from commit a96804b)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
zml1206 pushed a commit to zml1206/spark that referenced this pull request May 7, 2025
[SPARK-45071][SQL] Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data

Closes apache#42804 from zzzzming95/SPARK-45071.

Authored-by: zzzzming95 <505306252@qq.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>