[ARROW] Arrow serialization should not introduce extra shuffle for outermost limit #4662

Status: Closed (wants to merge 28 commits)
Changes from 1 commit

Commits (28)
a584943  arrow take (cfmcgrady, Mar 23, 2023)
8593d85  driver slice last batch (cfmcgrady, Mar 24, 2023)
0088671  refine (cfmcgrady, Mar 29, 2023)
ed8c692  refactor (cfmcgrady, Apr 3, 2023)
4212a89  refactor and add ut (cfmcgrady, Apr 4, 2023)
6c5b1eb  add ut (cfmcgrady, Apr 4, 2023)
ee5a756  revert unnecessary changes (cfmcgrady, Apr 4, 2023)
4e7ca54  unnecessary changes (cfmcgrady, Apr 4, 2023)
885cf2c  infer row size by schema.defaultSize (cfmcgrady, Apr 4, 2023)
25e4f05  add docs (cfmcgrady, Apr 4, 2023)
03d0747  address comment (cfmcgrady, Apr 6, 2023)
2286afc  reflective call AdaptiveSparkPlanExec.finalPhysicalPlan (cfmcgrady, Apr 6, 2023)
81886f0  address comment (cfmcgrady, Apr 6, 2023)
e3bf84c  refactor (cfmcgrady, Apr 6, 2023)
d70aee3  SparkPlan.session -> SparkSession.active to adapt Spark-3.1.x (cfmcgrady, Apr 6, 2023)
4cef204  SparkArrowbasedOperationSuite adapt Spark-3.1.x (cfmcgrady, Apr 6, 2023)
573a262  fix (cfmcgrady, Apr 6, 2023)
c83cf3f  SparkArrowbasedOperationSuite adapt Spark-3.1.x (cfmcgrady, Apr 6, 2023)
9ffb44f  make toBatchIterator private (cfmcgrady, Apr 6, 2023)
b72bc6f  add offset support to adapt Spark-3.4.x (cfmcgrady, Apr 6, 2023)
22cc70f  add ut (cfmcgrady, Apr 6, 2023)
8280783  add `isStaticConfigKey` to adapt Spark-3.1.x (cfmcgrady, Apr 7, 2023)
6d596fc  address comment (cfmcgrady, Apr 7, 2023)
6064ab9  limit = 0 test case (cfmcgrady, Apr 7, 2023)
3700839  SparkArrowbasedOperationSuite adapt Spark-3.1.x (cfmcgrady, Apr 7, 2023)
facc13f  exclude rule OptimizeLimitZero (cfmcgrady, Apr 7, 2023)
130bcb1  finally close (cfmcgrady, Apr 7, 2023)
82c912e  close vector (cfmcgrady, Apr 8, 2023)
add offset support to adapt Spark-3.4.x
cfmcgrady committed Apr 6, 2023
commit b72bc6fb2da63bd91b12ab7f237a848b37b5b1ce
@@ -42,7 +42,8 @@ object SparkDatasetHelper {
   def executeArrowBatchCollect: SparkPlan => Array[Array[Byte]] = {
     case adaptiveSparkPlan: AdaptiveSparkPlanExec =>
       executeArrowBatchCollect(finalPhysicalPlan(adaptiveSparkPlan))
-    case collectLimit: CollectLimitExec =>
+    // TODO: avoid extra shuffle if `offset` > 0
+    case collectLimit: CollectLimitExec if offset(collectLimit) <= 0 =>
       doCollectLimit(collectLimit)
     case plan: SparkPlan =>
       toArrowBatchRdd(plan).collect()
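An aside for context (not part of the diff): the guarded branch matters because a bare outermost LIMIT collected on the driver is served by Spark's take-style path, while lowering the same plan to an RDD first (the toArrowBatchRdd fallback) plans the limit as LocalLimit + Exchange + GlobalLimit, i.e. the extra shuffle this PR avoids. A minimal sketch with public Dataset APIs, illustrative names only:

import org.apache.spark.sql.SparkSession

object OutermostLimitSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("limit-sketch").getOrCreate()
    val df = spark.range(0, 1000, 1, 10).limit(10)
    // Collected directly, the limit runs take-style on the driver: no exchange.
    df.collect()
    // Converted to an RDD first, the limit is planned with a shuffle down to a
    // single partition before rows can be taken.
    df.rdd.collect()
    spark.stop()
  }
}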
@@ -193,11 +194,24 @@ object SparkDatasetHelper {
     val result = fun(plan)
     val finalPlanUpdate = DynMethods.builder("finalPlanUpdate")
       .hiddenImpl(adaptiveSparkPlanExec.getClass)
-      .build(adaptiveSparkPlanExec)
+      .build()
+    finalPlanUpdate.invoke[Unit](adaptiveSparkPlanExec)
     result
   }
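Note the fix in this hunk: the handle is now built unbound and the receiver is supplied at invocation time. For readers unfamiliar with the DynMethods builder, a plain-reflection equivalent (hypothetical helper, not Kyuubi code) looks like:

// Invoke a private no-arg method by name on an arbitrary receiver, the
// plain-reflection analogue of hiddenImpl(...) + build() + invoke(receiver).
def invokeHidden(target: AnyRef, name: String): AnyRef = {
  val m = target.getClass.getDeclaredMethod(name) // getDeclaredMethod sees private members
  m.setAccessible(true)                           // lift the Java access check
  m.invoke(target)
}
// usage: invokeHidden(adaptiveSparkPlanExec, "finalPlanUpdate")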

+  /**
+   * Offset support was added in Spark 3.4 (see SPARK-28330). To ensure backward compatibility
+   * with earlier versions of Spark, this function calls `offset` reflectively and falls back
+   * to 0 when the method does not exist.
+   */
+  private def offset(collectLimitExec: CollectLimitExec): Int = {
+    val offset = DynMethods.builder("offset")
+      .impl(collectLimitExec.getClass)
+      .orNoop()
+      .build()
+    Option(offset.invoke[Int](collectLimitExec))
+      .getOrElse(0)
+  }
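The orNoop() above is what makes this safe on Spark < 3.4: the built handle degrades to a no-op that returns null, which Option(...).getOrElse(0) turns into a default. A self-contained sketch of the same fall-back-to-default pattern using plain reflection (illustrative names, not the PR's code):

import scala.util.Try

object OffsetFallbackSketch {
  // Call a public no-arg method if it exists on the target, else return default.
  def intOrDefault(target: AnyRef, method: String, default: Int): Int =
    Try(target.getClass.getMethod(method).invoke(target))
      .map(_.asInstanceOf[Int])
      .getOrElse(default)

  class NewStyle { def offset(): Int = 10 } // stand-in for Spark 3.4+ CollectLimitExec
  class OldStyle                            // stand-in for Spark < 3.4: no offset method

  def main(args: Array[String]): Unit = {
    assert(intOrDefault(new NewStyle, "offset", 0) == 10)
    assert(intOrDefault(new OldStyle, "offset", 0) == 0)
  }
}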

   /**
    * Refer to org.apache.spark.sql.Dataset#withAction(): assign a new execution id for the
    * arrow-based operation, so that we can track arrow-based queries on the UI tab.
@@ -204,6 +204,26 @@ class SparkArrowbasedOperationSuite extends WithSparkSQLEngine with SparkDataTyp
     }
   }

+  test("result offset support") {
+    assume(SPARK_ENGINE_RUNTIME_VERSION > "3.3")
+    var numStages = 0
+    val listener = new SparkListener {
+      override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
+        numStages = jobStart.stageInfos.length
+      }
+    }
+    withJdbcStatement() { statement =>
+      withSparkListener(listener) {
+        withPartitionedTable("t_3") {
+          statement.executeQuery("select * from t_3 limit 10 offset 10")
+        }
+        KyuubiSparkContextHelper.waitListenerBus(spark)
+      }
+    }
+    // an extra shuffle is introduced when `offset` > 0
+    assert(numStages == 2)
+  }
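The assertion expects two stages because LIMIT ... OFFSET must drop the first rows globally, which forces a shuffle; the companion test below checks that the plain-limit path stays single-stage. The stage-counting technique can be reproduced outside the Kyuubi harness roughly as follows (assumed standalone setup; AQE disabled so the whole query runs as one job):

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}
import org.apache.spark.sql.SparkSession

object StageCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("stage-count-sketch")
      .config("spark.sql.adaptive.enabled", "false")
      .getOrCreate()
    var numStages = 0
    spark.sparkContext.addSparkListener(new SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit =
        numStages = jobStart.stageInfos.length
    })
    // A grouped aggregation needs exactly one shuffle, hence a two-stage job.
    spark.range(0, 1000, 1, 10).groupBy("id").count().collect()
    Thread.sleep(1000) // crude wait for async listener delivery; the test above
                       // uses KyuubiSparkContextHelper.waitListenerBus(spark)
    println(s"stages: $numStages") // expected: 2
    spark.stop()
  }
}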

test("arrow serialization should not introduce extra shuffle for outermost limit") {
var numStages = 0
val listener = new SparkListener {