
[SPARK-41407][SQL] Pull out v1 write to WriteFiles #38939


Closed
ulysses-you wants to merge 4 commits into master from v1write-plan

Conversation

ulysses-you (Contributor) commented on Dec 6, 2022

### What changes were proposed in this pull request?

This PR pulls the details of v1 file writes out into a new operator: WriteFiles (logical) and WriteFilesExec (physical). This will let v1 write files support whole-stage codegen in the future.

Introduce WriteFilesSpec to hold all the v1 write-files information:

```scala
case class WriteFilesSpec(
    description: WriteJobDescription,
    committer: FileCommitProtocol,
    concurrentOutputWriterSpecFunc: SparkPlan => Option[ConcurrentOutputWriterSpec])
  extends WriteSpec
```
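As a rough usage sketch, the write path assembles this spec on the driver and hands it to the physical write node. Names such as `createConcurrentOutputWriterSpec` and `writeFilesExec` below are assumptions for illustration, not the merged code:

```scala
// Hypothetical assembly inside FileFormatWriter (illustrative only).
val writeSpec = WriteFilesSpec(
  description = description,   // WriteJobDescription: job conf, format, output spec, ...
  committer = committer,       // FileCommitProtocol for setup/commit/abort
  concurrentOutputWriterSpecFunc =
    sortPlan => createConcurrentOutputWriterSpec(sortPlan))  // assumed helper

// The physical plan (WriteFilesExec) runs the tasks and returns commit messages.
val messages: RDD[WriterCommitMessage] = writeFilesExec.executeWrite(writeSpec)
```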

To stay compatible with the existing code path, this PR adds a new method executeWrite to SparkPlan:

```scala
def executeWrite(writeSpec: WriteSpec): RDD[WriterCommitMessage]
```
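For context, this pairs with an overridable hook named doExecuteWrite (quoted in the review thread below), following SparkPlan's existing execute/doExecute convention. The fallback body here is an assumption for illustration:

```scala
def executeWrite(writeSpec: WriteSpec): RDD[WriterCommitMessage] = executeQuery {
  doExecuteWrite(writeSpec)
}

// Concrete implementations (e.g. WriteFilesExec) override this hook;
// other nodes do not support writing and fail fast.
protected def doExecuteWrite(writeSpec: WriteSpec): RDD[WriterCommitMessage] =
  throw new UnsupportedOperationException(s"$nodeName does not implement doExecuteWrite")
```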

Refactor FileFormatWriter to make the write-files flow clearer:

  • execute the write using the old code path
  • execute the write using SparkPlan.executeWrite
  • extract a writeAndCommit method shared by both code paths (see the sketch below)
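A hedged sketch of what the shared writeAndCommit helper could look like; the signature and field names are assumptions based on FileFormatWriter's existing structure, not the verbatim merged code:

```scala
// Job-level bracket shared by both paths: callers supply only the
// task-running part via `f` (old code path or SparkPlan.executeWrite).
private def writeAndCommit(
    job: Job,
    description: WriteJobDescription,
    committer: FileCommitProtocol)(f: => Array[WriteTaskResult]): Set[String] = {
  committer.setupJob(job)
  try {
    val taskResults = f  // run the write tasks
    committer.commitJob(job, taskResults.map(_.commitMsg))
    taskResults.flatMap(_.summary.updatedPartitions).toSet
  } catch {
    case cause: Throwable =>
      committer.abortJob(job)  // best-effort cleanup before rethrowing
      throw cause
  }
}
```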

### Why are the changes needed?

This is preparation work for supporting whole-stage codegen in v1 writes.

### Does this PR introduce _any_ user-facing change?

For users: no.

For developers: yes:

  • adds a new method executeWrite to SparkPlan
  • adds a new interface WriteSpec

### How was this patch tested?

Passes CI with spark.sql.optimizer.plannedWrite.enabled both on and off.

github-actions bot added the SQL label on Dec 6, 2022
ulysses-you force-pushed the v1write-plan branch 3 times, most recently from 1afe010 to 80eb7f4 on December 7, 2022 09:22
ulysses-you changed the title from [WIP][SPARK-41407][SQL] Pull out v1 write to WriteFiles to [SPARK-41407][SQL] Pull out v1 write to WriteFiles on Dec 7, 2022
```java
 *
 * @since 3.4.0
 */
public interface WriteSpec extends Serializable {}
```
Contributor:
does it need to be a public DS v2 API?

ulysses-you (Contributor, Author):
It is actually not for v2. The newly added method involves two interfaces:

`def executeWrite(writeSpec: WriteSpec): RDD[WriterCommitMessage]`

I just made WriteSpec a Java interface to be consistent with WriterCommitMessage.

Contributor:
Then this is in the wrong package; it should be put in an internal package.

ulysses-you (Contributor, Author):
changed to internal package

```scala
 * [[WriteFiles]] must be the root plan as the child of [[V1WriteCommand]].
 */
case class WriteFiles(child: LogicalPlan) extends UnaryNode {
  override def output: Seq[Attribute] = child.output
```
Contributor:
The output should match the physical plan and be Nil as well.

ulysses-you (Contributor, Author):
It should be child.output; otherwise DataWritingCommand cannot work. WriteFiles is the child of DataWritingCommand, and DataWritingCommand uses its child's output as the final output at the planning phase.
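To illustrate the shape being discussed (tree sketch reconstructed from this thread, not from the source):

```
InsertIntoHadoopFsRelationCommand    -- a DataWritingCommand / V1WriteCommand
+- WriteFiles                        -- output = child.output, read back by the command
   +- <query>                        -- possibly with a Sort for the required ordering
```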

```scala
 *
 * Concrete implementations of SparkPlan should override `doExecuteWrite`.
 */
def executeWrite(writeSpec: WriteSpec): RDD[WriterCommitMessage] = executeQuery {
```
Contributor:
Regardless of how hard it is to implement, which information should ideally live in the WriteFiles operator and which should be passed as parameters?

ulysses-you (Contributor, Author) commented on Dec 8, 2022:

Let me list what the current v1 write-files path requires:

  • WriteJobDescription, includes hadoop job (hadoop conf), fileFormat, outputSpec, partitionColumns, bucketSpec, options, statsTrackers
  • FileCommitProtocol, includes output path, dynamic partition overwrite flag
  • ConcurrentOutputWriterSpec, includes requiredOrdering, bucketSpec, physical sortPlan

Judging from the existing datasource v1 write command, WriteFiles should hold at least: FileFormat, OutputSpec, partitionColumns, bucketSpec, options, requiredOrdering. For reference:

```scala
case class InsertIntoHadoopFsRelationCommand(
    outputPath: Path,
    staticPartitions: TablePartitionSpec,
    ifPartitionNotExists: Boolean,
    partitionColumns: Seq[Attribute],
    bucketSpec: Option[BucketSpec],
    fileFormat: FileFormat,
    options: Map[String, String],
    query: LogicalPlan,
    mode: SaveMode,
    catalogTable: Option[CatalogTable],
    fileIndex: Option[FileIndex],
    outputColumnNames: Seq[String])
```

Since we cannot get the physical plan on the logical side, and ConcurrentOutputWriterSpec depends on the physical plan, it should be held in WriteFilesSpec.

FileCommitProtocol should be held in WriteFilesSpec, because WriteFiles only does the task-level part of the pipeline setup job -> setup task -> commit task -> commit job. The same reasoning applies to statsTrackers.

Given how the hadoop job is used (FileCommitProtocol.setup(Job)), I tend to make WriteFilesSpec hold the hadoop job and hadoop conf.


In sum:

  • WriteFiles: FileFormat, OutputSpec, partitionColumns, bucketSpec, options and requiredOrdering.
  • WriteFilesSpec: FileCommitProtocol, statsTrackers, ConcurrentOutputWriterSpec, hadoop job and hadoop conf.

Note: the above does not consider implementation difficulty; it is a purely semantic-level split.
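A hypothetical sketch of that semantic split on the logical node, with the field list taken from the summary above (this PR itself keeps WriteFiles(child: LogicalPlan) minimal, as quoted earlier):

```scala
case class WriteFiles(
    child: LogicalPlan,
    fileFormat: FileFormat,
    outputSpec: OutputSpec,
    partitionColumns: Seq[Attribute],
    bucketSpec: Option[BucketSpec],
    options: Map[String, String],
    requiredOrdering: Seq[SortOrder]) extends UnaryNode {
  override def output: Seq[Attribute] = child.output
  override protected def withNewChildInternal(newChild: LogicalPlan): WriteFiles =
    copy(child = newChild)
}
```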

```scala
session.sparkContext.runJob(
  rdd,
  (context: TaskContext, iter: Iterator[WriterCommitMessage]) => {
    assert(iter.hasNext)
```
Contributor:
We should make sure this iterator has only one element.

ulysses-you (Contributor, Author):
addressed
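A minimal sketch of the strengthened check (variable names assumed):

```scala
(context: TaskContext, iter: Iterator[WriterCommitMessage]) => {
  assert(iter.hasNext)
  val commitMessage = iter.next()
  // Each write task must produce exactly one commit message.
  assert(!iter.hasNext, "Expected exactly one WriterCommitMessage per task")
  commitMessage
}
```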

github-actions bot added the CORE label on Dec 8, 2022
```diff
@@ -785,7 +785,7 @@ private[sql] object QueryExecutionErrors extends QueryErrorsBase {
   def taskFailedWhileWritingRowsError(cause: Throwable): Throwable = {
     new SparkException(
       errorClass = "_LEGACY_ERROR_TEMP_2054",
-      messageParameters = Map.empty,
+      messageParameters = Map("message" -> cause.getMessage),
```
ulysses-you (Contributor, Author):

@cloud-fan after we introduce WriteFiles, the error stack for writing changes.

before:

```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 15.0 failed 1 times, most recent failure: Lost task 0.0 in stage 15.0 (TID 15) (10.221.97.76 executor driver): java.lang.RuntimeException: Exceeds char/varchar type length limitation: 5
	at org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils.trimTrailingSpaces(CharVarcharCodegenUtils.java:30)
	at org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils.charTypeWriteSideCheck(CharVarcharCodegenUtils.java:43)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
```

after:

```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 15.0 failed 1 times, most recent failure: Lost task 0.0 in stage 15.0 (TID 15) (10.221.97.76 executor driver): org.apache.spark.SparkException: Task failed while writing rows.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:789)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:416)
	at org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:89)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
	at org.apache.spark.scheduler.Task.run(Task.scala:139)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1502)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.RuntimeException: Exceeds char/varchar type length limitation: 5
	at org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils.trimTrailingSpaces(CharVarcharCodegenUtils.java:30)
	at org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils.charTypeWriteSideCheck(CharVarcharCodegenUtils.java:43)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
```

So I added the root-cause message into the wrapped SparkException.
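For context, a simplified sketch of where this wrapping happens, reconstructed from the stack trace above (not the verbatim FileFormatWriter.executeTask source; dataWriter method names are assumptions):

```scala
// Simplified: the task body runs the writer, and any failure is wrapped so
// the SparkException now carries the root cause's message.
try {
  dataWriter.writeWithIterator(iterator)
  dataWriter.commit()
} catch {
  case t: Throwable =>
    throw QueryExecutionErrors.taskFailedWhileWritingRowsError(t)
}
```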

```diff
@@ -145,6 +145,12 @@ case class CreateDataSourceTableAsSelectCommand(
     outputColumnNames: Seq[String])
   extends V1WriteCommand {
 
+  override def fileFormatProvider: Boolean = {
+    table.provider.forall { provider =>
+      classOf[FileFormat].isAssignableFrom(DataSource.providingClass(provider, conf))
```
ulysses-you (Contributor, Author) commented on Dec 8, 2022:

CreateDataSourceTableAsSelectCommand is not only used to write files; we should only plan v1 writes whose provider is a FileFormat.
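A hypothetical sketch of how a planning rule could consult this flag (the rule shape and the `withNewQuery` helper are assumptions, purely illustrative):

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Illustrative only: pull writes out into WriteFiles for file-based providers,
// and leave non-file providers (e.g. JDBC behind a CTAS) on the old path.
object PlanV1Writes extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformUp {
    case cmd: V1WriteCommand
        if cmd.fileFormatProvider && !cmd.query.isInstanceOf[WriteFiles] =>
      cmd.withNewQuery(WriteFiles(cmd.query))  // assumed helper
  }
}
```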

cloud-fan (Contributor):

thanks, merging to master!

cloud-fan closed this in 2ffa817 on Dec 23, 2022
HyukjinKwon added a commit that referenced this pull request Dec 23, 2022
…riter.executeTask

### What changes were proposed in this pull request?

This PR is a follow-up of #38939 that fixes a logical conflict from merging PRs; see #38980 and #38939.

### Why are the changes needed?

To recover the broken build.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested:

```
 ./build/sbt -Phive clean package
```

Closes #39194 from HyukjinKwon/SPARK-41407.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
ulysses-you deleted the v1write-plan branch on December 26, 2022 05:37