[SPARK-18025] Use commit protocol API in structured streaming #15710

rxin · 2016-11-01T05:27:41Z

What changes were proposed in this pull request?

This patch adds a new commit protocol implementation ManifestFileCommitProtocol that follows the existing streaming flow, and uses it in FileStreamSink to consolidate the write path in structured streaming with the batch mode write path.

This deletes a lot of code and should make it trivial to support other functionality that is currently available in batch but not in streaming, including all file formats and bucketing.
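As a rough illustration of the manifest-style commit flow described above, here is a minimal sketch in plain Scala. It does not use Spark's actual classes or signatures: `TaskCommit`, `ManifestCommitSketch`, and the in-memory `manifest` map are hypothetical stand-ins, and the real ManifestFileCommitProtocol may differ in every detail.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the message a task sends back to the driver on commit.
final case class TaskCommit(files: Seq[String])

// Hypothetical, simplified protocol: each task reports the files it wrote,
// and the job commit records all of them under the batch id in one manifest entry.
// The mutable map is an assumption standing in for a durable manifest log.
class ManifestCommitSketch(manifest: mutable.Map[Long, Seq[String]]) {

  // Task side: package up the files this task added (possibly none).
  def commitTask(addedFiles: Seq[String]): TaskCommit = TaskCommit(addedFiles)

  // Driver side: merge all task messages and write a single entry for the batch.
  def commitJob(batchId: Long, commits: Seq[TaskCommit]): Unit =
    manifest(batchId) = commits.flatMap(_.files)
}
```

Under these assumptions, a reader of the output would consult only the manifest, so files written by tasks whose job never committed would not be visible.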

How was this patch tested?

Should be covered by existing tests.

rxin · 2016-11-01T05:44:15Z

cc @ericl, @marmbrus, @zsxwing and @lw-lin (I guess this would supersede your old PR).

rxin · 2016-11-01T05:45:42Z

...ain/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala

        isAppend)
+
+      WriteOutput.write(


I'm thinking I should just rename WriteOutput to FileFormatOutput

SparkQA · 2016-11-01T06:23:42Z

Test build #3387 has finished for PR 15710 at commit e9823e7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-01T06:25:05Z

Test build #67873 has finished for PR 15710 at commit e9823e7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-01T06:34:10Z

Test build #67874 has finished for PR 15710 at commit 1c906c9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-01T08:23:41Z

Test build #67877 has finished for PR 15710 at commit 1c3a645.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-01T21:13:33Z

Test build #67913 has finished for PR 15710 at commit a2ea180.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-01T21:13:40Z

Test build #67912 has finished for PR 15710 at commit 0742318.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

marmbrus · 2016-11-02T00:54:46Z

...ore/src/main/scala/org/apache/spark/sql/execution/streaming/ManifestFileCommitProtocol.scala

+  }
+
+  override def commitTask(taskContext: TaskAttemptContext): TaskCommitMessage = {
+    if (addedFiles.nonEmpty) {


Is this just an optimization to avoid instantiating the fs for empty writes?

I was copying the same logic from before, but I think so...

Actually, the other thing is that we are using the head. Technically we could use headOption and then map over it, but that would be pretty weird.
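For comparison, the two forms discussed in this thread can be sketched side by side. This is only an illustration: `FakeFs`, `openFs`, and the `opened` counter are hypothetical stand-ins for obtaining a file system handle, not Hadoop or Spark APIs, and the body of the real commitTask is not shown in the diff excerpt above.

```scala
// Hypothetical stand-in for a file system handle.
final case class FakeFs(id: Int) {
  def status(path: String): String = s"status($path)"
}

object CommitTaskSketch {
  // Counts how many times a handle was created, to show that empty writes create none.
  var opened = 0

  def openFs(path: String): FakeFs = { opened += 1; FakeFs(opened) }

  // Form in the diff: guard with nonEmpty, then it is safe to call head.
  def withGuard(addedFiles: Seq[String]): Seq[String] =
    if (addedFiles.nonEmpty) {
      val fs = openFs(addedFiles.head)
      addedFiles.map(fs.status)
    } else {
      Seq.empty
    }

  // Alternative mentioned in the reply: headOption, then map over it.
  def withHeadOption(addedFiles: Seq[String]): Seq[String] =
    addedFiles.headOption
      .map { first =>
        val fs = openFs(first)
        addedFiles.map(fs.status)
      }
      .getOrElse(Seq.empty)
}
```

Both versions return the same result and skip creating a handle for an empty write; the headOption form avoids the partial head call at the cost of a less direct reading.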

marmbrus · 2016-11-02T00:57:57Z

.../src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOutputWriter.scala

-  }
-}
-
+import org.apache.spark.sql.execution.datasources.OutputWriter


Why is this down here?

It's not. This is the top.

Oh... I see.

marmbrus · 2016-11-02T01:01:42Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSinkSuite.scala

 import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
 import org.apache.spark.util.Utils

 class FileStreamSinkSuite extends StreamTest {
  import testImplicits._

-
-  test("FileStreamSinkWriter - unpartitioned data") {


What about these tests?

They were testing code that's been deleted completely and is now purely redundant with all the tests we have for the batch write path.

marmbrus · 2016-11-02T01:05:37Z

LGTM

marmbrus · 2016-11-02T01:06:52Z

Thanks, merging to master.

What changes were proposed in this pull request?

This patch adds a new commit protocol implementation ManifestFileCommitProtocol that follows the existing streaming flow, and uses it in FileStreamSink to consolidate the write path in structured streaming with the batch mode write path.

This deletes a lot of code, and would make it trivial to support other functionalities that are currently available in batch but not in streaming, including all file formats and bucketing.

How was this patch tested?

Should be covered by existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes apache#15710 from rxin/SPARK-18025.

rxin added 5 commits October 31, 2016 22:24

[SPARK-18025] Use commit protocol API in structured streaming

416ad5f

Slightly shorter line

ed5e5bc

Delete more code

70b13e0

Updated documentation

e9823e7

Configurable commit protocol

1c906c9

Unit is not serializable.

1c3a645

rxin mentioned this pull request Nov 1, 2016

[SPARK-18192] Support all file formats in structured streaming #15711

Closed

rxin added 2 commits November 1, 2016 11:53

Use the proper commit protocol for streaming

0742318

Rename WriteOutput

a2ea180

marmbrus reviewed Nov 2, 2016

asfgit closed this in 77a9816 Nov 2, 2016
