[SPARK-18042][SQL] OutputWriter should expose file path written #15580

Closed · wants to merge 3 commits

Conversation

@rxin (Contributor) commented Oct 21, 2016

What changes were proposed in this pull request?

This patch adds a new "path" method on OutputWriter that returns the path of the file written by the OutputWriter. This is part of the work needed to consolidate the structured streaming and batch write paths.

The batch write path has a nice feature: each data source can define the file extension, while Spark specifies the staging directory and the file name prefix. In the streaming path, however, we need to collect the list of files written, and there is currently no interface for doing that.
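The proposal can be sketched as follows. This is a hedged illustration only: `DummyWriter`, its constructor parameters, and the simplified `close` signature are hypothetical stand-ins, not the actual Spark classes; only the `path` method corresponds to what this PR adds.

```scala
// Hedged sketch: OutputWriter exposes the full path of the file it writes,
// so a caller (e.g. a streaming sink) can collect the list of written files.
abstract class OutputWriter {
  /** Full path of the written file, including staging directory and prefix. */
  def path: String
  def close(): Unit
}

// Hypothetical writer: the data source chooses the extension, while the
// caller supplies the staging directory and the file name prefix.
class DummyWriter(stagingDir: String, fileNamePrefix: String) extends OutputWriter {
  private val extension = ".snappy.parquet" // format-specific choice
  override val path: String = s"$stagingDir/$fileNamePrefix$extension"
  override def close(): Unit = ()
}

object PathDemo extends App {
  val writer = new DummyWriter("/tmp/_staging", "part-00000")
  println(writer.path) // the sink can now record this file
}
```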

How was this patch tested?

N/A - there is no behavior change and this should be covered by existing tests.

@rxin (Contributor, Author) commented Oct 21, 2016

cc @hvanhovell @cloud-fan

and @ericl

    new SerializableConfiguration(conf)
  }

  /**
   * Returns an [[OutputWriter]] that writes data to the given path without using an
   * [[OutputCommitter]].
   */
- override def newWriter(path: String): OutputWriter = new OutputWriter {
+ override def newWriter(path1: String): OutputWriter = new OutputWriter {
Contributor:

how about _path? path1 looks weird...

Contributor:
or maybe we can create a class for this OutputWriter here.

new ParquetOutputFormat[InternalRow]() {
  override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
    new Path(stagingDir, fileNamePrefix + extension)
  }
Contributor:
now we never use the context and extension parameters?

Contributor (Author):
yes

Contributor (Author):
I will add some documentation in my next pull request.
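The pattern discussed above can be illustrated with a self-contained sketch. The types here are simplified stand-ins (not the real Hadoop/Parquet classes), and the constructor parameters are assumed from the diff context: once the work file name is pinned to the staging directory plus prefix, the framework-supplied parameters go unused, as the review comment notes.

```scala
// Simplified stand-in for Hadoop's Path, for illustration only.
case class Path(parent: String, child: String) {
  override def toString: String = s"$parent/$child"
}

// Stand-in for an output format whose work-file name is fixed up front.
class FixedNameOutputFormat(stagingDir: String, fileNamePrefix: String, ext: String) {
  // Mirrors the shape of getDefaultWorkFile(context, extension); both
  // parameters are ignored because the file name is fully determined
  // by the values captured at construction time.
  def getDefaultWorkFile(context: AnyRef, extension: String): Path =
    Path(stagingDir, fileNamePrefix + ext)
}

object WorkFileDemo extends App {
  val fmt = new FixedNameOutputFormat("/tmp/_staging", "part-00000", ".parquet")
  println(fmt.getDefaultWorkFile(null, ".ignored"))
}
```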

@SparkQA commented Oct 21, 2016

Test build #67330 has finished for PR 15580 at commit 1942361.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor, Author) commented Oct 21, 2016

cc @HyukjinKwon

* The path of the file to be written out. This path should include the staging directory and
* the file name prefix passed into the associated createOutputWriter function.
*/
def path: String
Contributor:
fullOutputPath?

@SparkQA commented Oct 21, 2016

Test build #67344 has finished for PR 15580 at commit d3ddaf7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ericl (Contributor) commented Oct 22, 2016

lgtm

@rxin (Contributor, Author) commented Oct 22, 2016

Thanks - I'm going to merge this. I will address the doc and naming comments in the next PR in this series.

@asfgit asfgit closed this in 3fbf5a5 Oct 22, 2016
@@ -35,7 +35,7 @@ private[parquet] class ParquetOptions(
* Compression codec to use. By default use the value specified in SQLConf.
* Acceptable values are defined in [[shortParquetCompressionCodecNames]].
*/
val compressionCodec: String = {
val compressionCodecClassName: String = {
Member:
@rxin This is a super minor but there are the same options to potentially rename in OrcOptions, JsonOptions, CSVOptions and TextFileFormat - TextFileFormat.scala#L71.

Also, I'd like to note, just in case, that the value here is actually not a class name (it's something like SNAPPY or LZO) in the case of ParquetOptions and OrcOptions, whereas the text-based ones use actual class names.

@HyukjinKwon (Member) commented Oct 22, 2016

@rxin BTW, could I ask you to include #14529 in your future related PRs (if it looks reasonable)? I will close it if you do. (Of course, it is also fine to leave it out if you are uncertain about the change.)

@rxin (Contributor, Author) commented Oct 22, 2016

@HyukjinKwon sure.

robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016

Author: Reynold Xin <rxin@databricks.com>

Closes apache#15580 from rxin/SPARK-18042.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017