[SPARK-18042][SQL] OutputWriter should expose file path written #15580

Closed · wants to merge 3 commits

Conversation

@rxin (Contributor) commented Oct 21, 2016

What changes were proposed in this pull request?

This patch adds a new "path" method on OutputWriter that returns the path of the file written by the OutputWriter. This is part of the work needed to consolidate the structured streaming and batch write paths.

The batch write path has a nice feature: each data source can define the file extension, while Spark specifies the staging directory and the file name prefix. In the streaming path, however, we need to collect the list of files written, and there is currently no interface for doing that.
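The proposal can be sketched as follows. This is a hedged illustration only: `DummyWriter`, its constructor parameters, and the simplified `close` signature are hypothetical stand-ins, not the actual Spark classes; only the `path` method corresponds to what this PR adds.

```scala
// Hedged sketch: OutputWriter exposes the full path of the file it writes,
// so a caller (e.g. a streaming sink) can collect the list of written files.
abstract class OutputWriter {
  /** Full path of the written file, including staging directory and prefix. */
  def path: String
  def close(): Unit
}

// Hypothetical writer: the data source chooses the extension, while the
// caller supplies the staging directory and the file name prefix.
class DummyWriter(stagingDir: String, fileNamePrefix: String) extends OutputWriter {
  private val extension = ".snappy.parquet" // format-specific choice
  override val path: String = s"$stagingDir/$fileNamePrefix$extension"
  override def close(): Unit = ()
}

object PathDemo extends App {
  val writer = new DummyWriter("/tmp/_staging", "part-00000")
  println(writer.path) // the sink can now record this file
}
```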

How was this patch tested?

N/A - there is no behavior change and this should be covered by existing tests.

@rxin (Contributor, Author) commented Oct 21, 2016

cc @hvanhovell @cloud-fan

and @ericl

    new SerializableConfiguration(conf)
  }

  /**
   * Returns an [[OutputWriter]] that writes data to the given path without using an
   * [[OutputCommitter]].
   */
- override def newWriter(path: String): OutputWriter = new OutputWriter {
+ override def newWriter(path1: String): OutputWriter = new OutputWriter {
Contributor:

how about _path? path1 looks weird...

Contributor:
or maybe we can create a class for this OutputWriter here.

new ParquetOutputFormat[InternalRow]() {
  override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
    new Path(stagingDir, fileNamePrefix + extension)
  }
Contributor:
now we never use the context and extension parameters?

Contributor (Author):
yes

Contributor (Author):
I will add some documentation in my next pull request.
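The pattern discussed above can be illustrated with a self-contained sketch. The types here are simplified stand-ins (not the real Hadoop/Parquet classes), and the constructor parameters are assumed from the diff context: once the work file name is pinned to the staging directory plus prefix, the framework-supplied parameters go unused, as the review comment notes.

```scala
// Simplified stand-in for Hadoop's Path, for illustration only.
case class Path(parent: String, child: String) {
  override def toString: String = s"$parent/$child"
}

// Stand-in for an output format whose work-file name is fixed up front.
class FixedNameOutputFormat(stagingDir: String, fileNamePrefix: String, ext: String) {
  // Mirrors the shape of getDefaultWorkFile(context, extension); both
  // parameters are ignored because the file name is fully determined
  // by the values captured at construction time.
  def getDefaultWorkFile(context: AnyRef, extension: String): Path =
    Path(stagingDir, fileNamePrefix + ext)
}

object WorkFileDemo extends App {
  val fmt = new FixedNameOutputFormat("/tmp/_staging", "part-00000", ".parquet")
  println(fmt.getDefaultWorkFile(null, ".ignored"))
}
```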

@SparkQA commented Oct 21, 2016

Test build #67330 has finished for PR 15580 at commit 1942361.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor, Author) commented Oct 21, 2016

cc @HyukjinKwon

* The path of the file to be written out. This path should include the staging directory and
* the file name prefix passed into the associated createOutputWriter function.
*/
def path: String
Contributor:
fullOutputPath?

@SparkQA commented Oct 21, 2016

Test build #67344 has finished for PR 15580 at commit d3ddaf7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ericl (Contributor) commented Oct 22, 2016

lgtm

@rxin (Contributor, Author) commented Oct 22, 2016

Thanks - I'm going to merge this. I will address the doc and naming comments in the next PR in this series.

@asfgit asfgit closed this in 3fbf5a5 Oct 22, 2016
@@ -35,7 +35,7 @@ private[parquet] class ParquetOptions(
* Compression codec to use. By default use the value specified in SQLConf.
* Acceptable values are defined in [[shortParquetCompressionCodecNames]].
*/
val compressionCodec: String = {
val compressionCodecClassName: String = {
Member:
@rxin This is a super minor but there are the same options to potentially rename in OrcOptions, JsonOptions, CSVOptions and TextFileFormat - TextFileFormat.scala#L71.

Also, I'd like to note, just in case, that the value here is actually not a class name (it's something like SNAPPY or LZO) in the case of ParquetOptions and OrcOptions, whereas the text-based ones use actual class names.

@HyukjinKwon (Member) commented Oct 22, 2016

@rxin BTW, could I ask you to include #14529 in your future related PRs (if it looks reasonable)? I will close it if you do. (Of course, it is also fine to leave it out if you are uncertain about the change.)

@rxin (Contributor, Author) commented Oct 22, 2016

@HyukjinKwon sure.

robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016

Author: Reynold Xin <rxin@databricks.com>

Closes apache#15580 from rxin/SPARK-18042.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017