[SPARK-18042][SQL] OutputWriter should expose file path written #15580
Conversation
and @ericl
      new SerializableConfiguration(conf)
    }

    /**
     * Returns a [[OutputWriter]] that writes data to the given path without using
     * [[OutputCommitter]].
     */
-   override def newWriter(path: String): OutputWriter = new OutputWriter {
+   override def newWriter(path1: String): OutputWriter = new OutputWriter {
how about _path? path1 looks weird...
or maybe we can create a class for this OutputWriter here.
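For illustration, here is a minimal, self-contained sketch of that named-class idea, using a simplified stand-in for Spark's OutputWriter; the trait and class names below (SimpleOutputWriter, SingleFileWriter) are hypothetical and not part of the actual patch.

```scala
import java.io.{File, PrintWriter}

// Simplified stand-in for Spark's OutputWriter, only for this sketch.
trait SimpleOutputWriter {
  def path: String                 // full path of the file this writer produces
  def write(record: String): Unit  // simplified; real writers take InternalRow
  def close(): Unit
}

// A named class instead of an anonymous `new OutputWriter { ... }`: the constructor
// parameter is the exposed `path`, so no shadowed `path1`-style name is needed.
class SingleFileWriter(override val path: String) extends SimpleOutputWriter {
  private val out = new PrintWriter(new File(path))
  override def write(record: String): Unit = out.println(record)
  override def close(): Unit = out.close()
}
```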
      new Path(stagingDir, fileNamePrefix + extension)
    }
    new ParquetOutputFormat[InternalRow]() {
      override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
now we never use the context and extension parameters?
yes
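As a hedged illustration of why those parameters go unused: the target file is fixed up front from the staging directory and file name prefix, so the override can ignore its arguments. The sketch below uses Hadoop's TextOutputFormat only so it compiles without Parquet on the classpath, and the stagingDir, fileNamePrefix, and extension values are made-up placeholders.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.TaskAttemptContext
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

object FixedWorkFileSketch {
  // In the real diff these come from the surrounding writer-factory scope.
  val stagingDir = new Path("/tmp/staging")   // hypothetical
  val fileNamePrefix = "part-00000-uuid"      // hypothetical
  val extension = ".parquet"                  // hypothetical

  val format = new TextOutputFormat[NullWritable, Text]() {
    // The work file is precomputed, so `context` and `ext` are intentionally unused,
    // which is exactly what the review comment above points out.
    override def getDefaultWorkFile(context: TaskAttemptContext, ext: String): Path =
      new Path(stagingDir, fileNamePrefix + extension)
  }
}
```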
I will add some documentation in my next pull request.
Test build #67330 has finished for PR 15580 at commit
cc @HyukjinKwon
   * The path of the file to be written out. This path should include the staging directory and
   * the file name prefix passed into the associated createOutputWriter function.
   */
  def path: String
fullOutputPath?
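To make the contract above concrete, here is a minimal, self-contained sketch of a writer factory in the spirit of the createOutputWriter function mentioned in the doc comment; the names (DemoOutputWriter, createDemoWriter) and the fixed ".txt" extension are illustrative assumptions, not Spark's actual API.

```scala
import java.io.{File, PrintWriter}

object PathContractSketch {
  // Simplified stand-in for the OutputWriter contract described above.
  trait DemoOutputWriter {
    /** Full output path: staging directory + file name prefix + format-chosen extension. */
    def path: String
    def write(record: String): Unit
    def close(): Unit
  }

  // Spark supplies the staging directory and prefix; the format supplies the extension.
  def createDemoWriter(stagingDir: String, fileNamePrefix: String): DemoOutputWriter =
    new DemoOutputWriter {
      override val path: String = s"$stagingDir/$fileNamePrefix.txt"
      private val out = new PrintWriter(new File(path))
      override def write(record: String): Unit = out.println(record)
      override def close(): Unit = out.close()
    }
}
```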
Test build #67344 has finished for PR 15580 at commit
lgtm
Thanks - I'm going to merge this. I will address the doc and naming comments in the next PR in this series.
@@ -35,7 +35,7 @@ private[parquet] class ParquetOptions(
   * Compression codec to use. By default use the value specified in SQLConf.
   * Acceptable values are defined in [[shortParquetCompressionCodecNames]].
   */
-  val compressionCodec: String = {
+  val compressionCodecClassName: String = {
@rxin This is super minor, but there are the same options to potentially rename in OrcOptions, JsonOptions, CSVOptions and TextFileFormat (TextFileFormat.scala#L71). Also, I'd like to note, just in case, that the value here is actually not a class name (it's something like SNAPPY or LZO) in the case of ParquetOptions and OrcOptions, whereas the text-based ones are actual class names.
@HyukjinKwon sure.
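A minimal sketch of the naming mismatch described above, assuming illustrative maps rather than Spark's actual tables: Parquet and ORC options resolve to short codec tokens such as SNAPPY, while the text-based formats resolve to fully qualified codec class names.

```scala
object CompressionOptionSketch {
  // Parquet/ORC style: the resolved value is a short token, not a class name.
  val shortParquetCompressionCodecNames: Map[String, String] = Map(
    "uncompressed" -> "UNCOMPRESSED",
    "snappy"       -> "SNAPPY",
    "gzip"         -> "GZIP",
    "lzo"          -> "LZO")

  // Text-based style (CSV, JSON, text): the resolved value is a codec class name.
  val textCompressionCodecClassNames: Map[String, String] = Map(
    "gzip"  -> "org.apache.hadoop.io.compress.GzipCodec",
    "bzip2" -> "org.apache.hadoop.io.compress.BZip2Codec")

  def resolveParquetCodec(option: String): String =
    shortParquetCompressionCodecNames.getOrElse(
      option.toLowerCase,
      sys.error(s"Unknown compression codec: $option"))
}
```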
Author: Reynold Xin <rxin@databricks.com>

Closes apache#15580 from rxin/SPARK-18042.
What changes were proposed in this pull request?
This patch adds a new "path" method on OutputWriter that returns the path of the file written by the OutputWriter. This is part of the necessary work to consolidate structured streaming and batch write paths.
The batch write path has a nice feature: each data source can define the extension of its files, while Spark specifies the staging directory and the file name prefix. However, in the streaming path we need to collect the list of files written, and there is currently no interface to do that.
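As a rough, hedged sketch of how a streaming write task could use the new path accessor to report which files it produced (the trait and helper names here are simplified stand-ins, not Spark's FileStreamSink code):

```scala
import scala.collection.mutable.ArrayBuffer

object CollectWrittenFilesSketch {
  // Simplified stand-in for an OutputWriter that exposes the file it writes.
  trait WriterWithPath {
    def path: String
    def write(record: String): Unit
    def close(): Unit
  }

  /** Writes all records and returns the paths of the files that were produced. */
  def writeAndCollect(records: Iterator[String],
                      newWriter: () => WriterWithPath): Seq[String] = {
    val writtenFiles = ArrayBuffer.empty[String]
    val writer = newWriter()
    try {
      records.foreach(writer.write)
    } finally {
      writer.close()
    }
    // The new `path` accessor is what makes this bookkeeping possible.
    writtenFiles += writer.path
    writtenFiles.toSeq
  }
}
```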
How was this patch tested?
N/A - there is no behavior change and this should be covered by existing tests.