[Spark-21996][SQL] read files with space in name for streaming #19247

Closed · xysun wants to merge 9 commits into master from xysun/SPARK-21996

Conversation

@xysun (Contributor) commented Sep 15, 2017

What changes were proposed in this pull request?

Structured Streaming is now able to read files with a space in the file name (previously it would skip such files and output a warning).

How was this patch tested?

Added a new unit test.

@xysun (Contributor Author) commented Sep 15, 2017

To handle file names with special characters, we should use URI.getPath to get the decoded path instead of toString, which may contain encoded characters that differ from the original path.

See the Javadoc for java.net.URI.getPath.

While this change fixes the specific issue raised, I did a search and found multiple places in the Spark code where URI.toString is used. Should this be a concern?
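
For illustration, a small, hypothetical example (not from the PR) of the toString / getPath difference described above:

  import java.net.URI

  // The multi-argument URI constructor percent-encodes characters such as spaces,
  // so toString and getPath disagree for file names that contain them.
  val uri = new URI("file", null, "/data/file name.txt", null)

  println(uri.toString) // file:/data/file%20name.txt  (encoded form)
  println(uri.getPath)  // /data/file name.txt         (decoded, matches the on-disk name)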

@xysun (Contributor Author) commented Sep 19, 2017

@Joseph-Torres @brkyvz @lw-lin could you please take a look? (Sorry for the uninvited mentions; I just went by the latest commits on FileStreamSource.)

@@ -233,7 +233,7 @@ class FileStreamSource(
     }

     val files = allFiles.sortBy(_.getModificationTime)(fileSortOrder).map { status =>
-      (status.getPath.toUri.toString, status.getModificationTime)
+      (status.getPath.toUri.getPath, status.getModificationTime)
Contributor: why not status.getPath.toString?

Member: ping @xysun ^


zsxwing (Member): FYI, getPath will drop scheme and credentials.

Member: Wait, @zsxwing, do you mean getPath on org.apache.hadoop.fs.Path from org.apache.hadoop.fs.FileStatus drops the scheme and credentials?

Here it seems to be a Seq[org.apache.hadoop.fs.FileStatus], and the code you pointed out looks like an Array[org.apache.spark.sql.execution.streaming.FileStreamSource.FileEntry].

Member: Ah, getPath from java.net.URI in the proposed fix does drop them. Sure.
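
For illustration, a hypothetical URI (the credentials and bucket are made up) showing what java.net.URI.getPath keeps and what it drops:

  import java.net.URI

  // getPath returns only the decoded path component; the scheme, user-info
  // (credentials) and host are not part of it.
  val uri = new URI("s3a://ACCESS_KEY:SECRET@bucket/output/part-00000.txt")

  println(uri.toString) // s3a://ACCESS_KEY:SECRET@bucket/output/part-00000.txt
  println(uri.getPath)  // /output/part-00000.txt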

Contributor Author: Thanks. I'll update the commit.

xysun added 2 commits January 15, 2018 11:59
# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala
@xysun (Contributor Author) commented Jan 15, 2018

Hi @HyukjinKwon @zsxwing @mgaido91, I have updated the code according to the comments and also merged with the latest master. Please review. Thanks.

@HyukjinKwon (Member): ok to test

@SparkQA commented Jan 15, 2018

Test build #86130 has finished for PR 19247 at commit 2542014.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member): retest this please

@@ -408,6 +420,18 @@ class FileStreamSourceSuite extends FileStreamSourceTest {
     }
   }

+  test("SPARK-21996 read from text files -- file name has space") {
Contributor: Can we also run the same test for the other input formats, i.e. parquet, orc, ...?

Member: This test should be enough. The issue is in the file stream source.

Contributor: Yes, for this PR it is, but it would be great if we could ensure that all the data sources have the same behavior... Maybe we can do this in another PR if you think that is better.

Member: No, I meant it's not an issue of the file formats; there is no format-specific code in the file stream source. If there should be any tests for such an issue, they should live inside the file format tests.

@SparkQA commented Jan 15, 2018

Test build #86132 has finished for PR 19247 at commit 2542014.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing (Member) left a comment

Thanks for your PR. Could you also add a test that reads files generated by the file sink, such as:

  test("SPARK-21996 read from text files generated by file sink -- file name has space") {
    val testTableName = "FileStreamSourceTest"
    withTable(testTableName) {
      withTempDirs { case (src, checkpoint) =>
        val output = new File(src, "text text")
        val inputData = MemoryStream[String]
        val ds = inputData.toDS()

        val query = ds.writeStream
          .option("checkpointLocation", checkpoint.getCanonicalPath)
          .format("text")
          .start(output.getCanonicalPath)

        try {
          inputData.addData("foo")
          failAfter(streamingTimeout) {
            query.processAllAvailable()
          }
        } finally {
          query.stop()
        }

        val df2 = spark.readStream.format("text").load(output.getCanonicalPath)
        val query2 = df2.writeStream.format("memory").queryName(testTableName).start()
        try {
          query2.processAllAvailable()
          checkDatasetUnorderly(spark.table(testTableName).as[String], "foo")
        } finally {
          query2.stop()
        }
      }
    }
  }

@@ -86,6 +86,18 @@ abstract class FileStreamSourceTest
     }
   }

+  case class AddTextFileDataWithSpaceInFileName(content: String, src: File, tmp: File)
@zsxwing (Member) commented Jan 16, 2018

I would suggest adding a new parameter to AddTextFileData rather than introducing a new class, such as:

case class AddTextFileData(content: String, src: File, tmp: File, tempFilePrefix: String = "text")
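
For illustration, a rough sketch (not the exact test body from this PR) of how such a parameter could be used in the SPARK-21996 test, assuming the suite's existing helpers (withTempDirs, createFileStream, testStream, CheckAnswer) and the tmpFilePrefix name that the final commit ends up using:

  test("SPARK-21996 read from text files -- file name has space") {
    withTempDirs { case (src, tmp) =>
      val textStream = createFileStream("text", src.getCanonicalPath)
      val filtered = textStream.filter($"value" contains "keep")

      testStream(filtered)(
        // a prefix containing a space ("text text") yields a temp file whose name
        // contains a space, which the source previously skipped with a warning
        AddTextFileData("drop1\nkeep2\nkeep3", src, tmp, tmpFilePrefix = "text text"),
        CheckAnswer("keep2", "keep3")
      )
    }
  }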

Contributor Author: Sure, I will update the code and add the test for the file sink. Thanks for the review.

@xysun (Contributor Author) commented Jan 17, 2018

Hi @zsxwing, I have pushed the latest changes (for the file sink, I'll be honest, I simply copied your code =p).
I also verified that both tests fail without the fix.
Please review. Thanks.

@SparkQA commented Jan 17, 2018

Test build #86237 has finished for PR 19247 at commit 7342d6c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 17, 2018

Test build #86241 has finished for PR 19247 at commit 04c2b14.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class AddTextFileData(content: String, src: File, tmp: File, tmpFileNamePrefix: String = "text")

@SparkQA commented Jan 17, 2018

Test build #86243 has finished for PR 19247 at commit 10106b3.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class AddTextFileData(content: String, src: File, tmp: File, tmpFilePrefix: String = "text")

@xysun (Contributor Author) commented Jan 17, 2018

retest this please


OK, so SparkQA does not listen to me :/

@zsxwing (Member) commented Jan 17, 2018

retest this please

@SparkQA commented Jan 17, 2018

Test build #86288 has finished for PR 19247 at commit 10106b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class AddTextFileData(content: String, src: File, tmp: File, tmpFilePrefix: String = "text")

@zsxwing (Member) commented Jan 17, 2018

Thanks! Merging to master and 2.3.

asfgit pushed a commit that referenced this pull request Jan 18, 2018
## What changes were proposed in this pull request?

Structured Streaming is now able to read files with a space in the file name (previously it would skip such files and output a warning).

## How was this patch tested?

Added new unit test.

Author: Xiayun Sun <xiayunsun@gmail.com>

Closes #19247 from xysun/SPARK-21996.

(cherry picked from commit 0219470)
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
asfgit closed this in 0219470 on Jan 18, 2018