[SPARK-17569] Make StructuredStreaming FileStreamSource batch generation faster #15122

brkyvz · 2016-09-16T23:05:23Z

What changes were proposed in this pull request?

While getting the batch for a FileStreamSource in StructuredStreaming, we know exactly which files we must take. We have already verified that they exist and have committed them to a metadata log. However, when creating the FileSourceRelation for an incremental execution, the code checks the existence of every single file once again!

When you have hundreds of thousands of files in a folder, creating the first batch takes 2+ hours when working with S3! This PR disables that check.
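As a rough illustration of why skipping the second check matters, here is a minimal sketch in plain Scala. It does not use Spark's classes: `CountingStore`, `BatchSketch.resolveBatch`, and the `checkFilesExist` parameter are hypothetical stand-ins for a remote store, the batch-resolution step, and the switch this PR turns off.

```scala
import java.io.FileNotFoundException

// Hypothetical stand-in for a remote store such as S3.
// It counts how many existence probes (one round trip each) are issued.
final class CountingStore(files: Set[String]) {
  var existsCalls: Int = 0

  def exists(path: String): Boolean = {
    existsCalls += 1
    files.contains(path)
  }
}

object BatchSketch {
  // Hypothetical helper: resolves the files for one incremental batch.
  // `committed` models the entries already verified and written to the metadata log.
  def resolveBatch(
      store: CountingStore,
      committed: Seq[String],
      checkFilesExist: Boolean): Seq[String] = {
    if (checkFilesExist) {
      // One probe per file: with hundreds of thousands of files this dominates batch creation.
      committed.foreach { p =>
        if (!store.exists(p)) throw new FileNotFoundException(p)
      }
    }
    committed
  }
}
```

With `checkFilesExist = false` the store is never probed, so the cost of resolving a batch no longer grows with one remote call per committed file. The trade-off is that a file deleted after being logged is only noticed later, when it is actually read.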

How was this patch tested?

Added a unit test to FileStreamSource.

brkyvz · 2016-09-16T23:06:09Z

cc @zsxwing @yhuai

yhuai · 2016-09-16T23:11:04Z

also cc @cloud-fan

SparkQA · 2016-09-17T00:59:48Z

Test build #65512 has finished for PR 15122 at commit 9b7e2de.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

petermaxlee · 2016-09-18T05:16:55Z

Can you test this by deleting the file on purpose, and see what kind of exceptions are thrown?

petermaxlee · 2016-09-18T05:52:11Z

I looked into this. I think there are two ways that you can intercept any calls to HDFS.

The first way is slightly hacky but pretty simple. FileSystem.addFileSystemForTesting is a package-private method that can be used to inject a mock file system. You can create an implementation of FilterFileSystem and pass it in for the "file" scheme. Then all accesses to the local file system will go through your implementation. Of course, you can also use a mocking library to do that, but that is not as clean, since FilterFileSystem is a public interface.

The second way is more robust and does not depend on any private APIs. Create an implementation of FilterFileSystem that points to LocalFileSystem, e.g. call it MockFileSystem. MockFileSystem.getScheme should return "mockfs". You can then use a "mockfs://" path when passing it to structured streaming. This is probably the more generic solution.

There is also the possibility of depending on the order in which FileSystem and the class loader load classes -- but I wouldn't recommend that.
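The idea behind the second option can be sketched without Hadoop on the classpath. The snippet below is only a model of the technique: `MiniFs`, `ExistsThrowsFs`, and `MiniFsRegistry` are hypothetical names, not Hadoop or Spark APIs. It shows an implementation registered under its own scheme that fails loudly if anything asks whether a file exists, while ordinary reads keep working.

```scala
import java.net.URI

// Hypothetical minimal file-system interface (a stand-in, not Hadoop's FileSystem).
trait MiniFs {
  def exists(path: URI): Boolean
  def open(path: URI): String
}

// Throws if anything probes for existence; reads still succeed.
final class ExistsThrowsFs extends MiniFs {
  def exists(path: URI): Boolean =
    throw new IllegalStateException(s"exists() should not have been called for $path")

  def open(path: URI): String = s"contents of ${path.getPath}"
}

// Hypothetical registry that routes a URI to an implementation by its scheme.
object MiniFsRegistry {
  private var impls = Map.empty[String, () => MiniFs]

  def register(scheme: String, factory: () => MiniFs): Unit =
    impls = impls + (scheme -> factory)

  // Throws NoSuchElementException if no implementation is registered for the scheme.
  def get(path: URI): MiniFs = impls(path.getScheme)()
}
```

A test built this way registers the throwing implementation under a scheme such as "mockfs", hands a path with that scheme to the code under test, and passes only if no existence check is triggered.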

yhuai · 2016-09-19T00:31:22Z

@petermaxlee I believe you will get a runtime exception saying that the file does not exist.

Also, regarding your option 2, are you suggesting that users of structured streaming use such a mock fs, or that structured streaming itself use such a fs? Also, how is LocalFS related to this case?

brkyvz · 2016-09-19T15:55:10Z

@yhuai The suggestions are for purely testing purposes, to make sure that StructuredStreaming doesn't check for file existence twice.

brkyvz · 2016-09-19T15:58:31Z

@petermaxlee Thank you for the suggestions for testing. I will try out Option 1, since Option 2 is a bit too much work for a PR as minor as this.

yhuai · 2016-09-19T17:13:27Z

ok, got it. Thanks!

brkyvz · 2016-09-19T17:37:07Z

Added test using Option 2 in the end.

SparkQA · 2016-09-19T19:36:31Z

Test build #65604 has finished for PR 15122 at commit 0cf9c08.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class ExistsThrowsExceptionFileSystem extends RawLocalFileSystem

SparkQA · 2016-09-19T19:50:12Z

Test build #65605 has finished for PR 15122 at commit 6221b37.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class FileStreamSourceSuite extends SparkFunSuite with SharedSQLContext

zsxwing · 2016-09-21T18:26:24Z

LGTM. Let's run the tests again, since several PRs about FileStreamSource got merged in the past two days.

SparkQA · 2016-09-21T20:10:15Z

Test build #3285 has finished for PR 15122 at commit 6221b37.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class FileStreamSourceSuite extends SparkFunSuite with SharedSQLContext

zsxwing · 2016-09-21T20:29:15Z

sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/FileStreamSourceSuite.scala

+        classOf[ExistsThrowsExceptionFileSystem].getName)
+      // add the metadata entries as a pre-req
+      val dir = new File(temp, "dir") // use non-existent directory to test whether log make the dir
+      val metadataLog = new HDFSMetadataLog[Array[FileEntry]](spark, dir.getAbsolutePath)


Need to use FileStreamSourceLog here to add the FileEntry, as the latest master uses it to compact the file source's logs.

brkyvz · 2016-09-21T21:57:38Z

@zsxwing Thanks for the comment. Updated!

SparkQA · 2016-09-22T00:00:06Z

Test build #65736 has finished for PR 15122 at commit 666c2c5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing · 2016-09-22T00:12:20Z

LGTM. Merging to master and ~~2.0~~. Thanks!

zsxwing · 2016-09-22T00:13:54Z

There are some conflicts with 2.0. Could you submit a patch for branch 2.0?

## What changes were proposed in this pull request?

A [PR](apache@a6aade0) was merged concurrently that made the unit test for PR apache#15122 not test anything anymore. This PR fixes the test.

## How was this patch tested?

Changed line https://github.com/apache/spark/blob/0d634875026ccf1eaf984996e9460d7673561f80/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L137 from `false` to `true` and made sure the unit test failed.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes apache#15203 from brkyvz/fix-test.

## What changes were proposed in this pull request?

This backports PR #15153 and PR #15122 to the Spark 2.0 branch for Structured Streaming. It is structured a bit differently because similar code paths already existed in the 2.0 branch. The unit test makes sure that neither behavior breaks.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #15202 from brkyvz/backports-to-streaming.

make StructuredStreaming fileSource batch generation faster

9b7e2de

Add test

0cf9c08

clean up

6221b37

brkyvz mentioned this pull request Sep 21, 2016

[SPARK-17599] Prevent ListingFileCatalog from failing if path doesn't exist #15153

Closed

zsxwing reviewed Sep 21, 2016

brkyvz added 2 commits September 21, 2016 14:56

Merge branch 'master' of github.com:apache/spark into SPARK-17569

ace2f2e

address

666c2c5

asfgit closed this in 7cbe216 Sep 22, 2016

This was referenced Sep 22, 2016

Backport SPARK-17599 and SPARK-17569 to Spark 2.0 branch #15202

Closed

[TEST][SPARK-17569] Make the unit test added for SPARK-17569 work again #15203

Closed

brkyvz deleted the SPARK-17569 branch February 3, 2019 20:54
