[SPARK-29876][SS] Delete/archive file source completed files in separate thread #26502

gaborgsomogyi · 2019-11-13T10:49:36Z

What changes were proposed in this pull request?

SPARK-20568 added the ability to clean up completed files in a streaming query. Deleting/archiving runs on the main thread, which can slow down processing. In this PR I've created a thread pool to handle file deletion/archival. The number of threads can be configured with spark.sql.streaming.fileSource.cleaner.numThreads.
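
As an illustration of the idea described above, here is a minimal sketch of handing per-file cleanup to a fixed-size pool so the caller does not block on delete or move operations. This is not the code from this PR: `AsyncCleaner`, `cleanTask`, and `AsyncCleanerDemo` are hypothetical names, and plain `java.util.concurrent` executors are assumed instead of Spark's internal thread utilities.

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}

// Hypothetical helper (not from the PR): runs `cleanTask` for each completed
// file on a pool of `numThreads` workers, so the caller never blocks on I/O.
class AsyncCleaner(numThreads: Int)(cleanTask: String => Unit) {
  require(numThreads > 0, "numThreads must be positive")
  private val pool = Executors.newFixedThreadPool(numThreads)

  // Returns immediately; the cleanup happens later on a pool thread.
  def clean(path: String): Unit = {
    pool.execute(() => cleanTask(path))
  }

  // Stops accepting work and waits for already submitted cleanups to finish.
  def stop(timeoutSeconds: Long = 10): Boolean = {
    pool.shutdown()
    pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS)
  }
}

object AsyncCleanerDemo {
  // Submits three made-up paths and reports how many were cleaned.
  def run(): Int = {
    val cleaned = new ConcurrentLinkedQueue[String]()
    val cleaner = new AsyncCleaner(numThreads = 2)(path => { cleaned.add(path); () })
    Seq("/a/b/part-0", "/a/b/part-1", "/a/b/part-2").foreach(cleaner.clean)
    cleaner.stop()
    cleaned.size()
  }
}
```

The pool size plays the role of the numThreads setting: a larger pool lets several slow moves overlap, at the cost of more concurrent file system calls.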

Why are the changes needed?

To perform file deletion/archival in a separate thread.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing unit tests.

gaborgsomogyi · 2019-11-13T13:13:59Z

The reason I made the number of threads configurable is the archiving part. On some file systems (such as S3) a move can be implemented as a copy, which is time-consuming, so the cleaner may not be able to keep up with the streaming query.

SparkQA · 2019-11-13T15:04:20Z

Test build #113697 has finished for PR 26502 at commit 7f14b55.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gaborgsomogyi · 2019-11-13T15:10:35Z

cc @vanzin @HeartSaVioR since you know SPARK-20568 well.

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala

HeartSaVioR · 2019-11-13T23:19:32Z

docs/structured-streaming-programming-guide.md

@@ -550,7 +550,8 @@ Here are the details of all the sources in Spark.
        Available options are "archive", "delete", "off". If the option is not provided, the default value is "off".<br/>
        When "archive" is provided, additional option <code>sourceArchiveDir</code> must be provided as well. The value of "sourceArchiveDir" must have 2 subdirectories (so depth of directory is greater than 2). e.g. <code>/archived/here</code>. This will ensure archived files are never included as new source files.<br/>
        Spark will move source files respecting their own path. For example, if the path of source file is <code>/a/b/dataset.txt</code> and the path of archive directory is <code>/archived/here</code>, file will be moved to <code>/archived/here/a/b/dataset.txt</code>.<br/>
-        NOTE: Both archiving (via moving) or deleting completed files will introduce overhead (slow down) in each micro-batch, so you need to understand the cost for each operation in your file system before enabling this option. On the other hand, enabling this option will reduce the cost to list source files which can be an expensive operation.<br/>
+        NOTE: Both archiving (via moving) or deleting completed files will introduce overhead (slow down, even if it's happening in separate thread) in each micro-batch, so you need to understand the cost for each operation in your file system before enabling this option. On the other hand, enabling this option will reduce the cost to list source files which can be an expensive operation.<br/>
+        Number of threads used in completed file cleaner can be configured with<code>spark.sql.streaming.fileSource.cleaner.numThreads</code>.<br/>
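
The path rule stated in the documentation text above can be sketched as a one-line computation. `ArchivePath.archiveTarget` is a hypothetical helper written for illustration; it works on plain absolute path strings and deliberately ignores URI schemes and authorities, which a real file system path would need to handle.

```scala
object ArchivePath {
  // Hypothetical helper: the archived file keeps its full original path
  // underneath the archive directory.
  def archiveTarget(archiveDir: String, sourceFile: String): String = {
    require(sourceFile.startsWith("/"), "expects an absolute source path")
    archiveDir.stripSuffix("/") + sourceFile
  }
}
```

With the values from the documentation, `ArchivePath.archiveTarget("/archived/here", "/a/b/dataset.txt")` yields `/archived/here/a/b/dataset.txt`.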


We configure cleanSource in FileStreamSourceOption; it should be available in the same place.

gaborgsomogyi · Nov 14, 2019

I think it's an implementation detail that should not necessarily appear at the source option level. I would add it there if per-source configuration were required (but then maybe --conf can be used).

HeartSaVioR · Nov 15, 2019

I see the benefit of placing it in the configuration, but it still doesn't feel intuitive to configure one thing here and another over there. Let's hear others' voices.
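
To make the two placements under discussion concrete, the sketch below shows how a user would end up setting them if the thread count stays a SQL configuration: cleanSource and sourceArchiveDir go on the reader as source options, while the thread count is set on the session. The option and configuration names come from this PR and the documentation; the application name, input path `/a/b`, text format, and console sink are assumptions chosen only for illustration, and the example needs a Spark dependency to run.

```scala
import org.apache.spark.sql.SparkSession

object CleanSourceExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("clean-source-example") // assumed name
      .master("local[2]")
      // Session-wide SQL configuration (name taken from this PR).
      .config("spark.sql.streaming.fileSource.cleaner.numThreads", "2")
      .getOrCreate()

    val lines = spark.readStream
      .format("text")
      // Per-source options, set on the reader itself.
      .option("cleanSource", "archive")
      .option("sourceArchiveDir", "/archived/here")
      .load("/a/b") // assumed input directory

    // Console sink used only to have a running query in the sketch.
    val query = lines.writeStream.format("console").start()
    query.awaitTermination()
  }
}
```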

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

SparkQA · 2019-11-20T16:23:33Z

Test build #114165 has finished for PR 26502 at commit b0b9714.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

gaborgsomogyi · 2019-11-20T17:01:45Z

retest this please

SparkQA · 2019-11-20T20:49:55Z

Test build #114173 has finished for PR 26502 at commit b0b9714.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR

I'd still like to hear others' voices for "source option" vs "SQL configuration", but other parts look great.

gaborgsomogyi · 2019-11-21T15:27:02Z

"source option" vs "SparkConf" is better I think.

gaborgsomogyi · 2019-11-25T13:34:58Z

@zsxwing this is an addition to the source file archival/delete feature which is worth attention.

vanzin

Don't have an opinion about the configuration. Other than maybe it should be by default just 1 thread.

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala

gaborgsomogyi · 2019-12-03T12:48:26Z

> Other than maybe it should be by default just 1 thread.

Set it to 1 + moved the clean functionality into the base class.

SparkQA · 2019-12-03T17:15:35Z

Test build #114777 has finished for PR 26502 at commit 19b9c92.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala

docs/structured-streaming-programming-guide.md

vanzin · 2019-12-04T00:40:52Z

Do the tests need any adjustment given that the cleaning is asynchronous now? (Maybe to disable the cleaning thread.)

HeartSaVioR · 2019-12-04T01:28:54Z

> Do the tests need any adjustment given that the cleaning is asynchronous now? (Maybe to disable the cleaning thread.)

How about modifying clean to call cleanTask synchronously if it's being called from UT? (Would checking IS_TESTING in SparkConf work like what we do with FsHistoryProvider?) Given we have some UTs which don't directly call cleaner.clean, we still have to run cleanup in UTs.

Alternatively, we can check the files with eventually, but it would add some latency to verification.
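
The latency cost of the polling alternative can be seen in a small sketch. `Polling.waitUntil` is a hypothetical stand-in for a retry helper such as ScalaTest's `eventually`; it is written out by hand here so the example has no test framework dependency.

```scala
object Polling {
  // Hypothetical helper: re-evaluates `condition` every `intervalMs`
  // milliseconds until it holds or `timeoutMs` milliseconds have elapsed.
  def waitUntil(timeoutMs: Long, intervalMs: Long = 10)(condition: => Boolean): Boolean = {
    val deadline = System.nanoTime() + timeoutMs * 1000000L
    var ok = condition
    while (!ok && System.nanoTime() < deadline) {
      Thread.sleep(intervalMs)
      ok = condition
    }
    ok
  }
}
```

A test using this approach would wrap its file existence checks in `Polling.waitUntil(...)`. Every check that is not yet true costs at least one polling interval, and a genuinely failing check costs the full timeout, which is the verification latency mentioned above.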

gaborgsomogyi · 2020-01-08T12:37:37Z

Started to catch up, which will take some time; going to continue...

gaborgsomogyi · 2020-01-14T16:15:45Z

Yeah, the thread must be turned off to avoid flakiness. IS_TESTING is something that I would like to avoid unless there is no other way. eventually has the drawback of additional latency, as you've mentioned @HeartSaVioR. I think the easiest and simplest is to set the config to 0. I'm going to add this change in the last commit.
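
One way to read "set the config to 0" is that a thread count of zero means no pool is created and cleanup runs on the calling thread, which keeps tests deterministic without an IS_TESTING check. The sketch below shows that reading under stated assumptions: `ConfigurableCleaner` and `cleanTask` are hypothetical names and this is not the code merged in the PR.

```scala
import java.util.concurrent.{ExecutorService, Executors, TimeUnit}

// Hypothetical cleaner: with numThreads == 0 no pool exists and cleanup runs
// synchronously on the caller; with numThreads > 0 it is handed to a pool.
class ConfigurableCleaner(numThreads: Int)(cleanTask: String => Unit) {
  require(numThreads >= 0, "numThreads must not be negative")

  private val pool: Option[ExecutorService] =
    if (numThreads > 0) Some(Executors.newFixedThreadPool(numThreads)) else None

  def clean(path: String): Unit = pool match {
    case Some(p) => p.execute(() => cleanTask(path)) // asynchronous path
    case None => cleanTask(path)                     // synchronous path
  }

  // Waits for pending asynchronous cleanups; a no-op when there is no pool.
  def stop(): Unit = pool.foreach { p =>
    p.shutdown()
    p.awaitTermination(10, TimeUnit.SECONDS)
  }
}
```

With `new ConfigurableCleaner(0)(...)`, a test can assert on the file system state immediately after `clean` returns, with no polling and no added latency.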

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala

HeartSaVioR

LGTM except the nit commented.

SparkQA · 2020-01-14T19:58:39Z

Test build #116716 has finished for PR 26502 at commit 74431de.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-01-14T20:50:17Z

Test build #116718 has finished for PR 26502 at commit b6af107.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-01-15T13:42:01Z

Test build #116776 has finished for PR 26502 at commit e3cb6e0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala

SparkQA · 2020-01-16T16:32:03Z

Test build #116853 has finished for PR 26502 at commit 3726b08.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2020-01-17T18:45:17Z

Merging to master.

Commit 7f14b55: [SPARK-29876][SS] Delete/archive file source completed files in separate thread

dongjoon-hyun added the STRUCTURED STREAMING label Nov 13, 2019

HeartSaVioR reviewed Nov 13, 2019

Commit b0b9714: Review fix

HeartSaVioR reviewed Nov 21, 2019

vanzin reviewed Nov 27, 2019

Commit 19b9c92: Move clean to base class + numThreads default set to 1

vanzin reviewed Dec 4, 2019

Commit 74431de: Merge branch 'master' into SPARK-29876

Commit b6af107: Fixes: * Add doc default * Add protected * Turn off threads in tests

vanzin reviewed Jan 14, 2020

HeartSaVioR approved these changes Jan 14, 2020

Commit e3cb6e0: Fix

vanzin reviewed Jan 15, 2020

Commit 3726b08: Fix

vanzin closed this in abf759a Jan 17, 2020
