[SPARK-18679] [SQL] Fix regression in file listing performance for non-catalog tables by ericl · Pull Request #16112 · apache/spark

ericl · 2016-12-02T02:44:25Z

What changes were proposed in this pull request?

In Spark 2.1, `ListingFileCatalog` was significantly refactored (and renamed to `InMemoryFileIndex`). This introduced a regression where parallelism could only be introduced at the very top of the tree. However, in many cases (e.g. `spark.read.parquet(topLevelDir)`), the top of the tree is only a single directory, so the listing never gets parallelized.

This PR simplifies and fixes the parallel recursive listing code so that parallelism can be introduced at any level during recursive descent. Note that once we decide to list a sub-tree in parallel, that sub-tree is listed serially on the executors.
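
For readers who want to see the shape of this strategy, here is a minimal, self-contained sketch. It is not the code in this PR: plain `java.io.File` and Scala `Future`s stand in for the Hadoop file system and a Spark job, and the names `Lister`, `threshold`, and `fanOuts` are hypothetical. The idea it tries to illustrate is that the "go parallel?" decision is re-evaluated at every level of the descent, and that a sub-tree handed to a parallel task is then walked serially.

```scala
import java.io.File
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ParallelListingSketch {
  // `threshold` is a hypothetical stand-in for the parallel partition discovery threshold.
  class Lister(threshold: Int)(implicit ec: ExecutionContext) {
    // Counts how many times a parallel fan-out was started (a stand-in for a listing metric).
    val fanOuts = new AtomicInteger(0)

    // Immediate children of a directory; listFiles() returns null on I/O errors.
    private def children(dir: File): Seq[File] =
      Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)

    // Lists many directories. If there are at least `threshold` of them and we are
    // still on the "driver" side, each directory becomes one parallel task.
    private def bulkList(dirs: Seq[File], allowParallel: Boolean): Seq[File] =
      if (allowParallel && dirs.size >= threshold) {
        fanOuts.incrementAndGet()
        // Inside a task the sub-tree is walked serially (allowParallel = false).
        val tasks = dirs.map(d => Future(listLeafFiles(d, allowParallel = false)))
        Await.result(Future.sequence(tasks), Duration.Inf).flatten
      } else {
        // Below the threshold: keep descending serially, but the decision is
        // re-evaluated at the next level because allowParallel is passed through.
        dirs.flatMap(d => listLeafFiles(d, allowParallel))
      }

    // Returns all leaf files under `dir`.
    def listLeafFiles(dir: File, allowParallel: Boolean = true): Seq[File] = {
      val (subDirs, files) = children(dir).partition(_.isDirectory)
      files ++ bulkList(subDirs, allowParallel)
    }
  }
}
```

With an implicit `ExecutionContext` in scope, `new ParallelListingSketch.Lister(32).listLeafFiles(root)` lists a tree whose root holds a single directory with many partition directories beneath it; the fan-out then happens one level down rather than not at all.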

cc @mallman @cloud-fan

How was this patch tested?

Checked metrics in unit tests.

cloud-fan · 2016-12-02T03:59:05Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala

  }

+  test("PartitioningAwareFileIndex listing parallelized with many top level dirs") {
+    for ((scale, expectedNumPar) <- Seq((10, 0), (50, 1))) {


Shall we do `withSQLConf(SQLConf.PARALLEL_PARTITION_DISCOVERY_THRESHOLD -> "xxx") { test code }` to make the test more robust?

cloud-fan · 2016-12-02T04:02:57Z

LGTM. @ericl, have you run a local benchmark to make sure the performance regression is fixed?

ericl · 2016-12-02T05:10:13Z

Yep


SparkQA · 2016-12-02T06:10:40Z

Test build #69531 has finished for PR 16112 at commit db66439.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

…-catalog tables

## What changes were proposed in this pull request?

In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed to InMemoryFileIndex). This introduced a regression where parallelism could only be introduced at the very top of the tree. However, in many cases (e.g. `spark.read.parquet(topLevelDir)`), the top of the tree is only a single directory.

This PR simplifies and fixes the parallel recursive listing code to allow parallelism to be introduced at any level during recursive descent (though note that once we decide to list a sub-tree in parallel, the sub-tree is listed in serial on executors).

cc mallman cloud-fan

## How was this patch tested?

Checked metrics in unit tests.

Author: Eric Liang <ekl@databricks.com>

Closes #16112 from ericl/spark-18679.

(cherry picked from commit 294163e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

cloud-fan · 2016-12-02T13:01:47Z

thanks, merging to master/2.1!


ericl added 2 commits December 1, 2016 18:35

Thu Dec 1 18:35:11 PST 2016

3102aa3

Thu Dec 1 18:37:23 PST 2016

db66439

ericl changed the title ~~[SPARK-18769] [SQL] Fix regression in file listing performance for non-catalog tables~~ [SPARK-18679] [SQL] Fix regression in file listing performance for non-catalog tables Dec 2, 2016

cloud-fan reviewed Dec 2, 2016


asfgit closed this in 294163e Dec 2, 2016

ericl wants to merge 2 commits into apache:master from ericl:spark-18679
