[SPARK-17661][SQL] Consolidate various listLeafFiles implementations #15235
Conversation
@yhuai I think you wrote most of this. Can you take a look?
// well with `SerializableWritable`. So there seems to be no way to serialize a `FileStatus`.
// Here we use `SerializableFileStatus` to extract key components of a `FileStatus`, serialize
// it on the executor side, and reconstruct it on the driver side.
private case class SerializableBlockLocation(
I renamed this from "Fake" to "Serializable" to more accurately describe its purpose.
+1
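For reference, a minimal sketch of what these two helper classes look like (field names assumed; the actual PR may differ in detail):

```scala
// Plain case classes are Java-serializable, unlike Hadoop's FileStatus,
// so executors can ship these back to the driver, which then rebuilds
// real FileStatus/BlockLocation objects from them.
private case class SerializableBlockLocation(
    names: Array[String],
    hosts: Array[String],
    offset: Long,
    length: Long)

private case class SerializableFileStatus(
    path: String,
    length: Long,
    isDir: Boolean,
    blockReplication: Short,
    blockSize: Long,
    modificationTime: Long,
    accessTime: Long,
    blockLocations: Array[SerializableBlockLocation])
```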
.parallelize(serializedPaths, numParallelism)
.mapPartitions { paths =>
  val hadoopConf = serializableConfiguration.value
  listLeafFilesInSerial(paths.map(new Path(_)).toSeq, hadoopConf).iterator
This function is very similar to the old listLeafFilesInParallel, except that I replaced the body of this mapPartitions call with listLeafFilesInSerial.
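Roughly, the parallel path now just distributes the single serial implementation. A sketch, assuming `listLeafFilesInSerial` from this PR, a hypothetical `toSerializableFileStatus` conversion helper, and Spark's internal `SerializableConfiguration` wrapper:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext
import org.apache.spark.util.SerializableConfiguration

def listLeafFilesInParallel(
    paths: Seq[Path],
    hadoopConf: Configuration,
    sc: SparkContext,
    numParallelism: Int): Seq[SerializableFileStatus] = {
  // Ship the Hadoop conf once per task rather than once per path.
  val serializableConfiguration = new SerializableConfiguration(hadoopConf)
  val serializedPaths = paths.map(_.toString)
  sc.parallelize(serializedPaths, numParallelism)
    .mapPartitions { paths =>
      val hadoopConf = serializableConfiguration.value
      // The same serial implementation used on the driver for small path sets.
      listLeafFilesInSerial(paths.map(new Path(_)).toSeq, hadoopConf).iterator
    }
    // FileStatus is not serializable, so convert before collecting.
    .map(status => toSerializableFileStatus(status))
    .collect()
}
```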
/**
 * List a single path, provided as a FileStatus, in serial.
 */
private def listLeafFiles0(
This is almost the same as the old HadoopFsRelation.listLeafFiles. The old code was:
def listLeafFiles(fs: FileSystem, status: FileStatus, filter: PathFilter): Array[FileStatus] = {
  logTrace(s"Listing ${status.getPath}")
  val name = status.getPath.getName.toLowerCase
  if (shouldFilterOut(name)) {
    Array.empty[FileStatus]
  } else {
    val statuses = {
      val (dirs, files) = fs.listStatus(status.getPath).partition(_.isDirectory)
      val stats = files ++ dirs.flatMap(dir => listLeafFiles(fs, dir, filter))
      if (filter != null) stats.filter(f => filter.accept(f.getPath)) else stats
    }
    // statuses do not have any dirs.
    statuses.filterNot(status => shouldFilterOut(status.getPath.getName)).map {
      case f: LocatedFileStatus => f

      // NOTE:
      //
      // - Although the S3/S3A/S3N file systems can be quite slow for remote file metadata
      //   operations, calling `getFileBlockLocations` does no harm here since these file system
      //   implementations don't actually issue RPCs for this method.
      //
      // - Here we are calling `getFileBlockLocations` in a sequential manner, but it should not
      //   be a big deal since we always use `listLeafFilesInParallel` when the number of
      //   paths exceeds the threshold.
      case f => createLocatedFileStatus(f, fs.getFileBlockLocations(f, 0, f.getLen))
    }
  }
}

def createLocatedFileStatus(f: FileStatus, locations: Array[BlockLocation]): LocatedFileStatus = {
  // The other constructor of LocatedFileStatus will call FileStatus.getPermission(), which is
  // very slow on some file systems (e.g. RawLocalFileSystem, which launches a subprocess and
  // parses the stdout).
  val lfs = new LocatedFileStatus(f.getLen, f.isDirectory, f.getReplication, f.getBlockSize,
    f.getModificationTime, 0, null, null, null, null, f.getPath, locations)
  if (f.isSymlink) {
    lfs.setSymlink(f.getSymlink)
  }
  lfs
}
override def hashCode(): Int = paths.toSet.hashCode()

/** Checks if we should filter out this path name. */
def shouldFilterOut(pathName: String): Boolean = {
this is identical to the old code in HadoopFsRelation.shouldFilterOut
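For context, the predicate is tiny; it looks roughly like this in the Spark source of that era:

```scala
/** Checks if we should filter out this path name. */
def shouldFilterOut(pathName: String): Boolean = {
  // Hide names starting with "_" or ".", except Parquet's summary metadata
  // files, which downstream Parquet reading still needs to see.
  (pathName.startsWith("_") || pathName.startsWith(".")) &&
    !pathName.startsWith("_common_metadata") && !pathName.startsWith("_metadata")
}
```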
class ListingFileCatalogSuite extends SparkFunSuite {

  test("file filtering") {
this was moved from HadoopFsRelationSuite without any change.
you may add this test to FileCatalogSuite if you like
That's counterintuitive. The code is defined in ListingFileCatalog, and should be tested in ListingFileCatalogSuite.
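For reference, the moved test exercises the predicate directly; it reads along these lines:

```scala
import org.apache.spark.SparkFunSuite

class ListingFileCatalogSuite extends SparkFunSuite {
  test("file filtering") {
    assert(!ListingFileCatalog.shouldFilterOut("abcd"))
    assert(ListingFileCatalog.shouldFilterOut(".ab"))
    assert(ListingFileCatalog.shouldFilterOut("_cd"))
    assert(!ListingFileCatalog.shouldFilterOut("_metadata"))
    assert(!ListingFileCatalog.shouldFilterOut("_common_metadata"))
    assert(ListingFileCatalog.shouldFilterOut("_ab_metadata"))
    assert(ListingFileCatalog.shouldFilterOut("_cd_common_metadata"))
  }
}
```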
Test build #65876 has finished for PR 15235 at commit
Test build #65888 has finished for PR 15235 at commit
My main comment is that the try-catch block for SPARK-17599 is in the wrong place.
object ListingFileCatalog extends Logging {
// `FileStatus` is Writable but not serializable. What makes it worse, somehow it doesn't play
existing, but good to fix: the comment belongs on the class below, not SerializableBlockLocation
Moved
val fs = path.getFileSystem(hadoopConf)

// [SPARK-17599] Prevent ListingFileCatalog from failing if path doesn't exist
val status: Option[FileStatus] = try Option(fs.getFileStatus(path)) catch {
I don't think you need this. You are increasing the number of getStatus calls that we need to make. There is no guarantee that the folder will exist once listLeafFiles0 is called.
Let me take a look at this. This is actually consistent with the old code (for the parallel version). It is actually slightly tricky to remove this.
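For context, the guard under discussion looks roughly like this (the exact log message is an assumption; `fs`, `path`, and `logWarning` come from the enclosing class):

```scala
import java.io.FileNotFoundException
import org.apache.hadoop.fs.FileStatus

// [SPARK-17599] A path that doesn't exist (or vanished mid-listing) yields
// an empty result instead of failing the whole catalog refresh.
val status: Option[FileStatus] =
  try Option(fs.getFileStatus(path)) catch {
    case _: FileNotFoundException =>
      logWarning(s"The directory $path was not found. Was it deleted very recently?")
      None
  }
```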
.mapPartitions { paths =>
  val hadoopConf = serializableConfiguration.value
  listLeafFilesInSerial(paths.map(new Path(_)).toSeq, hadoopConf).iterator
}.map { status =>
why don't you just call map on the iterator instead of calling it on the RDD?
This was pre-existing code.
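For what it's worth, the suggestion amounts to fusing the conversion into the same partition pass, sketched below with the hypothetical `toSerializableFileStatus` helper again:

```scala
.mapPartitions { paths =>
  val hadoopConf = serializableConfiguration.value
  listLeafFilesInSerial(paths.map(new Path(_)).toSeq, hadoopConf)
    .iterator
    .map(toSerializableFileStatus) // map the iterator directly; no separate rdd.map
}
```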
.parallelize(serializedPaths, numParallelism)
.mapPartitions { paths =>
  val hadoopConf = serializableConfiguration.value
  listLeafFilesInSerial(paths.map(new Path(_)).toSeq, hadoopConf).iterator
you shouldn't just call listLeafFilesInSerial here. It's more likely that one level down you're going to have a bunch more directories that you may want to list, where you want more parallelization. You should iteratively list subdirectories in parallel.
This was following the old behavior.
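For illustration only (this PR keeps the old behavior), an iterative level-by-level scheme might look like this sketch; `listInParallel` is a placeholder for a distributed listing of one directory level:

```scala
import scala.annotation.tailrec
import org.apache.hadoop.fs.{FileStatus, Path}

// Placeholder: list one level of directories using a distributed job.
def listInParallel(dirs: Seq[Path]): Seq[FileStatus] = ???

// Breadth-first: list the current frontier in parallel, queue up any
// subdirectories found, and repeat until no directories remain.
@tailrec
def listIteratively(dirs: Seq[Path], acc: Seq[FileStatus]): Seq[FileStatus] = {
  if (dirs.isEmpty) {
    acc
  } else {
    val statuses = listInParallel(dirs)
    val (subDirs, files) = statuses.partition(_.isDirectory)
    listIteratively(subDirs.map(_.getPath), acc ++ files)
  }
}
```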
} else {
  val statuses = {
    val (dirs, files) = fs.listStatus(status.getPath).partition(_.isDirectory)
    val stats = files ++ dirs.flatMap(dir => listLeafFiles0(fs, dir, filter))
I think the directories should be submittable as a parallel job if we were told that we should parallelize file listing.
I see the old code path was just like this; maybe it's not necessary.
Yea I don't see why we want to change all of these in this pull request, unless they are a problem.
  Seq.empty[FileStatus]
} else {
  val statuses = {
    val (dirs, files) = fs.listStatus(status.getPath).partition(_.isDirectory)
this is actually where you need the try-catch block to see if the file exists or not.
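i.e., moving the guard onto the listing call itself, roughly:

```scala
import java.io.FileNotFoundException
import org.apache.hadoop.fs.FileStatus

// The directory can disappear between getFileStatus and listStatus, so the
// listing call is the one that needs protection.
val childStatuses: Array[FileStatus] =
  try fs.listStatus(status.getPath) catch {
    case _: FileNotFoundException => Array.empty[FileStatus]
  }
val (dirs, files) = childStatuses.partition(_.isDirectory)
```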
I pushed a new version that should address all the outstanding issues.
Test build #65936 has finished for PR 15235 at commit
@brkyvz does this look good?
@petermaxlee Thanks for making the change. This LGTM now that the fix for SPARK-17599 is in the right place.
Merging in master. Thanks! |
## What changes were proposed in this pull request?

There are 4 listLeafFiles-related functions in Spark:

- ListingFileCatalog.listLeafFiles (which calls HadoopFsRelation.listLeafFilesInParallel if the number of paths passed in is greater than a threshold; if it is lower, then it has its own serial version implemented)
- HadoopFsRelation.listLeafFiles (called only by HadoopFsRelation.listLeafFilesInParallel)
- HadoopFsRelation.listLeafFilesInParallel (called only by ListingFileCatalog.listLeafFiles)

It is actually very confusing and error prone because there are effectively two distinct implementations for the serial version of listing leaf files. As an example, SPARK-17599 updated only one of the code paths and ignored the other one.

This code can be improved by:

- Moving all file listing code into ListingFileCatalog, since it is the only class that needs this.
- Keeping only one function for listing files in serial.

## How was this patch tested?

This change should be covered by existing unit and integration tests. I also moved a test case for HadoopFsRelation.shouldFilterOut from HadoopFsRelationSuite to ListingFileCatalogSuite.
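Put together, the consolidated shape is roughly the following sketch (the config accessor and session-state names are assumptions based on the Spark source of that era; `listLeafFilesInParallel` and `listLeafFilesInSerial` are the two functions this PR keeps):

```scala
import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.spark.sql.SparkSession

// Single entry point: one serial implementation, with the parallel path
// merely distributing that same implementation across executors.
def listLeafFiles(paths: Seq[Path], sparkSession: SparkSession): Seq[FileStatus] = {
  val hadoopConf = sparkSession.sessionState.newHadoopConf()
  val threshold = sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold
  if (paths.length >= threshold) {
    // Distributed listing; FileStatus objects are rebuilt on the driver
    // from their serializable form.
    listLeafFilesInParallel(paths, hadoopConf, sparkSession)
  } else {
    listLeafFilesInSerial(paths, hadoopConf)
  }
}
```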