[SPARK-26161][SQL] Ignore empty files in load #23130
Conversation
@cloud-fan @HyukjinKwon Please take a look at the PR.
Test build #99227 has finished for PR 23130 at commit
sql/core/src/test/scala/org/apache/spark/sql/sources/SaveLoadSuite.scala
Test build #99239 has finished for PR 23130 at commit
The code change LGTM. There is a mistake in the PR description: we updated …
@@ -388,7 +388,7 @@ case class FileSourceScanExec(
     logInfo(s"Planning with ${bucketSpec.numBuckets} buckets")
     val filesGroupedToBuckets =
       selectedPartitions.flatMap { p =>
-        p.files.map { f =>
+        p.files.filter(_.getLen > 0).map { f =>
do the filtering inside the map?
Do we have a test case for this line?
do you mean changing filter...map... to flatMap? I don't have a strong preference about it.
The updated test cases and the new test case are for this change.
I personally prefer filter + map as it's shorter and clearer. I don't know if one is faster; two transformations vs having to return Some/None. For a Dataset operation I'd favor one operation, but this is just local Scala code.
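For illustration only, here is a minimal local-Scala sketch of the two styles being compared; FileStatusLike is a made-up stand-in for Hadoop's FileStatus and is not part of the PR:

```scala
// Two equivalent ways to drop empty files, mirroring the review discussion.
case class FileStatusLike(path: String, len: Long)

object FilterVsFlatMap {
  val files = Seq(FileStatusLike("empty", 0L), FileStatusLike("notEmpty", 1L))

  // Style 1: filter + map, two traversals, arguably clearer.
  val nonEmpty1: Seq[String] = files.filter(_.len > 0).map(_.path)

  // Style 2: a single flatMap that returns Some/None per element.
  val nonEmpty2: Seq[String] =
    files.flatMap(f => if (f.len > 0) Some(f.path) else None)

  def main(args: Array[String]): Unit = {
    assert(nonEmpty1 == nonEmpty2) // both yield Seq("notEmpty")
  }
}
```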
It's a non-critical path in terms of performance. Should be okay.
This createBucketedReadRDD is for the bucketed table, right?
yes, and the same change is also in createNonBucketedReadRDD
withTempDir { dir =>
  val path = dir.getCanonicalPath
  Files.write(Paths.get(path, "empty"), Array.empty[Byte])
  Files.write(Paths.get(path, "notEmpty"), "a".getBytes)
Nit for consistency: .getBytes(StandardCharsets.UTF_8)
Files.write(Paths.get(path, "notEmpty"), "a".getBytes)
val readback = spark.read.option("wholetext", true).text(path)
assert(readback.rdd.getNumPartitions == 1)
Do we need 1 === ... to get the right assert message? It's tiny.
It seems the expected value should be on the right. I changed the order and got the following:
assert(123 === readback.rdd.getNumPartitions)
123 did not equal 1
ScalaTestFailureLocation: org.apache.spark.sql.sources.SaveLoadSuite at (SaveLoadSuite.scala:155)
Expected :1
Actual :123
The current assert triggers the correct message:
assert(readback.rdd.getNumPartitions == 123)
1 did not equal 123
ScalaTestFailureLocation: org.apache.spark.sql.sources.SaveLoadSuite at (SaveLoadSuite.scala:155)
Expected :123
Actual :1
I am just referring to === and the order of args. I'm sure the test was right as-is in what it asserts.
Test build #99335 has finished for PR 23130 at commit
Files.write(Paths.get(path, "notEmpty"), "a".getBytes(StandardCharsets.UTF_8))
val readback = spark.read.option("wholetext", true).text(path)
assert(readback.rdd.getNumPartitions === 1)
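Stitched together from the snippets above, this is a sketch of how the whole test might read inside the suite; the test name is an assumption, and withTempDir and spark come from the surrounding Spark test infrastructure:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Hypothetical test body; assumes it lives in a suite like SaveLoadSuite.
test("skip empty files in load") {
  withTempDir { dir =>
    val path = dir.getCanonicalPath
    Files.write(Paths.get(path, "empty"), Array.empty[Byte])
    Files.write(Paths.get(path, "notEmpty"), "a".getBytes(StandardCharsets.UTF_8))
    val readback = spark.read.option("wholetext", true).text(path)

    // wholetext makes files non-splittable, so without the fix the empty file
    // would contribute a second partition.
    assert(readback.rdd.getNumPartitions === 1)
  }
}
```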
does this test fail without your change? IIUC one partition can read multiple files. Is JSON the only data source that may return a row for empty file?
does this test fail without your change?

Yes, it does, due to the wholetext option.

Is JSON the only data source that may return a row for empty file?

We depend on the underlying parser here. I will check CSV and Text.
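A rough way to check this from a spark-shell could look like the sketch below; the directory name and explicit schema are assumptions, and the schema is only there to avoid schema inference failing on empty input:

```scala
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.types.{StringType, StructType}

// Assumes a running SparkSession named `spark` (e.g. spark-shell).
val dir = Files.createTempDirectory("empty-file-check").toString
Files.write(Paths.get(dir, "empty.txt"), Array.empty[Byte])

val schema = new StructType().add("value", StringType)

// If empty files were not skipped, these counts would show whether each
// parser emits rows for a zero-byte file.
println(spark.read.text(dir).count())                 // Text
println(spark.read.schema(schema).csv(dir).count())   // CSV
println(spark.read.schema(schema).json(dir).count())  // JSON
```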
do you mean wholetext mode will force creating one partition per file?
I think so; wholetext makes files not splittable:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala
Line 57 in 46110a5

super.isSplitable(sparkSession, options, path) && !textOptions.wholeText

This can guarantee (in text datasources at least) one file -> one partition.
IIUC one partition can read multiple files.
Do you mean this code?
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
Lines 459 to 464 in 8c68718
if (currentSize + file.length > maxSplitBytes) {
  closePartition()
}
// Add the given file to the current partition.
currentSize += file.length + openCostInBytes
currentFiles += file
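To make the packing logic above concrete, here is a simplified, standalone sketch of how files are grouped into read partitions by size; the names mirror the roles of maxSplitBytes and openCostInBytes, but this is not the actual Spark code:

```scala
object PackFilesSketch {
  // Greedily pack file sizes into partitions, closing a partition once it
  // would exceed maxSplitBytes; each file also pays a fixed open cost.
  def pack(fileSizes: Seq[Long], maxSplitBytes: Long, openCostInBytes: Long): Seq[Seq[Long]] = {
    val partitions = Seq.newBuilder[Seq[Long]]
    var current = Vector.empty[Long]
    var currentSize = 0L

    def closePartition(): Unit = {
      if (current.nonEmpty) partitions += current
      current = Vector.empty
      currentSize = 0L
    }

    // Sort by decreasing size, as Spark does, to keep partitions balanced.
    fileSizes.sortBy(-_).foreach { size =>
      if (currentSize + size > maxSplitBytes) closePartition()
      currentSize += size + openCostInBytes
      current :+= size
    }
    closePartition()
    partitions.result()
  }

  def main(args: Array[String]): Unit = {
    // Two small files land in one partition: Vector(Vector(200), Vector(10, 10))
    println(pack(Seq(10L, 10L, 200L), maxSplitBytes = 128L, openCostInBytes = 4L))
  }
}
```

This is why a test asserting an exact partition count needs a non-splittable source such as wholetext, as discussed above.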
thanks for pointing it out, I think we are good here.
I think this change makes sense; at least it's good for performance. My only concern is: shall we ask all the parsers to return Nil for empty files? AFAIK JSON doesn't follow it.
Test build #99465 has finished for PR 23130 at commit
@cloud-fan are you OK with this?
We don't need to block it, but @MaxGekk, if you have time, it would be great to answer #23130 (comment). Thanks, merging to master!
## What changes were proposed in this pull request?

In the PR, I propose filtering out all empty files inside of `FileSourceScanExec` and excluding them from file splits. It should reduce the overhead of opening and reading files without any data, and as a consequence datasources will not produce empty partitions for such files.

## How was this patch tested?

Added a test which creates an empty and a non-empty file. If empty files are ignored in load, the Text datasource in the `wholetext` mode must create only one partition for the non-empty file.

Closes apache#23130 from MaxGekk/ignore-empty-files.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
… on listing files

In apache#23130, all empty files are excluded from target file splits in `FileSourceScanExec`. In File source V2, we should keep the same behavior. This PR filters out empty files on listing files in `PartitioningAwareFileIndex` so that the upper level doesn't need to handle them.

Unit test.

Closes apache#24227 from gengliangwang/ignoreEmptyFile.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
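As a rough sketch of the follow-up's idea (dropping empty files at listing time instead of in the scan), using Hadoop's FileStatus API; listNonEmptyFiles is a hypothetical helper and not the actual PartitioningAwareFileIndex code:

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

object ListNonEmptyFilesSketch {
  // Hypothetical helper: list the files under a directory and drop the
  // zero-byte ones, so downstream planning never sees empty files.
  def listNonEmptyFiles(fs: FileSystem, dir: Path): Seq[FileStatus] =
    fs.listStatus(dir).toSeq.filter(s => s.isFile && s.getLen > 0)
}
```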
What changes were proposed in this pull request?

In the PR, I propose filtering out all empty files inside of FileSourceScanExec and excluding them from file splits. It should reduce the overhead of opening and reading files without any data, and as a consequence datasources will not produce empty partitions for such files.

How was this patch tested?

Added a test which creates an empty and a non-empty file. If empty files are ignored in load, the Text datasource in wholetext mode must create only one partition for the non-empty file.