[SPARK-17850][Core]Add a flag to ignore corrupt files (branch 1.6) #15454

zsxwing · 2016-10-12T23:04:24Z

What changes were proposed in this pull request?

This is the patch for 1.6. It only adds Spark conf spark.files.ignoreCorruptFiles because SQL just uses HadoopRDD directly in 1.6. spark.files.ignoreCorruptFiles is true by default.

How was this patch tested?

The added test.

zsxwing · 2016-10-12T23:05:27Z

core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala

+          try {
+            finished = !reader.nextKeyValue
+          } catch {
+            case _: EOFException if ignoreCorruptFiles => finished = true


This is a behavior change to NewHadoopRDD, which may surprise the existing 1.6 users.

Yeah, I am slightly worried about this change of behavior too.
Though I think it should be fine.
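To make the behavior change concrete, here is a minimal, self-contained sketch of the guarded catch from the hunk above. It is not the actual Spark code: `TruncatedReader` and `countRecords` are hypothetical stand-ins for a Hadoop `RecordReader` and the RDD's read loop, and only the `try`/`catch` shape is taken from the diff.

```scala
import java.io.EOFException

object CorruptFileSketch {
  // Hypothetical stand-in for a RecordReader: serves `good` records,
  // then throws EOFException the way a truncated file might.
  final class TruncatedReader(good: Int) {
    private var served = 0
    def nextKeyValue(): Boolean =
      if (served < good) { served += 1; true }
      else throw new EOFException("unexpected end of input")
  }

  // Counts the records read, using the same try/catch shape as the diff.
  def countRecords(reader: TruncatedReader, ignoreCorruptFiles: Boolean): Int = {
    var finished = false
    var count = 0
    while (!finished) {
      try {
        finished = !reader.nextKeyValue()
        if (!finished) count += 1
      } catch {
        // The guard makes this case match only when the flag is set;
        // otherwise nothing matches and the EOFException propagates.
        case _: EOFException if ignoreCorruptFiles => finished = true
      }
    }
    count
  }
}
```

With `ignoreCorruptFiles = true` the loop ends quietly after the records read so far; with `false` the `EOFException` reaches the caller, which in this sketch corresponds to the behavior before the `try` block was added.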

zsxwing · 2016-10-12T23:06:22Z

core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala

@@ -245,8 +248,7 @@ class HadoopRDD[K, V](
        try {
          finished = !reader.next(key, value)
        } catch {
-          case eof: EOFException =>
-            finished = true
+          case _: EOFException if ignoreCorruptFiles => finished = true


I didn't use IOException, to keep the default behavior the same as before.

sounds good

zsxwing

The question here is whether HadoopRDD and NewHadoopRDD should be consistent. If so, we have to break the current behavior.
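The trade-off can be spelled out as a small truth table. The sketch below is only an illustration: the "before" behaviors are read off the two diff hunks in this PR (an unconditional `case eof: EOFException` in HadoopRDD, and no `try` around `nextKeyValue` in NewHadoopRDD), and the names `Outcome`, `afterPatch` and `changedBy` are hypothetical.

```scala
object ConsistencySketch {
  sealed trait Outcome
  case object StopsQuietly extends Outcome // EOFException swallowed, read ends
  case object TaskFails extends Outcome    // EOFException propagates

  // Behavior before the patch, as read from the diff hunks (an assumption).
  val hadoopRddBefore: Outcome = StopsQuietly  // unconditional catch
  val newHadoopRddBefore: Outcome = TaskFails  // no try/catch

  // Behavior after the patch: both RDDs share the same guarded catch.
  def afterPatch(ignoreCorruptFiles: Boolean): Outcome =
    if (ignoreCorruptFiles) StopsQuietly else TaskFails

  // Names of the RDDs whose behavior changes for a given default of the flag.
  def changedBy(default: Boolean): Seq[String] =
    Seq("HadoopRDD" -> hadoopRddBefore, "NewHadoopRDD" -> newHadoopRddBefore)
      .collect { case (name, before) if before != afterPatch(default) => name }
}
```

Under these assumptions, `changedBy(true)` lists only NewHadoopRDD and `changedBy(false)` lists only HadoopRDD, so no single default leaves both RDDs behaving as they did before.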

SparkQA · 2016-10-13T01:03:39Z

Test build #66853 has finished for PR 15454 at commit 3715203.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mridulm

LGTM

mridulm · 2016-10-13T07:10:48Z

core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala

@@ -245,8 +248,7 @@ class HadoopRDD[K, V](
        try {
          finished = !reader.next(key, value)
        } catch {
-          case eof: EOFException =>
-            finished = true
+          case _: EOFException if ignoreCorruptFiles => finished = true


sounds good

mridulm · 2016-10-13T07:12:46Z

core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala

+          try {
+            finished = !reader.nextKeyValue
+          } catch {
+            case _: EOFException if ignoreCorruptFiles => finished = true


Yeah, I am slightly worried about this change of behavior too.
Though I think it should be fine.

## What changes were proposed in this pull request?

This is the patch for 1.6. It only adds Spark conf `spark.files.ignoreCorruptFiles` because SQL just uses HadoopRDD directly in 1.6. `spark.files.ignoreCorruptFiles` is `true` by default.

## How was this patch tested?

The added test.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #15454 from zsxwing/SPARK-17850-1.6.

## What changes were proposed in this pull request?

This is the patch for 1.6. It only adds Spark conf `spark.files.ignoreCorruptFiles` because SQL just uses HadoopRDD directly in 1.6. `spark.files.ignoreCorruptFiles` is `true` by default.

## How was this patch tested?

The added test.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#15454 from zsxwing/SPARK-17850-1.6.

(cherry picked from commit 585c565)

zsxwing · 2016-10-21T21:54:23Z

Since I have not yet heard any complaints about this for 1.6, and this change may break some users' jobs, I'm going to close it now.

Add a flag to ignore corrupt files

3715203

zsxwing changed the title ~~[SPARK-17850][Core]Add a flag to ignore corrupt files~~ [SPARK-17850][Core]Add a flag to ignore corrupt files (branch 1.6) Oct 12, 2016

zsxwing mentioned this pull request Oct 12, 2016

[SPARK-17850][Core]Add a flag to ignore corrupt files #15422

Closed

mridulm approved these changes Oct 13, 2016

zsxwing closed this Oct 21, 2016
