[SPARK-17850][Core]Add a flag to ignore corrupt files (branch 1.6) #15454
Conversation
```scala
try {
  finished = !reader.nextKeyValue
} catch {
  case _: EOFException if ignoreCorruptFiles => finished = true
}
```
This is a behavior change to `NewHadoopRDD`, which may surprise existing 1.6 users.
Yeah, I am slightly worried about this change of behavior too.
Though I think it should be fine.
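For readers without the full diff context, here is a minimal, self-contained sketch of the guarded-catch pattern under discussion; the object and method names are illustrative, not from the patch:

```scala
import java.io.EOFException

object CorruptFileGuard {
  // Returns true when iteration should stop. A truncated input file makes
  // the underlying reader throw EOFException; with the flag enabled we
  // treat that as end-of-input for the partition instead of failing the task.
  def finishedAfter(advance: () => Boolean, ignoreCorruptFiles: Boolean): Boolean =
    try {
      !advance()
    } catch {
      case _: EOFException if ignoreCorruptFiles => true
    }
}
```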
```diff
@@ -245,8 +248,7 @@ class HadoopRDD[K, V](
       try {
         finished = !reader.next(key, value)
       } catch {
-        case eof: EOFException =>
-          finished = true
+        case _: EOFException if ignoreCorruptFiles => finished = true
```
I didn't use `IOException`, to keep the default behavior the same as before.
sounds good
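For contrast, a broader guard on `IOException` — sketched below, and explicitly not what the patch does — would also swallow checksum mismatches and other genuine read errors, which is why the narrower `EOFException` catch preserves the existing default behavior:

```scala
import java.io.IOException

object BroadCorruptFileGuard {
  // NOT the patch's behavior: IOException covers far more than truncated
  // files, so this variant would hide real read failures when the flag is on.
  def finishedAfterBroad(advance: () => Boolean, ignoreCorruptFiles: Boolean): Boolean =
    try {
      !advance()
    } catch {
      case _: IOException if ignoreCorruptFiles => true
    }
}
```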
The question here is whether `HadoopRDD` and `NewHadoopRDD` should be consistent. If so, we have to break the current behavior.
Test build #66853 has finished for PR 15454 at commit
LGTM
Since I have not yet heard complaints about this for 1.6, and this one may break some users' jobs, I'm going to close it now.
What changes were proposed in this pull request?

This is the patch for 1.6. It only adds Spark conf `spark.files.ignoreCorruptFiles` because SQL just uses HadoopRDD directly in 1.6. `spark.files.ignoreCorruptFiles` is `true` by default.

How was this patch tested?

The added test.
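As a usage sketch — assuming the patch were applied to a 1.6 build; the app name and input path below are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IgnoreCorruptFilesDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ignore-corrupt-files-demo")
      // true is already the default per this PR; set explicitly for clarity.
      .set("spark.files.ignoreCorruptFiles", "true")
    val sc = new SparkContext(conf)
    // A truncated .gz under this (placeholder) path would normally fail the
    // job with an EOFException; with the flag on, each corrupt file is read
    // up to the corruption point and the job completes.
    val count = sc.textFile("hdfs:///data/logs/*.gz").count()
    println(s"records: $count")
    sc.stop()
  }
}
```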