[SPARK-35831][YARN][test-maven] Handle PathOperationException in copyFileToRemote on the same src and dest #32983
Conversation
```scala
} catch {
  // HADOOP-16878 changes the behavior to throw exceptions when src equals to dest
  case e: PathOperationException
      if force && srcFs.makeQualified(srcPath).equals(destFs.makeQualified(destPath)) =>
```
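For context, here is a condensed sketch of how this guarded catch sits around the copy call; the surrounding values (`srcFs`, `destFs`, `srcPath`, `destPath`, `hadoopConf`, `force`) follow the hunk above, and the empty case arm is assumed from it:

```scala
import org.apache.hadoop.fs.{FileUtil, PathOperationException}

try {
  FileUtil.copy(srcFs, srcPath, destFs, destPath, false, hadoopConf)
} catch {
  // HADOOP-16878 changes the behavior to throw an exception when src equals dest.
  // With force = true and identical qualified paths, the file is already in
  // place, so the exception is swallowed and execution continues.
  case e: PathOperationException
      if force && srcFs.makeQualified(srcPath).equals(destFs.makeQualified(destPath)) =>
}
```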
Should we just skip calling `copy` in this case?
Do you mean `overwrite` on the Spark side as a fallback?
I mean don't call `FileUtil.copy` in this case, if we already know it will throw an exception.
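A minimal sketch of that alternative, assuming the same surrounding variables as the hunk above (a hypothetical placement, not the code this PR merged):

```scala
// Compare the fully qualified paths up front and skip FileUtil.copy entirely
// when they are the same, instead of catching PathOperationException afterwards.
val qualifiedSrc = srcFs.makeQualified(srcPath)
val qualifiedDest = destFs.makeQualified(destPath)
if (!qualifiedSrc.equals(qualifiedDest)) {
  FileUtil.copy(srcFs, srcPath, destFs, destPath, false, hadoopConf)
}
```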
The exception depends on the underlying Hadoop version. Technically, Apache Hadoop 3.3.1 also should not throw this. I suspect that something is messed up in the Jenkins .m2 directory.
BTW, I don't want to introduce any behavior change here. As @HyukjinKwon pointed out, there might be a corner case (#32983 (comment)):

> This doesn't seem to handle the case where the nameservice of both the source and target filesystems is the same, but they point to different clusters. So it gives a false alarm in that case.
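To illustrate that corner case (hypothetical nameservice and paths, not from the PR): two HA clusters can both use the same logical nameservice, so qualified-path equality can be a false positive:

```scala
import org.apache.hadoop.fs.Path

// Both clusters use the logical nameservice "ns1" in their HDFS HA configs,
// but "ns1" resolves to different NameNodes on each cluster.
val srcOnClusterA  = new Path("hdfs://ns1/user/spark/conf.zip")
val destOnClusterB = new Path("hdfs://ns1/user/spark/conf.zip")

// The URIs compare equal, so a same-path check would wrongly skip (or swallow
// an exception for) a copy that actually crosses clusters.
srcOnClusterA.equals(destOnClusterB)  // true
```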
Kubernetes integration test starting
Kubernetes integration test status success
```scala
} catch {
  // HADOOP-16878 changes the behavior to throw exceptions when src equals to dest
  case e: PathOperationException
      if force && srcFs.makeQualified(srcPath).equals(destFs.makeQualified(destPath)) =>
```
Just to be doubly sure: previously it just silently ignored the operation instead of forcing the copy from the source to the destination when both paths are the same with `force` set to true, and this PR keeps the same behaviour regardless of the Hadoop version?
Instead of catching the exception, maybe we can just check if src and dst are the same, and if so skip the copying?
cc @steveloughran, @tgravescs, @mridulm too, FYI
Thank you for the review, @HyukjinKwon, @sunchao, and @viirya.
Yes, both skipping on the Spark side and overwriting on the Spark side are possible. For this PR, I just didn't want to change the previous Spark logic (invoking Hadoop), and I wanted to fix the UT failure.
Although the current situation (the reverted code) looks weird to me too, I took this approach because we need to handle Apache Hadoop 3.4 eventually.
I think it's all good as long as it keeps the behaviour the same as before. My only concern is the reason why the specific patch was reverted in Hadoop. From reading the Hadoop JIRA, it seems like it didn't simply ignore the case when the paths were the same, and it apparently broke something that was found during tests. If this case is not realistic in production, I think it should be good to go with this fix.
Although HADOOP-16878 is reopened, it's only reverted from `branch-3.3.1`. Here is my understanding. From the Apache Spark side,

```scala
val localConfArchive = new Path(createConfArchive(confsToOverride).toURI())
copyFileToRemote(destDir, localConfArchive, replication, symlinkCache, force = true,
  destName = Some(LOCALIZED_CONF_ARCHIVE))
```
Test build #140033 has finished for PR 32983.
The current UT failure is …
Locally, …
Retest this please.
Kubernetes integration test starting
Kubernetes integration test status success
Test build #140037 has finished for PR 32983.
The failure is a known one.
Retest this please
Kubernetes integration test starting
Kubernetes integration test status success
Test build #140039 has finished for PR 32983.
Kubernetes integration test unable to build dist. exiting with code: 1
Test build #140045 has finished for PR 32983.
Retest this please
Kubernetes integration test unable to build dist. exiting with code: 1
Thanks for the ping.
Another safe fix is to have our own version of …
We can remove this hack after we find out the root cause, or after we upgrade to a newer version with a fix.
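A rough sketch of that workaround idea (the helper name and exact semantics are hypothetical; the elided identifier above is left unfilled):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// Hypothetical Spark-side wrapper: restore the pre-HADOOP-16878 behavior of
// treating a same-path copy as a no-op, regardless of the Hadoop version in use.
def copyIgnoringSamePath(
    srcFs: FileSystem, src: Path,
    destFs: FileSystem, dest: Path,
    conf: Configuration): Boolean = {
  if (srcFs.makeQualified(src).equals(destFs.makeQualified(dest))) {
    true // nothing to do: source and destination are the same file
  } else {
    FileUtil.copy(srcFs, src, destFs, dest, false, conf)
  }
}
```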
Thank you. I updated the PR title.
Yes, that's also possible, @gengliangwang, but Apache Hadoop 3.3.1 is also new, and the dependency could be reverted during the RC period.
Until now, it's been really hard to get a successful Jenkins run for many reasons. I'm trying to stabilize Jenkins first; that's why this PR is minimal and avoids adding new changes. Can we land this first and consider all the other options later, @viirya, @sunchao, @gengliangwang? I agree that all your suggestions are reasonable.
BTW, this PR is still waiting for a successful Jenkins run.
Kubernetes integration test starting
Kubernetes integration test status success
Test build #140048 has finished for PR 32983.
Retest this please
Kubernetes integration test starting
Kubernetes integration test status success
Test build #140052 has finished for PR 32983.
Could you review this once more, @gengliangwang? This PR passed Jenkins with Maven, while the Jenkins on …
Thanks for the work!
Merging to master to fix the test builds.
Thank you so much, @gengliangwang!
Thanks @dongjoon-hyun! The PR LGTM too. It's interesting that we still hit HADOOP-16878 even though it's reverted 🤔
Yes, it's really weird. I checked the 3.3.1 client jars with `javap`, but the class doesn't have those strings. Thus, I suspect some Jenkins environment issue.
I'm afraid people will soon be hitting HADOOP-17771 with this: a regression which got past testing by QE and others on EC2, and by us developers (with the AWS CLI installed). I'm both bemused and unhappy about this. Everyone else is just "unhappy", for varying levels of unhappy.
What changes were proposed in this pull request?
This PR aims to be more robust against underlying Hadoop library changes. Apache Spark's `copyFileToRemote` has an option, `force`, to always invoke copying, and it can hit `org.apache.hadoop.fs.PathOperationException` in some Hadoop versions.
From Apache Hadoop 3.3.1, we reverted HADOOP-16878 as the last revert commit on `branch-3.3.1`. However, it's still in Apache Hadoop 3.4.0.
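As an illustration, a minimal sketch (hypothetical path; the behavior is as described by HADOOP-16878) of how a same-path copy can throw:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileUtil, Path}

val conf = new Configuration()
val p = new Path("hdfs://nn:8020/tmp/__spark_conf__.zip") // hypothetical path
val fs = p.getFileSystem(conf)

// On Hadoop versions carrying HADOOP-16878 (e.g. 3.4.0-SNAPSHOT at the time),
// copying a file onto itself throws org.apache.hadoop.fs.PathOperationException
// instead of quietly succeeding.
FileUtil.copy(fs, p, fs, p, false /* deleteSource */, conf)
```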
Why are the changes needed?
Currently, Apache Spark Jenkins hits a flakiness issue.
Apache Spark has three cases (see the condensed sketch after this list):
- `!compareFs(srcFs, destFs)`: This is safe because we will not have this exception.
- `"file".equals(srcFs.getScheme)`: This is safe because this cannot be a false alarm.
- `force = true`: For the good-alarm part, Spark works in the same way. For the false-alarm part, Spark is safe because we use `force = true` only for copying `localConfArchive`, not for a general copy between two random clusters.
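A condensed sketch of that decision logic (not the verbatim Spark code; names follow the hunk discussed above):

```scala
// Only the force = true path can hit the HADOOP-16878 exception, and there the
// guarded catch swallows exactly the same-qualified-path case.
if (force || !compareFs(srcFs, destFs) || "file".equals(srcFs.getScheme)) {
  try {
    FileUtil.copy(srcFs, srcPath, destFs, destPath, false, hadoopConf)
  } catch {
    case e: PathOperationException
        if force && srcFs.makeQualified(srcPath).equals(destFs.makeQualified(destPath)) =>
  }
}
```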
Does this PR introduce any user-facing change?
No. This preserves the previous Apache Spark behavior.
How was this patch tested?
Pass the Jenkins with Maven.