
[SPARK-35831][YARN][test-maven] Handle PathOperationException in copyFileToRemote on the same src and dest #32983

Closed · wants to merge 2 commits

Conversation

dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Jun 19, 2021

What changes were proposed in this pull request?

This PR aims to make Spark more robust against underlying Hadoop library changes. Apache Spark's copyFileToRemote has a force option to always perform the copy, and that path can hit org.apache.hadoop.fs.PathOperationException in some Hadoop versions.

In Apache Hadoop 3.3.1, HADOOP-16878 was reverted (it is the last revert commit on branch-3.3.1). However, it is still in Apache Hadoop 3.4.0.

Why are the changes needed?

Currently, Apache Spark Jenkins is hitting a flaky failure:

org.apache.hadoop.fs.PathOperationException: 
`Source (file:/home/jenkins/workspace/spark-master-test-maven-hadoop-3.2/resource-managers/yarn/target/tmp/spark-703b8e99-63cc-4ba6-a9bc-25c7cae8f5f9/testJar9120517778809167117.jar) and destination (/home/jenkins/workspace/spark-master-test-maven-hadoop-3.2/resource-managers/yarn/target/tmp/spark-703b8e99-63cc-4ba6-a9bc-25c7cae8f5f9/testJar9120517778809167117.jar)
are equal in the copy command.': Operation not supported
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:403)

Apache Spark has three cases:

  • !compareFs(srcFs, destFs): This is safe because we will not hit this exception.
  • "file".equals(srcFs.getScheme): This is safe because this cannot be a false alarm.
  • force=true:
    • When the alarm is genuine (src really equals dest), Spark works in the same way as before.
    • For the false-alarm case, Spark is safe because we use force = true only for copying localConfArchive, not for a general copy between two arbitrary clusters; see the snippet and sketch below.
val localConfArchive = new Path(createConfArchive(confsToOverride).toURI())
copyFileToRemote(destDir, localConfArchive, replication, symlinkCache, force = true,
destName = Some(LOCALIZED_CONF_ARCHIVE))
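
As a rough illustration of the fix (a minimal sketch with a simplified signature; the real copyFileToRemote in Client.scala also handles replication, symlink caching, and destination naming):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path, PathOperationException}

// Sketch: tolerate HADOOP-16878's behavior change in FileUtil.copy, which
// throws PathOperationException when src equals dest.
def copyFileToRemoteTolerant(
    srcFs: FileSystem,
    srcPath: Path,
    destFs: FileSystem,
    destPath: Path,
    force: Boolean,
    hadoopConf: Configuration): Unit = {
  try {
    FileUtil.copy(srcFs, srcPath, destFs, destPath, false, hadoopConf)
  } catch {
    // If the copy was forced and both fully qualified paths are identical,
    // the file is already in place; treat the exception as a no-op.
    case e: PathOperationException
        if force && srcFs.makeQualified(srcPath).equals(destFs.makeQualified(destPath)) =>
  }
}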

Does this PR introduce any user-facing change?

No. This preserves the previous Apache Spark behavior.

How was this patch tested?

Pass the Jenkins with Maven.

@dongjoon-hyun
Member Author

cc @sunchao and @viirya


@dongjoon-hyun
Member Author

cc @gengliangwang

} catch {
// HADOOP-16878 changes the behavior to throw exceptions when src equals to dest
case e: PathOperationException
if force && srcFs.makeQualified(srcPath).equals(destFs.makeQualified(destPath)) =>
Member

Should we just skip calling copy in this case?

Member Author

Do you mean overwriting on the Spark side as a fallback?

Member

I mean: don't call FileUtil.copy in that case, if we already know it will throw an exception.

Member Author

The exception depends on the underlying Hadoop version. Technically, Apache Hadoop 3.3.1 should not throw this either. I suspect that something is messed up in the Jenkins .m2 directory.

Member Author

BTW, I don't want to introduce any behavior change here. As @HyukjinKwon pointed out, there might be a corner case (#32983 (comment)):

This doesn't seem to handle the case where the nameservice of both the source & target filesystems is the same, but they point to different clusters. So, it gives a false alarm in that case.

@SparkQA

SparkQA commented Jun 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44560/

@SparkQA

SparkQA commented Jun 20, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44560/

} catch {
// HADOOP-16878 changes the behavior to throw exceptions when src equals to dest
case e: PathOperationException
if force && srcFs.makeQualified(srcPath).equals(destFs.makeQualified(destPath)) =>
Member

Just to be doubly sure: previously it just silently ignored the operation instead of forcing the copy from the source to the destination when both paths were the same with force set to true, and this PR keeps the same behaviour regardless of the Hadoop version?

@sunchao
Member

sunchao commented Jun 20, 2021

Instead of catching the exception, maybe we can just check whether src and dst are the same, and if so skip the copying?
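
A rough sketch of this alternative (editorial illustration, not the approach this PR takes; srcFs, srcPath, destFs, destPath, and hadoopConf are assumed from the surrounding copyFileToRemote context):

// Compare the fully qualified paths up front and skip FileUtil.copy
// entirely when they are identical, instead of catching the exception.
val sameFile = srcFs.makeQualified(srcPath).equals(destFs.makeQualified(destPath))
if (!sameFile) {
  FileUtil.copy(srcFs, srcPath, destFs, destPath, false, hadoopConf)
}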

@HyukjinKwon
Member

cc @steveloughran, @tgravescs, @mridulm too fyi

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jun 20, 2021

Thank you for the review, @HyukjinKwon, @sunchao, and @viirya.

  • The current AS-IS master branch UTs are failing with this exception when it happens.
  • This PR aims to ignore the Hadoop exception only when it is caused by a same-file copy. If src = dest, there is no regression here.

Yes, both skipping on the Spark side and overwriting on the Spark side are possible. For this PR, I just wanted to fix the UT failure without changing the previous Spark logic (invoking Hadoop).

Instead of catching the exception, maybe we can just check whether src and dst are the same, and if so skip the copying?

@dongjoon-hyun
Member Author

Although the current situation (the reverted code) looks weird to me too, I took this approach because we eventually need to handle Apache Hadoop 3.4.

@HyukjinKwon
Member

I think it's all good as long as it keeps the behaviour the same as before.

My only concern is the reason why the specific patch was reverted in Hadoop. From reading the Hadoop JIRA:

This doesn't seem to handle the case where the nameservice of both the source & target filesystems is the same, but they point to different clusters. So, it gives a false alarm in that case.

It seems like we didn't simply ignore it when the paths were the same, and it seems like that broke something (apparently found during tests). If this case is not realistic in production, I think it should be good to go with this fix.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jun 20, 2021

Although HADOOP-16878 is reopened, it's only reverted from branch-3.3.1. trunk still has it.

Here is my understanding. From Apache Spark side,

  • We are hitting a Jenkins failure, and HADOOP-16878 looks like a breaking behavior change which is inappropriate in a maintenance release (3.3.0 -> 3.3.1). I'm happy with the revert, but we need to investigate why we are hitting this. (The investigation is a separate issue.)
  • Spark has three cases:
    • !compareFs(srcFs, destFs): This is safe because we will not hit this exception.
    • "file".equals(srcFs.getScheme): This is safe because this cannot be a false alarm.
    • force=true:
      • When the alarm is genuine (src really equals dest), Spark works in the same way as before.
      • For the false-alarm case, Spark is safe because we use force = true only for copying localConfArchive, not for a general copy between two arbitrary clusters:
val localConfArchive = new Path(createConfArchive(confsToOverride).toURI())
copyFileToRemote(destDir, localConfArchive, replication, symlinkCache, force = true,
destName = Some(LOCALIZED_CONF_ARCHIVE))

@SparkQA

SparkQA commented Jun 20, 2021

Test build #140033 has finished for PR 32983 at commit eeebbed.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jun 20, 2021

The "distribute jars archive" test passed in Jenkins.

ClientSuite:
...
- distribute jars archive

The current UT failure is in YarnClusterSuite, which looks unrelated.

YarnClusterSuite:
- run Spark in yarn-client mode
- run Spark in yarn-cluster mode
- run Spark in yarn-client mode with unmanaged am
- run Spark in yarn-client mode with different configurations, ensuring redaction *** FAILED ***
  The code passed to eventually never returned normally. Attempted 190 times over 3.0014907111 minutes. Last failure message: handle.getState().isFinal() was false. (BaseYarnClusterSuite.scala:198)
- run Spark in yarn-cluster mode with different configurations, ensuring redaction *** FAILED ***
  The code passed to eventually never returned normally. Attempted 190 times over 3.0011775924166666 minutes. Last failure message: handle.getState().isFinal() was false. (BaseYarnClusterSuite.scala:198)
- yarn-cluster should respect conf overrides in SparkHadoopUtil (SPARK-16414, SPARK-23630) *** FAILED ***
  The code passed to eventually never returned normally. Attempted 190 times over 3.001122617833333 minutes. Last failure message: handle.getState().isFinal() was false. (BaseYarnClusterSuite.scala:198)
- run Spark in yarn-client mode with additional jar *** FAILED ***
  The code passed to eventually never returned normally. Attempted 190 times over 3.001084003683333 minutes. Last failure message: handle.getState().isFinal() was false. (BaseYarnClusterSuite.scala:198)

Locally, YarnClusterSuite passes as follows:

[info] YarnClusterSuite:
[info] - run Spark in yarn-client mode (8 seconds, 259 milliseconds)
[info] - run Spark in yarn-cluster mode (10 seconds, 96 milliseconds)
[info] - run Spark in yarn-client mode with unmanaged am (7 seconds, 96 milliseconds)
[info] - run Spark in yarn-client mode with different configurations, ensuring redaction (8 seconds, 92 milliseconds)
[info] - run Spark in yarn-cluster mode with different configurations, ensuring redaction (9 seconds, 112 milliseconds)
[info] - yarn-cluster should respect conf overrides in SparkHadoopUtil (SPARK-16414, SPARK-23630) (9 seconds, 110 milliseconds)
[info] - run Spark in yarn-client mode with additional jar (7 seconds, 115 milliseconds)
[info] - run Spark in yarn-cluster mode with additional jar (10 seconds, 94 milliseconds)
[info] - run Spark in yarn-cluster mode unsuccessfully (6 seconds, 121 milliseconds)
[info] - run Spark in yarn-cluster mode failure after sc initialized (15 seconds, 122 milliseconds)
[info] - run Python application in yarn-client mode (9 seconds, 123 milliseconds)
[info] - run Python application in yarn-cluster mode (10 seconds, 128 milliseconds)
[info] - run Python application in yarn-cluster mode using spark.yarn.appMasterEnv to override local envvar (10 seconds, 123 milliseconds)
[info] - user class path first in client mode (8 seconds, 104 milliseconds)
[info] - user class path first in cluster mode (9 seconds, 112 milliseconds)
[info] - monitor app using launcher library (5 seconds, 170 milliseconds)
[info] - running Spark in yarn-cluster mode displays driver log links (10 seconds, 110 milliseconds)
[info] - timeout to get SparkContext in cluster mode triggers failure (11 seconds, 107 milliseconds)
[info] - executor env overwrite AM env in client mode (8 seconds, 111 milliseconds)
[info] - executor env overwrite AM env in cluster mode (9 seconds, 112 milliseconds)
[info] - SPARK-34472: ivySettings file with no scheme or file:// scheme should be localized on driver in cluster mode (17 seconds, 221 milliseconds)
[info] - SPARK-34472: ivySettings file with no scheme or file:// scheme should retain user provided path in client mode (14 seconds, 194 milliseconds)
[info] - SPARK-34472: ivySettings file with non-file:// schemes should throw an error (1 second, 719 milliseconds)
[info] Run completed in 3 minutes, 59 seconds.
[info] Total number of tests run: 23
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 23, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 245 s (04:05), completed Jun 19, 2021 11:51:30 PM

@dongjoon-hyun
Member Author

Retest this please.

@dongjoon-hyun
Member Author

After upgrading to Apache Hadoop 3.3.1, it fails 8 out of 10 times. I'd like to land this on master and do the follow-ups in a healthier Jenkins.
[Screenshot: Screen Shot 2021-06-20 at 12 11 29 AM]

@SparkQA

SparkQA commented Jun 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44564/

@SparkQA

SparkQA commented Jun 20, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44564/

@SparkQA

SparkQA commented Jun 20, 2021

Test build #140037 has finished for PR 32983 at commit eeebbed.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

The failure is a known StackOverflowError failure.

@dongjoon-hyun
Member Author

Retest this please

@SparkQA

SparkQA commented Jun 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44566/

@SparkQA

SparkQA commented Jun 20, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44566/

@SparkQA

SparkQA commented Jun 20, 2021

Test build #140039 has finished for PR 32983 at commit eeebbed.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 20, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44572/

@SparkQA

SparkQA commented Jun 20, 2021

Test build #140045 has finished for PR 32983 at commit c6cfd65.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Retest this please

@dongjoon-hyun
Member Author

Retest this please

@SparkQA

SparkQA commented Jun 20, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44575/

@gengliangwang
Member

gengliangwang commented Jun 20, 2021

Thanks for the ping.
Since the force is no longer checked, @dongjoon-hyun shall we update the PR title?

@gengliangwang
Member

Another safe fix is to have our own version of copy, identical to the current Hadoop 3.3.1 code, instead of calling FileUtil.copy:

  /** Copy files between FileSystems. */
  public static boolean copy(FileSystem srcFS, FileStatus srcStatus,
                             FileSystem dstFS, Path dst,
                             boolean deleteSource,
                             boolean overwrite,
                             Configuration conf) throws IOException {
    Path src = srcStatus.getPath();
    dst = checkDest(src.getName(), dstFS, dst, overwrite);
    if (srcStatus.isDirectory()) {
      checkDependencies(srcFS, src, dstFS, dst);
      if (!dstFS.mkdirs(dst)) {
        return false;
      }
      FileStatus contents[] = srcFS.listStatus(src);
      for (int i = 0; i < contents.length; i++) {
        copy(srcFS, contents[i], dstFS,
             new Path(dst, contents[i].getPath().getName()),
             deleteSource, overwrite, conf);
      }
    } else {
      InputStream in=null;
      OutputStream out = null;
      try {
        in = srcFS.open(src);
        out = dstFS.create(dst, overwrite);
        IOUtils.copyBytes(in, out, conf, true);
      } catch (IOException e) {
        IOUtils.closeStream(out);
        IOUtils.closeStream(in);
        throw e;
      }
    }
    if (deleteSource) {
      return srcFS.delete(src, true);
    } else {
      return true;
    }

  }

We can remove this hack after we find out the root cause, or once we upgrade to a newer Hadoop version with a fix.
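
For illustration, the call-site change would then be small; a hedged sketch (the SparkFileUtil.copy name is hypothetical, not from this PR):

// Hypothetical: call a vendored copy of the Hadoop 3.3.1 implementation
// above (named SparkFileUtil.copy here for illustration) instead of
// FileUtil.copy, with the same argument order as the vendored method.
val copied = SparkFileUtil.copy(
  srcFs, srcFs.getFileStatus(srcPath), destFs, destPath,
  false /* deleteSource */, true /* overwrite */, hadoopConf)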

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-35831][YARN][test-maven] Handle PathOperationException in copyFileToRemote with force on the same src and dest [SPARK-35831][YARN][test-maven] Handle PathOperationException in copyFileToRemote on the same src and dest Jun 20, 2021
@dongjoon-hyun
Member Author

Thank you. I updated the PR title.

Since the force is no longer checked, @dongjoon-hyun shall we update the PR title?

Yes, that's also possible, @gengliangwang, but Apache Hadoop 3.3.1 is also new, and the dependency could be reverted during the RC period.

Another safe fix is that we have our own version of copy which is the same as the current Hadoop 3.3.1 code, instead of calling FileUtil.copy.

@dongjoon-hyun
Member Author

Until now, it has been really hard to get a successful run in Jenkins, for many reasons. I'm trying to stabilize Jenkins first; that's why this PR is minimal and avoids adding new changes. Can we land this first and consider all the other options afterwards, @viirya, @sunchao, @gengliangwang? I agree that all your suggestions are reasonable.

@dongjoon-hyun
Member Author

BTW, this PR is still waiting for a successful Jenkins run.

@SparkQA

SparkQA commented Jun 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44576/

@SparkQA

SparkQA commented Jun 20, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44576/

@SparkQA

SparkQA commented Jun 20, 2021

Test build #140048 has finished for PR 32983 at commit c6cfd65.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Retest this please

@SparkQA

SparkQA commented Jun 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44580/

@SparkQA

SparkQA commented Jun 21, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44580/

@SparkQA

SparkQA commented Jun 21, 2021

Test build #140052 has finished for PR 32983 at commit c6cfd65.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Could you review this once more, @gengliangwang?

This PR passed Jenkins with Maven, while Jenkins on the master branch is still failing this test case.

Member

@gengliangwang gengliangwang left a comment

Thanks for the work!

@gengliangwang
Member

Merging to master to fix the test builds.

@dongjoon-hyun
Member Author

Thank you so much, @gengliangwang !

@dongjoon-hyun dongjoon-hyun deleted the SPARK-35831 branch June 21, 2021 15:30
@sunchao
Member

sunchao commented Jun 21, 2021

Thanks @dongjoon-hyun! The PR LGTM too. It's interesting that we still hit HADOOP-16878 even though it's reverted 🤔

@dongjoon-hyun
Member Author

Yes, it's really weird. I checked the 3.3.1 client jars with javap, but the class doesn't contain those strings. Thus, I suspect some Jenkins environment issue.

@steveloughran
Contributor

I'm afraid people will soon be hitting HADOOP-17771 with this: a regression which got past testing by QE and others on EC2, and by us developers (with the AWS CLI installed). Both bemused and unhappy about this. Everyone else is just "unhappy", for varying levels of unhappy.

  1. Take a look at HADOOP-17771. S3AFS creation fails "Unable to find a region via the region provider chain." hadoop#3133
  2. Created SPARK-35878 to cover adding a trivial fix... there'll be no need to wait for a Hadoop release with the fix in.
