
[SPARK-35831][YARN][test-maven] Handle PathOperationException in copyFileToRemote on the same src and dest #32983

Closed · wants to merge 2 commits

Conversation

dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Jun 19, 2021

What changes were proposed in this pull request?

This PR aims to make Spark more robust against underlying Hadoop library changes. Apache Spark's copyFileToRemote has a force option to always perform the copy, and that path can hit org.apache.hadoop.fs.PathOperationException in some Hadoop versions.

In Apache Hadoop 3.3.1, HADOOP-16878 was reverted (it is the last revert commit on branch-3.3.1). However, it is still in Apache Hadoop 3.4.0.

Why are the changes needed?

Currently, Apache Spark Jenkins is hitting a flaky failure:

org.apache.hadoop.fs.PathOperationException: 
`Source (file:/home/jenkins/workspace/spark-master-test-maven-hadoop-3.2/resource-managers/yarn/target/tmp/spark-703b8e99-63cc-4ba6-a9bc-25c7cae8f5f9/testJar9120517778809167117.jar) and destination (/home/jenkins/workspace/spark-master-test-maven-hadoop-3.2/resource-managers/yarn/target/tmp/spark-703b8e99-63cc-4ba6-a9bc-25c7cae8f5f9/testJar9120517778809167117.jar)
are equal in the copy command.': Operation not supported
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:403)

Apache Spark has three cases:

  • !compareFs(srcFs, destFs): This is safe because we will not hit this exception.
  • "file".equals(srcFs.getScheme): This is safe because this cannot be a false alarm.
  • force=true:
    • When the alarm is genuine (src really equals dest), Spark works in the same way as before.
    • For the false-alarm case, Spark is safe because we use force = true only for copying localConfArchive, not for a general copy between two arbitrary clusters; see the snippet and sketch below.
val localConfArchive = new Path(createConfArchive(confsToOverride).toURI())
copyFileToRemote(destDir, localConfArchive, replication, symlinkCache, force = true,
destName = Some(LOCALIZED_CONF_ARCHIVE))
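
As a rough illustration of the fix (a minimal sketch with a simplified signature; the real copyFileToRemote in Client.scala also handles replication, symlink caching, and destination naming):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path, PathOperationException}

// Sketch: tolerate HADOOP-16878's behavior change in FileUtil.copy, which
// throws PathOperationException when src equals dest.
def copyFileToRemoteTolerant(
    srcFs: FileSystem,
    srcPath: Path,
    destFs: FileSystem,
    destPath: Path,
    force: Boolean,
    hadoopConf: Configuration): Unit = {
  try {
    FileUtil.copy(srcFs, srcPath, destFs, destPath, false, hadoopConf)
  } catch {
    // If the copy was forced and both fully qualified paths are identical,
    // the file is already in place; treat the exception as a no-op.
    case e: PathOperationException
        if force && srcFs.makeQualified(srcPath).equals(destFs.makeQualified(destPath)) =>
  }
}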

Does this PR introduce any user-facing change?

No. This preserves the previous Apache Spark behavior.

How was this patch tested?

Pass the Jenkins with Maven.

@dongjoon-hyun
Member Author

cc @sunchao and @viirya


@dongjoon-hyun
Member Author

cc @gengliangwang

} catch {
// HADOOP-16878 changes the behavior to throw exceptions when src equals to dest
case e: PathOperationException
if force && srcFs.makeQualified(srcPath).equals(destFs.makeQualified(destPath)) =>
Member

Should we just skip calling copy in this case?

Member Author

Do you mean overwriting on the Spark side as a fallback?

Member

I mean: don't call FileUtil.copy in that case, if we already know it will throw an exception.

Member Author

The exception depends on the underlying Hadoop version. Technically, Apache Hadoop 3.3.1 should not throw this either. I suspect that something is messed up in the Jenkins .m2 directory.

Member Author

BTW, I don't want to introduce any behavior change here. As @HyukjinKwon pointed out, there might be a corner case (#32983 (comment)):

This doesn't seem to handle the case where the nameservice of both the source & target filesystems is the same, but they point to different clusters. So, it gives a false alarm in that case.

@SparkQA

SparkQA commented Jun 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44560/

@SparkQA

SparkQA commented Jun 20, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44560/

} catch {
// HADOOP-16878 changes the behavior to throw exceptions when src equals to dest
case e: PathOperationException
if force && srcFs.makeQualified(srcPath).equals(destFs.makeQualified(destPath)) =>
Member

Just to be doubly sure: previously it just silently ignored the operation instead of forcing the copy from the source to the destination when both paths were the same with force set to true, and this PR keeps the same behaviour regardless of the Hadoop version?

@sunchao
Member

sunchao commented Jun 20, 2021

Instead of catching the exception, maybe we can just check whether src and dst are the same, and if so skip the copying?
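
A rough sketch of this alternative (editorial illustration, not the approach this PR takes; srcFs, srcPath, destFs, destPath, and hadoopConf are assumed from the surrounding copyFileToRemote context):

// Compare the fully qualified paths up front and skip FileUtil.copy
// entirely when they are identical, instead of catching the exception.
val sameFile = srcFs.makeQualified(srcPath).equals(destFs.makeQualified(destPath))
if (!sameFile) {
  FileUtil.copy(srcFs, srcPath, destFs, destPath, false, hadoopConf)
}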

@HyukjinKwon
Member

cc @steveloughran, @tgravescs, @mridulm too fyi

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jun 20, 2021

Thank you for the review, @HyukjinKwon, @sunchao, and @viirya.

  • The current AS-IS master branch UTs are failing with this exception when it happens.
  • This PR aims to ignore the Hadoop exception only when it is caused by a same-file copy. If src = dest, there is no regression here.

Yes, both skipping on the Spark side and overwriting on the Spark side are possible. For this PR, I just wanted to fix the UT failure without changing the previous Spark logic (invoking Hadoop).

Instead of catching the exception, maybe we can just check whether src and dst are the same, and if so skip the copying?

@dongjoon-hyun
Member Author

Although the current situation (the reverted code) looks weird to me too, I took this approach because we eventually need to handle Apache Hadoop 3.4.

@HyukjinKwon
Member

I think it's all good as long as it keeps the behaviour the same as before.

My only concern is the reason why the specific patch was reverted in Hadoop. From reading the Hadoop JIRA:

This doesn't seem to handle the case where the nameservice of both the source & target filesystems is the same, but they point to different clusters. So, it gives a false alarm in that case.

It seems like we didn't simply ignore it when the paths were the same, and it seems like that broke something (apparently found during tests). If this case is not realistic in production, I think it should be good to go with this fix.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jun 20, 2021

Although HADOOP-16878 is reopened, it's only reverted from branch-3.3.1. trunk still has it.

Here is my understanding. From Apache Spark side,

  • We are hitting a Jenkins failure, and HADOOP-16878 looks like a breaking behavior change which is inappropriate in a maintenance release (3.3.0 -> 3.3.1). I'm happy with the revert, but we need to investigate why we are hitting this. (The investigation is a separate issue.)
  • Spark has three cases:
    • !compareFs(srcFs, destFs): This is safe because we will not hit this exception.
    • "file".equals(srcFs.getScheme): This is safe because this cannot be a false alarm.
    • force=true:
      • When the alarm is genuine (src really equals dest), Spark works in the same way as before.
      • For the false-alarm case, Spark is safe because we use force = true only for copying localConfArchive, not for a general copy between two arbitrary clusters:
val localConfArchive = new Path(createConfArchive(confsToOverride).toURI())
copyFileToRemote(destDir, localConfArchive, replication, symlinkCache, force = true,
destName = Some(LOCALIZED_CONF_ARCHIVE))

@SparkQA

SparkQA commented Jun 20, 2021

Test build #140033 has finished for PR 32983 at commit eeebbed.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jun 20, 2021

The "distribute jars archive" test passed in Jenkins.

ClientSuite:
...
- distribute jars archive

The current UT failure is in YarnClusterSuite, which looks unrelated.

YarnClusterSuite:
- run Spark in yarn-client mode
- run Spark in yarn-cluster mode
- run Spark in yarn-client mode with unmanaged am
- run Spark in yarn-client mode with different configurations, ensuring redaction *** FAILED ***
  The code passed to eventually never returned normally. Attempted 190 times over 3.0014907111 minutes. Last failure message: handle.getState().isFinal() was false. (BaseYarnClusterSuite.scala:198)
- run Spark in yarn-cluster mode with different configurations, ensuring redaction *** FAILED ***
  The code passed to eventually never returned normally. Attempted 190 times over 3.0011775924166666 minutes. Last failure message: handle.getState().isFinal() was false. (BaseYarnClusterSuite.scala:198)
- yarn-cluster should respect conf overrides in SparkHadoopUtil (SPARK-16414, SPARK-23630) *** FAILED ***
  The code passed to eventually never returned normally. Attempted 190 times over 3.001122617833333 minutes. Last failure message: handle.getState().isFinal() was false. (BaseYarnClusterSuite.scala:198)
- run Spark in yarn-client mode with additional jar *** FAILED ***
  The code passed to eventually never returned normally. Attempted 190 times over 3.001084003683333 minutes. Last failure message: handle.getState().isFinal() was false. (BaseYarnClusterSuite.scala:198)

Locally, YarnClusterSuite passes as follows:

[info] YarnClusterSuite:
[info] - run Spark in yarn-client mode (8 seconds, 259 milliseconds)
[info] - run Spark in yarn-cluster mode (10 seconds, 96 milliseconds)
[info] - run Spark in yarn-client mode with unmanaged am (7 seconds, 96 milliseconds)
[info] - run Spark in yarn-client mode with different configurations, ensuring redaction (8 seconds, 92 milliseconds)
[info] - run Spark in yarn-cluster mode with different configurations, ensuring redaction (9 seconds, 112 milliseconds)
[info] - yarn-cluster should respect conf overrides in SparkHadoopUtil (SPARK-16414, SPARK-23630) (9 seconds, 110 milliseconds)
[info] - run Spark in yarn-client mode with additional jar (7 seconds, 115 milliseconds)
[info] - run Spark in yarn-cluster mode with additional jar (10 seconds, 94 milliseconds)
[info] - run Spark in yarn-cluster mode unsuccessfully (6 seconds, 121 milliseconds)
[info] - run Spark in yarn-cluster mode failure after sc initialized (15 seconds, 122 milliseconds)
[info] - run Python application in yarn-client mode (9 seconds, 123 milliseconds)
[info] - run Python application in yarn-cluster mode (10 seconds, 128 milliseconds)
[info] - run Python application in yarn-cluster mode using spark.yarn.appMasterEnv to override local envvar (10 seconds, 123 milliseconds)
[info] - user class path first in client mode (8 seconds, 104 milliseconds)
[info] - user class path first in cluster mode (9 seconds, 112 milliseconds)
[info] - monitor app using launcher library (5 seconds, 170 milliseconds)
[info] - running Spark in yarn-cluster mode displays driver log links (10 seconds, 110 milliseconds)
[info] - timeout to get SparkContext in cluster mode triggers failure (11 seconds, 107 milliseconds)
[info] - executor env overwrite AM env in client mode (8 seconds, 111 milliseconds)
[info] - executor env overwrite AM env in cluster mode (9 seconds, 112 milliseconds)
[info] - SPARK-34472: ivySettings file with no scheme or file:// scheme should be localized on driver in cluster mode (17 seconds, 221 milliseconds)
[info] - SPARK-34472: ivySettings file with no scheme or file:// scheme should retain user provided path in client mode (14 seconds, 194 milliseconds)
[info] - SPARK-34472: ivySettings file with non-file:// schemes should throw an error (1 second, 719 milliseconds)
[info] Run completed in 3 minutes, 59 seconds.
[info] Total number of tests run: 23
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 23, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 245 s (04:05), completed Jun 19, 2021 11:51:30 PM

@dongjoon-hyun
Member Author

Retest this please.

@dongjoon-hyun
Member Author

After upgrading to Apache Hadoop 3.3.1, it fails 8 out of 10 times. I'd like to land this on master and do the follow-ups in a healthier Jenkins.
[Screenshot: Screen Shot 2021-06-20 at 12 11 29 AM]

@SparkQA

SparkQA commented Jun 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44564/

@SparkQA

SparkQA commented Jun 20, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44564/

@SparkQA

SparkQA commented Jun 20, 2021

Test build #140037 has finished for PR 32983 at commit eeebbed.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

The failure is a known StackOverflowError failure.

@dongjoon-hyun
Member Author

Retest this please

@SparkQA

SparkQA commented Jun 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44566/

@SparkQA

SparkQA commented Jun 20, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44566/

@SparkQA

SparkQA commented Jun 20, 2021

Test build #140039 has finished for PR 32983 at commit eeebbed.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 20, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44572/

@SparkQA

SparkQA commented Jun 20, 2021

Test build #140045 has finished for PR 32983 at commit c6cfd65.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Retest this please

@dongjoon-hyun
Member Author

Retest this please

@SparkQA

SparkQA commented Jun 20, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44575/

@gengliangwang
Member

gengliangwang commented Jun 20, 2021

Thanks for the ping.
Since the force is no longer checked, @dongjoon-hyun shall we update the PR title?

@gengliangwang
Member

Another safe fix is to have our own version of copy, identical to the current Hadoop 3.3.1 code, instead of calling FileUtil.copy:

  /** Copy files between FileSystems. */
  public static boolean copy(FileSystem srcFS, FileStatus srcStatus,
                             FileSystem dstFS, Path dst,
                             boolean deleteSource,
                             boolean overwrite,
                             Configuration conf) throws IOException {
    Path src = srcStatus.getPath();
    dst = checkDest(src.getName(), dstFS, dst, overwrite);
    if (srcStatus.isDirectory()) {
      checkDependencies(srcFS, src, dstFS, dst);
      if (!dstFS.mkdirs(dst)) {
        return false;
      }
      FileStatus contents[] = srcFS.listStatus(src);
      for (int i = 0; i < contents.length; i++) {
        copy(srcFS, contents[i], dstFS,
             new Path(dst, contents[i].getPath().getName()),
             deleteSource, overwrite, conf);
      }
    } else {
      InputStream in=null;
      OutputStream out = null;
      try {
        in = srcFS.open(src);
        out = dstFS.create(dst, overwrite);
        IOUtils.copyBytes(in, out, conf, true);
      } catch (IOException e) {
        IOUtils.closeStream(out);
        IOUtils.closeStream(in);
        throw e;
      }
    }
    if (deleteSource) {
      return srcFS.delete(src, true);
    } else {
      return true;
    }

  }

We can remove this hack after we find out the root cause, or once we upgrade to a newer Hadoop version with a fix.
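
For illustration, the call-site change would then be small; a hedged sketch (the SparkFileUtil.copy name is hypothetical, not from this PR):

// Hypothetical: call a vendored copy of the Hadoop 3.3.1 implementation
// above (named SparkFileUtil.copy here for illustration) instead of
// FileUtil.copy, with the same argument order as the vendored method.
val copied = SparkFileUtil.copy(
  srcFs, srcFs.getFileStatus(srcPath), destFs, destPath,
  false /* deleteSource */, true /* overwrite */, hadoopConf)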

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-35831][YARN][test-maven] Handle PathOperationException in copyFileToRemote with force on the same src and dest [SPARK-35831][YARN][test-maven] Handle PathOperationException in copyFileToRemote on the same src and dest Jun 20, 2021
@dongjoon-hyun
Member Author

Thank you. I updated the PR title.

Since the force is no longer checked, @dongjoon-hyun shall we update the PR title?

Yes, that's also possible, @gengliangwang, but Apache Hadoop 3.3.1 is also new, and the dependency could be reverted during the RC period.

Another safe fix is that we have our own version of copy which is the same as the current Hadoop 3.3.1 code, instead of calling FileUtil.copy.

@dongjoon-hyun
Member Author

Until now, it has been really hard to get a successful run in Jenkins, for many reasons. I'm trying to stabilize Jenkins first; that's why this PR is minimal and avoids adding new changes. Can we land this first and consider all the other options afterwards, @viirya, @sunchao, @gengliangwang? I agree that all your suggestions are reasonable.

@dongjoon-hyun
Member Author

BTW, this PR is still waiting for a successful Jenkins run.

@SparkQA

SparkQA commented Jun 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44576/

@SparkQA

SparkQA commented Jun 20, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44576/

@SparkQA

SparkQA commented Jun 20, 2021

Test build #140048 has finished for PR 32983 at commit c6cfd65.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Retest this please

@SparkQA

SparkQA commented Jun 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44580/

@SparkQA

SparkQA commented Jun 21, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44580/

@SparkQA

SparkQA commented Jun 21, 2021

Test build #140052 has finished for PR 32983 at commit c6cfd65.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Could you review this once more, @gengliangwang?

This PR passed Jenkins with Maven, while Jenkins on the master branch is still failing this test case.

Member

@gengliangwang gengliangwang left a comment

Thanks for the work!

@gengliangwang
Member

Merging to master to fix the test builds.

@dongjoon-hyun
Member Author

Thank you so much, @gengliangwang !

@dongjoon-hyun dongjoon-hyun deleted the SPARK-35831 branch June 21, 2021 15:30
@sunchao
Member

sunchao commented Jun 21, 2021

Thanks @dongjoon-hyun! The PR LGTM too. It's interesting that we still hit HADOOP-16878 even though it's reverted 🤔

@dongjoon-hyun
Member Author

Yes, it's really weird. I checked the 3.3.1 client jars with javap, but the class doesn't contain those strings. Thus, I suspect some Jenkins environment issue.

@steveloughran
Contributor

I'm afraid people will soon be hitting HADOOP-17771 with this: a regression which got past testing by QE and others on EC2, and by us developers (with the AWS CLI installed). Both bemused and unhappy about this. Everyone else is just "unhappy", for varying levels of unhappy.

  1. Take a look at HADOOP-17771. S3AFS creation fails "Unable to find a region via the region provider chain." hadoop#3133
  2. Created SPARK-35878 to cover adding a trivial fix... there'll be no need to wait for a Hadoop release with the fix in.
