
@rbalamohan

This is a subtask of HADOOP-16604, which aims to provide copy functionality for cloud native applications. The intent of this PR is to provide `copyFile(URI src, URI dst)` functionality for S3AFileSystem (HADOOP-16629).

Creating new PR due to a merge mess up in #1591.

Changes w.r.t. PR #1591:

  1. Fixed the doc (filesystem.md).
  2. Fixed AbstractContractCopyTest.
  3. If the file already exists at the destination, it is overwritten.
  4. Added CompletableFuture support: `public CompletableFuture<Void> copyFile(URI srcFile, URI dstFile)`.

CompletableFuture makes the API nicer. However, `CompletableFuture::get` (via `waitingGet`) invokes `Runtime.getAvailableProcessors` frequently, which can turn out to be an expensive native call depending on the workload. We can optimise this later if it turns out to be an issue.
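To illustrate the shape of the proposed API, here is a minimal pure-JDK sketch (not the S3A implementation; `java.nio.file` stands in for the object store, and all names here are illustrative) of an asynchronous copy exposed as a `CompletableFuture<Void>`:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncCopySketch {
    // Illustrative stand-in for the proposed copyFile(URI, URI): submit the
    // copy to an executor and hand the caller a CompletableFuture<Void>.
    static CompletableFuture<Void> copyFile(Path src, Path dst, ExecutorService pool) {
        return CompletableFuture.runAsync(() -> {
            try {
                // REPLACE_EXISTING mirrors the PR's overwrite-on-exists semantics.
                Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }, pool);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Path src = Files.createTempFile("src", ".txt");
        Files.write(src, "hello".getBytes());
        Path dst = src.resolveSibling("dst.txt");

        copyFile(src, dst, pool).get();   // get() blocks until the copy finishes
        System.out.println(new String(Files.readAllBytes(dst)));
        pool.shutdown();
    }
}
```

The caller can also chain `thenRun`/`thenCompose` instead of blocking on `get()`, which is what makes the future-returning signature attractive.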

If the destination bucket is different, the relevant permissions/policies have to be set up already; otherwise the call throws an exception. Providing a URI instead of a Path makes it easier to name a different bucket when needed. Since the implementation is yet to stabilize, we can make relevant changes in the store.

Testing was done in region=us-west-2 from my local laptop. Contract tests and huge file tests passed. Other tests are still running; I will post the results. (ITestS3AContractRename failed, but that is unrelated to this patch.)

@rbalamohan
Author

rbalamohan commented Oct 14, 2019

Test results with 8 parallel threads. (region=us-west-2)

 Tests run: 1101, Failures: 5, Errors: 24, Skipped: 318

Errors are not related to this patch.

@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Comment |
|:----:|:----------|--------:|:--------|
| 0 | reexec | 53 | Docker mode activated. |
| | _ Prechecks _ | | |
| +1 | dupname | 1 | No case conflicting files found. |
| +1 | @author | 0 | The patch does not contain any @author tags. |
| +1 | test4tests | 0 | The patch appears to include 5 new or modified test files. |
| | _ trunk Compile Tests _ | | |
| 0 | mvndep | 69 | Maven dependency ordering for branch |
| +1 | mvninstall | 1125 | trunk passed |
| +1 | compile | 1087 | trunk passed |
| +1 | checkstyle | 162 | trunk passed |
| +1 | mvnsite | 145 | trunk passed |
| +1 | shadedclient | 1158 | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 139 | trunk passed |
| 0 | spotbugs | 71 | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 | findbugs | 202 | trunk passed |
| | _ Patch Compile Tests _ | | |
| 0 | mvndep | 27 | Maven dependency ordering for patch |
| +1 | mvninstall | 90 | the patch passed |
| +1 | compile | 1058 | the patch passed |
| +1 | javac | 1058 | the patch passed |
| -0 | checkstyle | 174 | root: The patch generated 13 new + 106 unchanged - 0 fixed = 119 total (was 106) |
| +1 | mvnsite | 147 | the patch passed |
| +1 | whitespace | 0 | The patch has no whitespace issues. |
| +1 | xml | 1 | The patch has no ill-formed XML file. |
| +1 | shadedclient | 771 | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 136 | the patch passed |
| -1 | findbugs | 79 | hadoop-tools/hadoop-aws generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) |
| | _ Other Tests _ | | |
| -1 | unit | 538 | hadoop-common in the patch failed. |
| +1 | unit | 96 | hadoop-aws in the patch passed. |
| +1 | asflicense | 59 | The patch does not generate ASF License warnings. |
| | | 7472 | |

| Reason | Tests |
|:-------|:------|
| FindBugs | module:hadoop-tools/hadoop-aws |
| | Exceptional return value of java.util.concurrent.ThreadPoolExecutor.submit(Callable) ignored in org.apache.hadoop.fs.s3a.S3AFileSystem.copyFile(URI, URI) At S3AFileSystem.java:[line 2873] |
| Failed junit tests | hadoop.fs.TestFilterFileSystem |
| | hadoop.fs.TestHarFileSystem |

| Subsystem | Report/Notes |
|:----------|:-------------|
| Docker | Client=19.03.3 Server=19.03.3 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1655/1/artifact/out/Dockerfile |
| GITHUB PR | #1655 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml |
| uname | Linux a0a88f07421f 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / 5f4641a |
| Default Java | 1.8.0_222 |
| checkstyle | https://builds.apache.org/job/hadoop-multibranch/job/PR-1655/1/artifact/out/diff-checkstyle-root.txt |
| findbugs | https://builds.apache.org/job/hadoop-multibranch/job/PR-1655/1/artifact/out/new-findbugs-hadoop-tools_hadoop-aws.html |
| unit | https://builds.apache.org/job/hadoop-multibranch/job/PR-1655/1/artifact/out/patch-unit-hadoop-common-project_hadoop-common.txt |
| Test Results | https://builds.apache.org/job/hadoop-multibranch/job/PR-1655/1/testReport/ |
| Max. process+thread count | 1533 (vs. ulimit of 5500) |
| modules | C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: . |
| Console output | https://builds.apache.org/job/hadoop-multibranch/job/PR-1655/1/console |
| versions | git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1 |
| Powered by | Apache Yetus 0.10.0 http://yetus.apache.org |

This message was automatically generated.
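The FindBugs warning above fires because the `Future` returned by `ThreadPoolExecutor.submit(Callable)` is discarded, which silently swallows any exception thrown inside the task. A minimal pure-JDK sketch of the usual fix (keep the `Future`; names here are illustrative, not the S3A code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SubmitFutureSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();

        // FindBugs (RV_RETURN_VALUE_IGNORED) flags a bare executor.submit(...):
        // if the task throws, nothing ever observes the failure. Capturing the
        // Future lets the caller propagate it.
        Future<?> future = executor.submit(() -> {
            // the copy work would go here
            return null;
        });

        // get() rethrows any task failure wrapped in an ExecutionException.
        future.get();
        System.out.println("task completed");
        executor.shutdown();
    }
}
```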

@bgaborg bgaborg requested review from bgaborg and steveloughran and removed request for bgaborg October 17, 2019 14:28

package org.apache.hadoop.fs.contract.s3a;

import org.apache.hadoop.conf.Configuration;

Please use the following import order:

java.*
javax.*
--
everything but org.apache
--
org.apache.*
--
static stuff, in order


package org.apache.hadoop.fs.s3a.impl;

import org.apache.hadoop.fs.FileAlreadyExistsException;

Please use the following import order:

java.*
javax.*
--
everything but org.apache
--
org.apache.*
--
static stuff, in order

Author

Thanks @bgaborg. My IDE settings were not right; fixed the ordering in the recent commit.


@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Comment |
|:----:|:----------|--------:|:--------|
| 0 | reexec | 1798 | Docker mode activated. |
| | _ Prechecks _ | | |
| +1 | dupname | 0 | No case conflicting files found. |
| +1 | @author | 0 | The patch does not contain any @author tags. |
| +1 | test4tests | 0 | The patch appears to include 5 new or modified test files. |
| | _ trunk Compile Tests _ | | |
| 0 | mvndep | 92 | Maven dependency ordering for branch |
| +1 | mvninstall | 1101 | trunk passed |
| +1 | compile | 1015 | trunk passed |
| +1 | checkstyle | 160 | trunk passed |
| +1 | mvnsite | 138 | trunk passed |
| +1 | shadedclient | 1097 | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 132 | trunk passed |
| 0 | spotbugs | 72 | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 | findbugs | 193 | trunk passed |
| | _ Patch Compile Tests _ | | |
| 0 | mvndep | 25 | Maven dependency ordering for patch |
| +1 | mvninstall | 82 | the patch passed |
| +1 | compile | 976 | the patch passed |
| +1 | javac | 976 | the patch passed |
| -0 | checkstyle | 161 | root: The patch generated 10 new + 106 unchanged - 0 fixed = 116 total (was 106) |
| +1 | mvnsite | 135 | the patch passed |
| +1 | whitespace | 0 | The patch has no whitespace issues. |
| +1 | xml | 2 | The patch has no ill-formed XML file. |
| +1 | shadedclient | 763 | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 135 | the patch passed |
| -1 | findbugs | 84 | hadoop-tools/hadoop-aws generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) |
| | _ Other Tests _ | | |
| -1 | unit | 525 | hadoop-common in the patch failed. |
| +1 | unit | 91 | hadoop-aws in the patch passed. |
| +1 | asflicense | 56 | The patch does not generate ASF License warnings. |
| | | 8913 | |

| Reason | Tests |
|:-------|:------|
| FindBugs | module:hadoop-tools/hadoop-aws |
| | Exceptional return value of java.util.concurrent.ThreadPoolExecutor.submit(Callable) ignored in org.apache.hadoop.fs.s3a.S3AFileSystem.copyFile(URI, URI) At S3AFileSystem.java:[line 2923] |
| Failed junit tests | hadoop.fs.TestFilterFileSystem |
| | hadoop.fs.TestHarFileSystem |

| Subsystem | Report/Notes |
|:----------|:-------------|
| Docker | Client=19.03.4 Server=19.03.4 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1655/2/artifact/out/Dockerfile |
| GITHUB PR | #1655 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml |
| uname | Linux 64ee40b03c0e 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / 447f46d |
| Default Java | 1.8.0_222 |
| checkstyle | https://builds.apache.org/job/hadoop-multibranch/job/PR-1655/2/artifact/out/diff-checkstyle-root.txt |
| findbugs | https://builds.apache.org/job/hadoop-multibranch/job/PR-1655/2/artifact/out/new-findbugs-hadoop-tools_hadoop-aws.html |
| unit | https://builds.apache.org/job/hadoop-multibranch/job/PR-1655/2/artifact/out/patch-unit-hadoop-common-project_hadoop-common.txt |
| Test Results | https://builds.apache.org/job/hadoop-multibranch/job/PR-1655/2/testReport/ |
| Max. process+thread count | 1580 (vs. ulimit of 5500) |
| modules | C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: . |
| Console output | https://builds.apache.org/job/hadoop-multibranch/job/PR-1655/2/console |
| versions | git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1 |
| Powered by | Apache Yetus 0.10.0 http://yetus.apache.org |

This message was automatically generated.


#### Postconditions

`dstFile` is available in the filesystem; its contents match those of `srcFile`.
Contributor

as discussed, needs a bit more formality.


if not exists(FS, srcFile) : raise FileNotFoundException

if not exists(parentFolder(dstFile)) : raise [IllegalArgumentException]
Contributor

prefer FNFE

describe("Test folder copy. Recursive copy is not supported yet.");
// initiate copy
handleExpectedException(intercept(Exception.class,
() -> fs.copyFile(srcDir.toUri(), dstDir.toUri()).get()));
Contributor

assert the dest file is not there; if a file existed before, it is unchanged

// initiate copy
fs.copyFile(zeroByteFile.toUri(), dstFile.toUri()).get();
// Check if file is available
assertIsFile(zeroByteFile);
Contributor

assert source still there

// Check if file is available and verify its contents
assertIsFile(srcFile);
assertIsFile(dstFile);
verifyFileContents(fs, dstFile, srcData);
Contributor

for s3 we need to spin a bit here for eventual consistency; use LambdaTestUtils.eventually()
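`LambdaTestUtils.eventually()` retries an assertion until it passes or a timeout expires, which masks S3's eventual consistency in tests. A minimal pure-JDK sketch of that retry pattern (illustrative only, not the actual hadoop-common implementation):

```java
import java.util.concurrent.Callable;

public class EventuallySketch {
    // Retry a check until it passes or the timeout expires, in the spirit of
    // org.apache.hadoop.test.LambdaTestUtils.eventually().
    static void eventually(long timeoutMillis, long intervalMillis,
                           Callable<Boolean> check) throws Exception {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            if (check.call()) {
                return;                        // condition satisfied
            }
            if (System.currentTimeMillis() >= deadline) {
                throw new AssertionError("condition not met within timeout");
            }
            Thread.sleep(intervalMillis);      // back off before retrying
        }
    }

    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        // Simulated eventually-consistent read: only "visible" after ~50ms.
        eventually(1000, 10, () -> System.currentTimeMillis() - start > 50);
        System.out.println("condition met");
    }
}
```

In the test above, the `verifyFileContents` call would go inside the retried lambda so a delayed read does not fail the test immediately.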

}

@Test
public void testNonExistingFileCopy() throws Throwable {
Contributor

add equivalent where the dest file exists; verify it is still there

@steveloughran
Contributor

see #1679 for my proposal for a Filesystem API for multipart uploads. That is a draft implementation right now which lacks:

  1. FileContext support.
  2. updated specification.
  3. parent directory checks.

I must highlight the BulkOperationState issue. A big part of speeding up rename/delete/commit in S3A was eliminating needless duplicate checks of S3Guard state during bulk operations; we did this by caching the ongoing state. If you plan to copy many files in the same operation, you will need this.

If you look at how RenameOperation uses copyFile(), it updates its RenameTracker to keep that operation state consistent; for CommitOperations the state also gets passed around by way of a CommitContext. copy (like the multipart upload) is going to have to do the same.

I can help by rounding out the #1679 proposal with the tracking of that state to show you what to do, but:

  • yes, you are going to have to design something which is designed to work across different stores
  • and be stable across time.
  • with spec and tests.

I know HDFS is not above adding some private API to help out HBase and then, when nobody notices, pulling that up into hadoop-common, but I don't like that; see [HDFS-8631](https://issues.apache.org/jira/browse/HDFS-8631?focusedCommentId=16961004&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16961004). I do expect a fair amount of rigour here because we are going to be maintaining these APIs for decades.

private CopyResult copyFile(String srcKey, String dstKey, long size,
S3ObjectAttributes srcAttributes, S3AReadOpContext readContext)
throws IOException, InterruptedIOException {
return copyFile(bucket, srcKey, bucket, dstKey, size, srcAttributes, readContext);
Contributor

this doesn't update S3Guard, BTW. If you want newly created files to be visible, your copy operation owns this problem.

@steveloughran
Contributor

Also: -1 to cross store copies.
