HDFS-15484. Add new method batchRename for DistributedFileSystem and WebHdfsFileSystem #2235

Open · wants to merge 5 commits into trunk
Conversation

yangagile (Author) opened this pull request.

@steveloughran (Contributor) left a comment:
I've discussed some of what I'd like in https://issues.apache.org/jira/browse/HDFS-15484?focusedCommentId=17162752&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17162752

  • matching hadoop common JIRA to make clear this is going near the FS APIs
  • something in the filesystem spec to cover the new interface, define its semantics, etc. In particular, we need all things which are "obvious" to be written down, because often it turns out they aren't obvious at all.
  • tests which really set out to break things. Writing the spec will help you think of them

Some ideas for tests

  • renaming root
  • rename to root
  • rename to self
  • path under self
  • path above self
  • two sources to same dest
  • chained rename
  • swapping paths

API-wise, this could be our chance to fix rename properly, that is: I should be able to use this for a single rename((src, dest), opts) and have it do what I want. And as discussed, I want something which works well with object stores:

  • use a builder to let apps specify options (see openFile()) and use the same base builder classes
  • Return a future of the outcome. If we can get the HADOOP-16830 IOStatistics patch in first, the outcome returned can be declared as something which implements IOStatisticsSource. This matters to me, as I want to know the costs of rename operations.
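The future-returning shape could look roughly like the sketch below. Everything here (`AsyncRenameSketch`, `RenameResult`, `batchRenameAsync`) is a hypothetical name for illustration; in the proposal the real result type would implement IOStatisticsSource from HADOOP-16830 rather than the plain placeholder used here.

```java
import java.util.concurrent.CompletableFuture;

public class AsyncRenameSketch {
    /**
     * Hypothetical outcome type. In the proposed API this would also
     * implement IOStatisticsSource so callers can read rename costs.
     */
    public static final class RenameResult {
        public final int renamed;
        public RenameResult(int renamed) { this.renamed = renamed; }
    }

    /**
     * Future-returning shape of the proposed batch rename: the caller
     * gets the outcome asynchronously instead of a blocking boolean.
     */
    public static CompletableFuture<RenameResult> batchRenameAsync(
            String[][] pairs) {
        // A real implementation would submit the batch to the store;
        // this sketch just reports how many pairs it was given.
        return CompletableFuture.supplyAsync(
                () -> new RenameResult(pairs.length));
    }
}
```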

I think we should also add a rename option about atomicity; for a single rename() this would be that the rename itself is atomic. For a batch with size > 1, this means "the entire set of renames are atomic".

FileContext and ViewFS will also need to pass this through. Sorry.

One thing we could do here is actually provide a base implementation which iterates through the list/array of (src, dest) paths. This would let us add a non-atomic implementation to all filesystems/filecontexts. That would be very nice as it really would let me switch to using this API wherever we used rename(), such as distcp and MR committers.
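Such a base implementation could be little more than a loop that fails fast and reports how far it got. A minimal sketch, where `renameOne` is a hypothetical stand-in for a FileSystem's single-pair rename:

```java
import java.io.IOException;
import java.util.List;
import java.util.function.BiPredicate;

public class DefaultBatchRename {
    /**
     * Rename each (src, dst) pair in order. Not atomic: when a rename
     * fails, the earlier ones have already happened. The exception
     * records how far we got so callers can resume or clean up.
     */
    public static void batchRename(List<String[]> pairs,
                                   BiPredicate<String, String> renameOne)
            throws IOException {
        int index = 0;
        for (String[] pair : pairs) {
            if (!renameOne.test(pair[0], pair[1])) {
                throw new IOException("Batch rename failed at entry "
                        + index + " of " + pairs.size()
                        + ": " + pair[0] + " -> " + pair[1]);
            }
            index++;
        }
    }
}
```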

rename() is the trickiest of all the FS API calls to get right. I don't think we fully understand what "right" is. Certainly, if I were asked about the nuances of (src = file, dest = dir) and (src = dir, dest = dir), I'm not confident I could give an answer which is consistent for both POSIX and HDFS. This is our opportunity to make some progress here!

I know, this is going to add more work. But it is time.

* Rename a batch of files or directories.
* @see ClientProtocol#batchRename(String[] , String[], Options.Rename...)
*/
public void batchRename(String[] srcs, String[] dsts,

Contributor:
better as a list of <src, dest> pairs, so the mapping is obvious. Add javadocs.
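To illustrate the suggestion, a pair-based shape makes the src-to-dst mapping explicit and catches mismatched lengths up front. `RenamePair` here is a hypothetical local type, since Hadoop has no shared pair class (a point raised later in this review):

```java
import java.util.ArrayList;
import java.util.List;

public class PairRename {
    /** Minimal src/dst pair, standing in for a shared Hadoop pair type. */
    public static final class RenamePair {
        public final String src;
        public final String dst;
        public RenamePair(String src, String dst) {
            this.src = src;
            this.dst = dst;
        }
    }

    /** Zip parallel arrays into explicit pairs, rejecting a mismatch. */
    public static List<RenamePair> zip(String[] srcs, String[] dsts) {
        if (srcs.length != dsts.length) {
            throw new IllegalArgumentException(
                    "srcs and dsts differ in length: "
                    + srcs.length + " vs " + dsts.length);
        }
        List<RenamePair> pairs = new ArrayList<>(srcs.length);
        for (int i = 0; i < srcs.length; i++) {
            pairs.add(new RenamePair(srcs[i], dsts[i]));
        }
        return pairs;
    }
}
```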


Author:
OK.


public BatchOpsException(long index, long total,
String cause) {
super("Batch operation break! " +

Contributor:
not sure about "break!"


Author:
change to "Batch operation partial success"

@InterfaceAudience.Private
@InterfaceStability.Evolving
public final class BatchOpsException extends IOException {
private static final long serialVersionUID = 1L;

Contributor:
needs a real serial version ID; your IDE can help there


Author:
Done, Thanks!


public BatchOpsException(long index, long total, Throwable cause) {
this(index, total,
cause.getClass().getName() + ": " + cause.getMessage());

Contributor:
cause.toString(); message may be null
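The concern: Throwable.getMessage() returns null when the cause was constructed without a message, so the original pattern renders as "java.io.IOException: null", while Throwable.toString() already includes the class name and handles the null case itself. A small illustration:

```java
public class CauseMessageDemo {
    /** Null-safe summary of a cause, as suggested: prefer toString(). */
    static String summarize(Throwable cause) {
        // "java.io.IOException" or "java.io.IOException: msg"
        return cause.toString();
    }

    /**
     * The original pattern: yields "java.io.IOException: null" for a
     * cause constructed without a message.
     */
    static String fragile(Throwable cause) {
        return cause.getClass().getName() + ": " + cause.getMessage();
    }
}
```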


Author:
Added one more check.

@@ -139,4 +139,11 @@ private CommonPathCapabilities() {
public static final String FS_MULTIPART_UPLOADER =
"fs.capability.multipart.uploader";

/**
* Does the store support multipart uploading?

Contributor:
fix


Author:
Done

Path p = new Path(dir, tag + "_" + i);
if (createNum-- > 0) {
DFSTestUtil.createFile(dfs, p, 10, (short) 1, 0);
assertTrue(dfs.exists(p));

Contributor:
ContractTestUtils assertFile() (or similar)


Author:
Done, thanks!

dsts.toArray(new String[dsts.size()]));
} catch (BatchOpsException e) {
long index = e.getIndex();
assertEquals(1, index);

Contributor:
All tests should add a message about what a failure means. Imagine you have a Yetus test-run failure and are trying to debug from the log only.
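For example, JUnit's three-argument assertEquals(message, expected, actual) puts the explanation straight into the failure log. The tiny re-implementation below mimics that form to show what the log gains; it is illustrative only, not part of the patch:

```java
public class AssertWithMessage {
    /** assertEquals with a diagnostic message, like JUnit's 3-arg form. */
    static void assertEquals(String message, long expected, long actual) {
        if (expected != actual) {
            // The message turns "expected:<1> but was:<0>" into a line
            // that explains what the mismatch means for the test.
            throw new AssertionError(message
                    + " expected:<" + expected + "> but was:<" + actual + ">");
        }
    }
}
```

So instead of `assertEquals(1, index)`, the test would read, say, `assertEquals("batch should stop at the first failing rename", 1, index)`.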


Author:
Done.


@InterfaceAudience.Public
@InterfaceStability.Unstable
public interface BatchOperations {

Contributor:
BatchRename


Author:
I'll change it to BatchRename; I thought we might add other batch methods to this interface in the future.

* @param dsts target file list.
* @throws IOException failure exception.
*/
void batchRename(String[] srcs, String[] dsts, Options.Rename... options)

Contributor:
Prefer list or array of pairs. We don't have any pair type in hadoop here and can't use commons-lang as we don't want that in our public API. Maybe we should add one to org.apache.hadoop.common.utils and use it here amongst other places. I could certainly use it (I may be able to add this to HADOOP-16830 for you to pick up)

Return a future where we define RenameResult as something (class/interface) which implements IOStatisticsSource.


Author:
OK, I'll integrate this later.

{
validateOpParams(op, destination);
final EnumSet<Options.Rename> s = renameOptions.getValue();
cp.batchRename(fullpath.split(":"), destination.getValue().split(":"),

Contributor:
what will be done here if there's a ":" in paths


Author:
":" is an invalid character in a Hadoop absolute path; please refer to the method DFSUtilClient.isValidName().
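Given that rule, splitting on ":" is safe, though a defensive handler could still validate each component after the split. A simplified sketch, where isValidName is a toy stand-in for DFSUtilClient.isValidName():

```java
public class PathListCodec {
    /**
     * Simplified validity check, mirroring the HDFS rule that ":"
     * never appears in an absolute path (a stand-in for the real
     * DFSUtilClient.isValidName()).
     */
    static boolean isValidName(String path) {
        return path.startsWith("/") && !path.contains(":");
    }

    /** Split a ":"-joined path list, rejecting any invalid component. */
    static String[] split(String joined) {
        String[] parts = joined.split(":");
        for (String p : parts) {
            if (!isValidName(p)) {
                throw new IllegalArgumentException("Invalid path: " + p);
            }
        }
        return parts;
    }
}
```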

yangagile (Author):
Thanks @steveloughran for the detailed introduction.
Yes, we should do the right things, and implement them step by step.

@yangagile yangagile changed the title HDFS-15484 Add new method batchRename for DistributedFileSystem and W… HDFS-15484. Add new method batchRename for DistributedFileSystem and W… Aug 24, 2020
@hadoop-yetus
💔 -1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 28m 49s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 buf 0m 1s buf was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 3m 21s Maven dependency ordering for branch
+1 💚 mvninstall 25m 58s trunk passed
+1 💚 compile 19m 41s trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1
+1 💚 compile 17m 0s trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01
+1 💚 checkstyle 3m 0s trunk passed
+1 💚 mvnsite 4m 58s trunk passed
+1 💚 shadedclient 22m 43s branch has no errors when building and testing our client artifacts.
+1 💚 javadoc 3m 26s trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1
+1 💚 javadoc 5m 10s trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01
+0 🆗 spotbugs 1m 24s Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚 findbugs 9m 33s trunk passed
-0 ⚠️ patch 1m 46s Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 27s Maven dependency ordering for patch
-1 ❌ mvninstall 0m 22s hadoop-hdfs-rbf in the patch failed.
-1 ❌ compile 5m 25s root in the patch failed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1.
-1 ❌ cc 5m 25s root in the patch failed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1.
-1 ❌ javac 5m 25s root in the patch failed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1.
-1 ❌ compile 4m 53s root in the patch failed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01.
-1 ❌ cc 4m 53s root in the patch failed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01.
-1 ❌ javac 4m 53s root in the patch failed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01.
-0 ⚠️ checkstyle 2m 41s root: The patch generated 13 new + 672 unchanged - 4 fixed = 685 total (was 676)
-1 ❌ mvnsite 0m 25s hadoop-hdfs-rbf in the patch failed.
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 shadedclient 13m 42s patch has no errors when building and testing our client artifacts.
-1 ❌ javadoc 0m 36s hadoop-hdfs-project_hadoop-hdfs-client-jdkUbuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0)
-1 ❌ javadoc 1m 19s hadoop-common-project_hadoop-common-jdkPrivateBuild-1.8.0_265-8u265-b01-0ubuntu218.04-b01 with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu218.04-b01 generated 6 new + 1 unchanged - 0 fixed = 7 total (was 1)
-1 ❌ findbugs 0m 27s hadoop-hdfs-rbf in the patch failed.
_ Other Tests _
-1 ❌ unit 9m 6s hadoop-common in the patch failed.
+1 💚 unit 2m 5s hadoop-hdfs-client in the patch passed.
-1 ❌ unit 108m 7s hadoop-hdfs in the patch failed.
-1 ❌ unit 0m 36s hadoop-hdfs-rbf in the patch failed.
-1 ❌ asflicense 0m 45s The patch generated 3 ASF License warnings.
309m 16s
Reason Tests
Failed junit tests hadoop.fs.TestFilterFileSystem
hadoop.fs.TestHarFileSystem
hadoop.hdfs.TestFileChecksumCompositeCrc
hadoop.hdfs.server.namenode.TestFSEditLogLoader
hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier
hadoop.hdfs.server.namenode.TestCacheDirectives
hadoop.hdfs.server.namenode.snapshot.TestSnapshotDiffReport
hadoop.hdfs.TestFileChecksum
Subsystem Report/Notes
Docker ClientAPI=1.40 ServerAPI=1.40 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2235/5/artifact/out/Dockerfile
GITHUB PR #2235
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle cc buflint bufcompat
uname Linux 7f6c2ee9c03a 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 60de592
Default Java Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01
mvninstall https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2235/5/artifact/out/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs-rbf.txt
compile https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2235/5/artifact/out/patch-compile-root-jdkUbuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1.txt
cc https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2235/5/artifact/out/patch-compile-root-jdkUbuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1.txt
javac https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2235/5/artifact/out/patch-compile-root-jdkUbuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1.txt
compile https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2235/5/artifact/out/patch-compile-root-jdkPrivateBuild-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01.txt
cc https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2235/5/artifact/out/patch-compile-root-jdkPrivateBuild-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01.txt
javac https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2235/5/artifact/out/patch-compile-root-jdkPrivateBuild-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01.txt
checkstyle https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2235/5/artifact/out/diff-checkstyle-root.txt
mvnsite https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2235/5/artifact/out/patch-mvnsite-hadoop-hdfs-project_hadoop-hdfs-rbf.txt
javadoc https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2235/5/artifact/out/diff-javadoc-javadoc-hadoop-hdfs-project_hadoop-hdfs-client-jdkUbuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1.txt
javadoc https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2235/5/artifact/out/diff-javadoc-javadoc-hadoop-common-project_hadoop-common-jdkPrivateBuild-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01.txt
findbugs https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2235/5/artifact/out/patch-findbugs-hadoop-hdfs-project_hadoop-hdfs-rbf.txt
unit https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2235/5/artifact/out/patch-unit-hadoop-common-project_hadoop-common.txt
unit https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2235/5/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
unit https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2235/5/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2235/5/testReport/
asflicense https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2235/5/artifact/out/patch-asflicense-problems.txt
Max. process+thread count 5057 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs-client hadoop-hdfs-project/hadoop-hdfs hadoop-hdfs-project/hadoop-hdfs-rbf U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2235/5/console
versions git=2.17.1 maven=3.6.0 findbugs=4.0.6
Powered by Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

@apache apache deleted a comment from hadoop-yetus Sep 1, 2020
@steveloughran (Contributor):
Updated #743 with a better (still WiP) attempt to factor out rename/3 in FileSystem and let implementations implement it efficiently. Once it's in, we can just make rename/3 public there and see what subset of the (new) tests we will need.

  • have a look @ the comments on the commits to see my thoughts on next steps
  • and the rename.md file is where I want to put all the rename logic.

While we do this work in parallel (and me in my spare time :), you can create a rename.md file there too, just with your section....we can resolve the conflict when the time comes.

3 participants