HADOOP-18012. ABFS: Using Source Path eTags for Rename Idempotency checks #5488
Conversation
💔 -1 overall
This message was automatically generated.
commented. I like the tests, even though Mockito gets complicated fast and is painful to maintain.
We need tests for directory rename, and to verify that a missing source file is handled well now that failures there happen earlier.
And from my patch:
- a switch to turn this off
- skip for non-HNS stores
- we should look at the IOStatistics updates to make sure they are good.
...-azure/src/test/java/org/apache/hadoop/fs/azurebfs/services/TestAbfsRenameRetryRecovery.java
      boolean isMetadataIncompleteState)
      throws AzureBlobFileSystemException {
    final List<AbfsHttpHeader> requestHeaders = createDefaultHeaders();

    if (sourceEtag == null || sourceEtag.isEmpty()) {
- should be skipped on non-HNS.
- currently, when the etag comes in from the manifest committer, we know a file is being referenced and so renamed. Once you add a getPathStatus() call, you can determine whether or not the source is a directory, and get its etag if it is a file. Would knowing whether the source was a file or a directory help in choosing a recovery strategy?
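A minimal sketch of that idea, assuming a getPathStatus()-style HEAD probe whose result exposes the resource-type and ETag headers; the method call and header names here are illustrative, not the exact AbfsClient API:
    // hedged sketch: probe the source when no etag was passed in
    String etag = sourceEtag;
    boolean sourceIsFile = etag != null && !etag.isEmpty();
    if (!sourceIsFile) {
      // illustrative HEAD probe; the real call and header names may differ
      AbfsRestOperation statusOp = getPathStatus(source, tracingContext);
      String resourceType = statusOp.getResult().getResponseHeader("x-ms-resource-type");
      if ("file".equals(resourceType)) {
        sourceIsFile = true;
        etag = statusOp.getResult().getResponseHeader("ETag");
      }
    }
    // only file renames with a known source etag are candidates for etag-based recovery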
...-azure/src/test/java/org/apache/hadoop/fs/azurebfs/services/TestAbfsRenameRetryRecovery.java
    AbfsHttpOperation mockHttp404Op = Mockito.mock(AbfsHttpOperation.class);
    Mockito.doReturn(404).when(mockHttp404Op).getStatusCode();
    Mockito.doNothing().when(mockHttp404Op).processResponse(nullable(byte[].class), Mockito.any(int.class), Mockito.any(int.class));
    Mockito.doNothing().when(mockHttp404Op).setRequestProperty(nullable(String.class), nullable(String.class));
    Mockito.doNothing().when(mockHttp404Op).sendRequest(nullable(byte[].class), Mockito.any(int.class), Mockito.any(int.class));
    Mockito.doReturn("PUT").when(mockHttp404Op).getMethod();
    Mockito.doReturn("Source Path not found").when(mockHttp404Op).getStorageErrorMessage();
    Mockito.doReturn("SourcePathNotFound").when(mockHttp404Op).getStorageErrorCode();

    // mock object representing the 500 timeout result for first try of rename
    AbfsHttpOperation mockHttp500Op = Mockito.mock(AbfsHttpOperation.class);
    Mockito.doReturn(500).when(mockHttp500Op).getStatusCode();
    Mockito.doThrow(IOException.class)
        .when(mockHttp500Op).processResponse(nullable(byte[].class), Mockito.any(int.class), Mockito.any(int.class));
    Mockito.doNothing().when(mockHttp500Op).setRequestProperty(nullable(String.class), nullable(String.class));
    Mockito.doNothing().when(mockHttp500Op).sendRequest(nullable(byte[].class), Mockito.any(int.class), Mockito.any(int.class));
    Mockito.doReturn("PUT").when(mockHttp500Op).getMethod();
    Mockito.doReturn("ClientTimeoutError").when(mockHttp500Op).getStorageErrorCode();
Let's not have two mocks. As it's something which needs to verify server orchestration, let's not have it just as a unit test. Let's make it an integration test :).
Please refer to the following code, which actually brings the server into the picture:
    Mockito.doAnswer(answer -> {
      AbfsRestOperation op = new AbfsRestOperation(AbfsRestOperationType.RenamePath,
          spyClient, HTTP_METHOD_PUT, answer.getArgument(0), answer.getArgument(1));
      AbfsRestOperation spiedOp = Mockito.spy(op);
      addSpyBehavior(spiedOp, op, spyClient);
      return spiedOp;
    }).when(spyClient).createRenameRestOperation(nullable(URL.class), nullable(List.class));

  private void addSpyBehavior(final AbfsRestOperation spiedOp,
      final AbfsRestOperation normalOp, AbfsClient client)
      throws IOException {
    AbfsHttpOperation abfsHttpOperation = Mockito.spy(normalOp.createHttpOperation());
    AbfsHttpOperation normalOp1 = normalOp.createHttpOperation();
    normalOp1.getConnection().setRequestProperty(HttpHeaderConfigurations.AUTHORIZATION,
        client.getAccessToken());
    AbfsHttpOperation normalOp2 = normalOp.createHttpOperation();
    normalOp2.getConnection().setRequestProperty(HttpHeaderConfigurations.AUTHORIZATION,
        client.getAccessToken());
    int[] hits = new int[1];
    hits[0] = 0;
    Mockito.doAnswer(answer -> {
      if (hits[0] == 0) {
        mockIdempotencyIssueBehaviours(abfsHttpOperation, normalOp1);
        hits[0]++;
        return abfsHttpOperation;
      }
      hits[0]++;
      return normalOp2;
    }).when(spiedOp).createHttpOperation();
  }

  private void mockIdempotencyIssueBehaviours(final AbfsHttpOperation abfsHttpOperation,
      final AbfsHttpOperation normalOp)
      throws IOException {
    Mockito.doAnswer(answer -> {
      normalOp.sendRequest(answer.getArgument(0), answer.getArgument(1), answer.getArgument(2));
      normalOp.processResponse(answer.getArgument(0), answer.getArgument(1), answer.getArgument(2));
      throw new SocketException("connection-reset");
    }).when(abfsHttpOperation).sendRequest(Mockito.nullable(byte[].class),
        Mockito.nullable(int.class), Mockito.nullable(int.class));
  }
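The idea behind this snippet, as far as I read it: the spied AbfsRestOperation hands back a doctored AbfsHttpOperation on its first createHttpOperation() call, which performs the real send and response processing and then throws a SocketException("connection-reset"), so the server actually executes the rename while the client sees a network failure; every later call returns an unmodified operation, letting the retry run normally against the real service.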
refer to commit saxenapranav@5247e12
💔 -1 overall
This message was automatically generated.
      dstFileStatus = tryGetFileStatus(qualifiedDstPath, tracingContext);
    }

    // for Non-HNS accounts, rename resiliency cannot be maintained
    // as eTags are not preserved in rename
@steveloughran here is the change for checking and switching the rename resilience flag in the case of FNS accounts.
It's a bit of an ugly way to do it. If it is being set on every call, it's really a parameter which should be passed down directly.
Don't do it this way. AzureBlobFileSystemStore.getIsNamespaceEnabled() provides the information, so add a new isNamespaceEnabled parameter to AbfsClient.renamePath() and use that in the decision making.
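A rough sketch of that suggestion, threading the flag from the store into the client; the surrounding parameter lists are placeholders, not the final API:
    // in AzureBlobFileSystemStore.rename(...): pass the HNS status down explicitly
    boolean isNamespaceEnabled = getIsNamespaceEnabled(tracingContext);
    client.renamePath(sourceRelativePath, destinationRelativePath, continuation,
        tracingContext, sourceEtag, isNamespaceEnabled);

    // in AbfsClient.renamePath(...): only attempt etag-based recovery on HNS accounts,
    // since flat-namespace renames do not preserve etags
    boolean renameRecoveryPossible = renameResilience && isNamespaceEnabled;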
💔 -1 overall
This message was automatically generated.
commented.
@sreeb-msft the changes in my patch are all on top of your commit #e8ce29ffce5; you should be able to cherry-pick them from my branch directly onto your commit, then you can rebase all later changes on top.
That way, I can then pick up your changes and apply them to mine... this is how we can co-develop a branch just by
- adding each other's GitHub repos as remotes
- using git fetch to fetch
- then git cherry-pick
This is how I got Pranav's changes in, and it should work here too.
Can you do that and then look at my comments? Thanks.
    assertEquals(false, renameResult);

    // validating stat counters after rename
    Long connMadeAfterRename = counter.getIOStatistics().counters().
You can use IOStatisticAssertions here, for example to look up the counter (with assertions that the value is present) and to build an AssertJ assertion chain from a value:
    long connMadeBeforeRename = lookupCounterStatistic(iostats, STORE_IO_REQUEST.getSymbol());
    ...
    assertThatStatisticCounter(iostats,
        STORE_IO_REQUEST.getSymbol())
        .isEqualTo(1 + connMadeBeforeRename);
      dstFileStatus = tryGetFileStatus(qualifiedDstPath, tracingContext);
    }

    // for Non-HNS accounts, rename resiliency cannot be maintained
    // as eTags are not preserved in rename
It's a bit of an ugly way to do it. If it is being set on every call, it's really a parameter which should be passed down directly.
💔 -1 overall
This message was automatically generated.
933b0a5 to 59db26d
Have cherry-picked your changes and pushed the updates after resolving conflicts. Let me know if I should do anything differently.
commented. Key change: pass the HNS status down explicitly to the client rather than indirectly via the config.
This will also let you add a new test which verifies that on a non-HNS store rename recovery is never attempted (see the sketch after this list):
- invoke renamePath(... namespaceEnabled = false)
- expect failure
- and expect the count of rename attempts == 0
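A minimal sketch of such a test. createFailingSpyClient, renameAttempts, SOURCE, DEST and testTracingContext are hypothetical helpers/fields, and the renamePath() parameter list is an assumption rather than this PR's exact signature:
  @Test
  public void testNoRenameRecoveryOnNonHnsStore() throws Exception {
    // hypothetical helper: returns a spied client whose first rename attempt fails
    AbfsClient spyClient = createFailingSpyClient();

    // invoke renamePath(...) with isNamespaceEnabled = false and expect failure
    intercept(AbfsRestOperationException.class, () ->
        spyClient.renamePath(SOURCE, DEST, /* continuation */ null,
            testTracingContext, /* sourceEtag */ null,
            /* isNamespaceEnabled */ false));

    // and expect that no recovery (second rename attempt) was ever made
    Assertions.assertThat(renameAttempts(spyClient))
        .describedAs("rename recovery attempts on a non-HNS store")
        .isEqualTo(0);
  }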
      dstFileStatus = tryGetFileStatus(qualifiedDstPath, tracingContext);
    }

    // for Non-HNS accounts, rename resiliency cannot be maintained
    // as eTags are not preserved in rename
Don't do it this way. AzureBlobFileSystemStore.getIsNamespaceEnabled() provides the information, so add a new isNamespaceEnabled parameter to AbfsClient.renamePath() and use that in the decision making.
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsClient.java
    // etag passed in, so source is a file
    final boolean hasEtag = !isEmpty(sourceEtag);
    boolean isDir = !hasEtag;
    if (!hasEtag && renameResilience) {
And add && isNamespaceEnabled to the condition.
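So, roughly (a sketch of the suggested condition, not the merged code):
    // only probe for the source etag when resilience is enabled and the account is HNS,
    // since flat-namespace renames do not preserve etags
    if (!hasEtag && renameResilience && isNamespaceEnabled) {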
💔 -1 overall
This message was automatically generated.
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/AzureBlobFileSystem.java
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsClient.java
🎊 +1 overall
This message was automatically generated.
LGTM!
@@ -55,6 +55,10 @@
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.classification.VisibleForTesting;
import org.apache.hadoop.fs.azurebfs.constants.AbfsHttpConstants;
move these back to where they were. You could also move the org.apache entry on L57 down too. It's one of those "guava search and replace" imports, which is why it is in the wrong block and confusing your IDE.
...tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/AzureBlobFileSystemStore.java
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsClient.java
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/AzureBlobFileSystem.java
...azure/src/test/java/org/apache/hadoop/fs/azurebfs/ITestAzureBlobFileSystemDelegationSAS.java
- I'm not sure about the changes to the metadata error in AbfsClient.
- we are now checking for HNS status on startup. FWIW, bucket existence checks on S3A were a source of startup delays, especially on Spark processes with many threads all starting simultaneously. If it's in there, have a look at your cluster deployments and consider a smaller size of that semaphore. Actually, we may want to do that in core-default.xml.
💔 -1 overall
This message was automatically generated.
Need to remove the HEAD call from fs.initialize.
+1, I'm happy.
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsClient.java
Show resolved
Hide resolved
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsClient.java
Show resolved
Hide resolved
Still some merge problems; rolling back my vote.
+1 pending the merge and checkstyle issues.
Can you change this from draft to "ready for review" so I can merge this once you've done those?
🎊 +1 overall
This message was automatically generated.
+1 pending that checkstyle change
@@ -79,6 +81,7 @@
import static org.apache.hadoop.fs.azurebfs.constants.HttpQueryParams.*;
import org.apache.hadoop.fs.azurebfs.contracts.exceptions.AbfsRestOperationException;
cut this line so the move of it to the right place, L59, doesn't upset checkstyle
I'm not seeing this change in my local latest at all; that is, there are no duplicate imports. Still, making the change to pass checkstyle.
LGTM
+1
All good. Once Yetus gives the go-ahead, I will merge.
🎊 +1 overall
This message was automatically generated.
…empotency (#5488)

To support recovery of network failures during rename, the abfs client fetches the etag of the source file, and when recovering from a failure, uses this tag to determine whether the rename succeeded before the failure happened.

* This works for files, but not directories.
* It adds the overhead of a HEAD request before each rename.
* The option can be disabled by setting "fs.azure.enable.rename.resilience" to false.

Contributed by Sree Bhattacharyya
RenameFilePath on its first try receives a Request timed out error with code 500. On retrying the same operation, a Source file not found (404) error is received.
This change brings in the following mitigation:
It checks whether the eTags remain the same before and after the retry and accordingly sends an Operation Successful result instead of a source-file-not-found error. That is, if the rename actually succeeds at the backend, the original source eTag (before the operation) and the destination eTag (after the operation) should match each other. However, this logic works only in the case of HNS configurations, and only for files. For directories or flat namespaces, rename recovery is not attempted.
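In outline, the recovery decision looks roughly like the following (a hedged sketch with illustrative names such as recoveryPossible, isSourcePathNotFound and extractEtagHeader; not the exact code in the patch):
    // after a retried rename fails with SourcePathNotFound: if the destination's
    // etag now equals the etag the source had before the operation, the first
    // (timed-out) attempt must have succeeded, so report success instead of failure
    if (recoveryPossible && isSourcePathNotFound(retryException)) {
      String destEtag = extractEtagHeader(
          getPathStatus(destination, tracingContext).getResult());
      if (sourceEtag.equals(destEtag)) {
        return true;   // the earlier attempt renamed the file; treat the rename as successful
      }
    }
    throw retryException;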
The check for whether the eTags match or not is enabled by a configuration:
fs.azure.enable.rename.resilience
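For example, a client that wants to avoid the extra HEAD request per rename could turn the feature off (a usage sketch; only the property name comes from this change, the rest is standard Hadoop configuration code):
    // disable etag-based rename recovery, avoiding the extra HEAD before each rename
    Configuration conf = new Configuration();
    conf.setBoolean("fs.azure.enable.rename.resilience", false);
    // abfsUri is a placeholder for the abfs:// store URI
    FileSystem fs = FileSystem.newInstance(abfsUri, conf);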
The tests introduced mock the retry scenario and check for correct behavior (failure or recovery) depending on whether it is HNS, FNS, a directory or a file, both at the FileSystem level and by unit testing the renamePath API at the AbfsClient level.