HADOOP-15183 S3Guard store becomes inconsistent after partial failure of rename #843
Conversation
AmazonS3 client) {
  Preconditions.checkArgument(isNotEmpty(s3Attributes.getBucket()),
      "No Bucket");
  Preconditions.checkArgument(isNotEmpty(s3Attributes.getKey()), "No Key");
  Preconditions.checkArgument(contentLength >= 0, "Negative content length");
  long l = s3Attributes.getLen();
I'll rename this to "length"; too easy to mix l and 1. I'd originally used 'len' but the IDE told me off.
Tasks.foreach(commits)
    .suppressExceptions()
    .run(commit -> getCommitOperations().abortSingleCommit(commit));
try(CommitOperations.CommitContext commitContext
add a space
private final Callable<T> call;

CallableSupplier(final Callable<T> call) {
should be public. This class is to let me pass in a callable which raises an IOE as an arg to one of the java 8 operations which only allow RuntimeExceptions to be raised
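For illustration, here is a minimal, self-contained sketch of the pattern being described, assuming a shape like the following (this is not the actual Hadoop CallableSupplier source): wrap a Callable that throws IOException into a java.util.function.Supplier so it can be handed to APIs such as CompletableFuture.supplyAsync(), which only permit unchecked exceptions.

import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.Callable;
import java.util.function.Supplier;

// Sketch only: tunnel a checked IOException through a Supplier so that
// java.util.concurrent APIs (which only allow RuntimeExceptions) can run it.
public final class CallableSupplierSketch<T> implements Supplier<T> {

  private final Callable<T> call;

  public CallableSupplierSketch(final Callable<T> call) {
    this.call = call;
  }

  @Override
  public T get() {
    try {
      return call.call();
    } catch (IOException e) {
      // callers unwrap this when the future completes
      throw new UncheckedIOException(e);
    } catch (RuntimeException e) {
      throw e;
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}

// Example usage (hypothetical call site):
// CompletableFuture<Long> f = CompletableFuture.supplyAsync(
//     new CallableSupplierSketch<>(() -> fs.getFileStatus(path).getLen()));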
@@ -46,7 +46,7 @@ public static void setupClusters() throws IOException {

   @AfterClass
   public static void teardownClusters() throws IOException {
-    clusterBinding.terminate();
+    terminateCluster(clusterBinding);
see YARN-9568
import static org.junit.Assert.assertEquals;

/**
 * Unit test suite covering translation of AWS SDK exceptions to S3A exceptions,
Integration test suite
 * <li>repeated sorts do not change the order</li>
 * </ol>
 */
final class PathOrderComparators {
These are used to take the list of entries to add to a store and sort it so that topmost entries come first. This is to ensure that in the event of a partial failure during a put(list) operation, parent entries will have been added before children. That is: even if the DDB tables aren't complete w.r.t changes in the store (bad!) they do at least meet the requirement "no orphan entries"
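To illustrate the ordering being described, here is a standalone sketch (not the PathOrderComparators implementation): sort so that ancestors always precede descendants, for example by number of path components, so a partial put(list) never writes a child whose parent entry is missing.

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch only: shorter paths (fewer components) first; siblings in lexical order.
public final class TopmostFirstSketch {

  static final Comparator<String> TOPMOST_FIRST =
      Comparator.comparingInt((String p) -> p.split("/").length)
          .thenComparing(Comparator.<String>naturalOrder());

  public static void main(String[] args) {
    List<String> entries = Arrays.asList(
        "/a/b/file1", "/a", "/a/b", "/a/c/file2", "/a/c");
    entries.sort(TOPMOST_FIRST);
    // prints parents before children: [/a, /a/b, /a/c, /a/b/file1, /a/c/file2]
    System.out.println(entries);
  }
}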
💔 -1 overall
Force-pushed from ea61a61 to 24dbe82
Not sure what's up there. Usually it's stack traces in docs, but this is a lot.
The 600 tabs are from the file.
💔 -1 overall
Unit test failures in hadoop-common are unrelated; they look like directory setup issues on the test VM.
💔 -1 overall
🎊 +1 overall
Still looking through this patch. Like what I'm seeing so far. Couple of random comments. I'll try to resume tomorrow or Friday
 * (There's always a block for that, it just makes more sense to
 * perform the bulk delete after another block of copies have completed).
 */
public static final int RENAME_PARALLEL_LIMIT = 10;
Yep, we can always make this tunable later if we care.
There should be some coordination with the Hive community on this, where there's another layer of parallelization. I think they'd currently have to be doing individual file renames, and if so I think they can just keep going, this change can go out, and after that they might want to remove their parallelization layer. And they don't have to interact. But I hope I'm right about that and this isn't going to cause some huge spike in number of concurrent copies or something.
// the limit of active copies has been reached;
// wait for completion or errors to surface.
LOG.debug("Waiting for active copies to complete");
completeActiveCopies.apply("batch threshold reached");
Should be a significant improvement in runtime here. We can pipeline this in the future to eke out more performance. As we've been discussing on refactoring: we might want to break up this rename code later on as well. Maybe just helper functions to decompose it into smaller blocks?
Yes, a refactoring is much needed here. This part is getting huge: it's harder to review and, later, to debug.
Gabor: this is my attempt to clean it up. The rename logic has all been pulled out of S3AFS into the trackers, and I'm defining a couple of lambda expressions in the method to avoid copy-and-paste duplication and effort.
The only way to do anything else would be to pull the entire code "somewhere", but because of all the calls into S3AFS, that will only make sense as part of a relayering of ops: we'd have this work at the "object store + S3Guard" layer, rather than through the FS APIs. That's a major change and not going to happen in a patch which is big enough as it is.
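As an illustration of the lambda-in-method pattern being described, here is a minimal, self-contained sketch (not the actual S3AFileSystem code; the draining of copies is simplified to CompletableFuture.allOf, and the small FunctionRaisingIOE interface stands in for Hadoop's helper type):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Sketch only: one lambda, defined inside the method, drains the active copy
// futures so the same logic can be reused at the batch threshold and at the end.
public class RenameSketch {

  @FunctionalInterface
  interface FunctionRaisingIOE<T> {
    void apply(T arg) throws IOException;
  }

  public void rename(List<CompletableFuture<Void>> submittedCopies) throws IOException {
    List<CompletableFuture<Void>> activeCopies = new ArrayList<>();

    // defined once, invoked from several points in the method body
    FunctionRaisingIOE<String> completeActiveCopies = reason -> {
      System.out.printf("Waiting for %d active copies: %s%n", activeCopies.size(), reason);
      CompletableFuture.allOf(activeCopies.toArray(new CompletableFuture[0])).join();
      activeCopies.clear();
    };

    for (CompletableFuture<Void> copy : submittedCopies) {
      activeCopies.add(copy);
      if (activeCopies.size() >= 10) {          // RENAME_PARALLEL_LIMIT
        completeActiveCopies.apply("batch threshold reached");
      }
    }
    completeActiveCopies.apply("final batch");
  }
}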
List<Path> undeletedObjects = new ArrayList<>();
try {
  // remove the keys
  // this does will update the metastore on a failure, but on
nit: /does will/does/
another nit: the asymmetry here in metadatastore updating (only on failure) feels a bit odd
I know, it grates on me too. But the renameTracker is invoked on success, so it is all contained in this operation.
 */
@VisibleForTesting
@Retries.RetryMixed
void removeKeys(
Is there a test that exercises the partial failure logic here? (I may answer this as I get further in the review)
yes: ITestS3AFailureHandling testMultiObjectDeleteNoPermissions()
    "S3Guard metadata store records read"),
S3GUARD_METADATASTORE_RECORD_WRITES(
    "s3guard_metadatastore_record_writes",
    "S3Guard metadata store records written"),
Seems like this could be interesting for comparing different rename trackers? If so I'm curious what sort of stats / deltas you get with the new logic.
Good point. With the ancestor tracking the new renamer shouldn't do any more than before. And I can verify that the commit operations don't do duplicate puts of all the parents, the way they do currently
Thanks all for the reviews; I'll go through them. FWIW, the latest patch adds checks on dir updates for the case where a batch update overwrites a file updated in the ongoing operation with a dir entry. And it's failing in a prune test (one which, IMO, shouldn't be doing a full store-wide prune).
Force-pushed from e7453a4 to bca1021
💔 -1 overall
Force-pushed from bca1021 to 8c65dad
💔 -1 overall
This also fixed up ITestDynamoDBMetadataStore to do a better job of destroying test tables, as some were leaking. Change-Id: I7e70e743628498c3cbbe21c6cda796b08fd39d8d
…thout fixing it, do at least avoid teardown errors Change-Id: I869d943d9c1a110e27710ffb8e638b918db44574
Change-Id: I3a5337d36086a109687de16a8395ff8d7aa1b644
*Tests are failing; this is an interim patch*

This makes S3Guard.addAncestors aware of the state of the bulk operation by pushing down the whole ancestor search-and-add problem into the store itself. Because each store can have its own model of bulk operation state, it can use this (on a bulk operation, obviously) to skip any queries of a remote DB to see if a path exists. And when there's a fallback to a get(path), and that finds a path entry, that's added to the state; no need to check again.

* we could go better and add all parents of a found path to the bulk state; that way all ancestor creation in a put(path) will know to skip the parent entries - which may imply changing the contents of the DDBMetastore's AncestorState

I've had to push the addAncestors() call into the metastores themselves to get this to work, as only DDB maintains a state map. Alternatively, we expose some contains(path), add(path) entry to the base class (better: make that an interface; have the impl do a Set<Path>) from the outset. Although the DDB AncestorState stores the state as a map of Path -> DDBPathMetadata, it's only existence which is checked for.

Change-Id: I64bf0bc372654af0471e13c324a7af0454615960
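For illustration only, a sketch of the contains(path)/add(path) alternative floated in the commit message above: a small per-operation interface whose implementation simply remembers which paths have already been found or written, so repeated ancestor probes in the same bulk operation can skip the remote GET. The names here are hypothetical, not the actual Hadoop classes.

import org.apache.hadoop.fs.Path;
import java.util.HashSet;
import java.util.Set;

// Sketch only: the minimal ancestor-state contract discussed above.
interface AncestorStateSketch {
  boolean contains(Path path);
  void add(Path path);
}

// A Set<Path>-backed implementation, as suggested ("have the impl do a Set<Path>").
final class SetBackedAncestorState implements AncestorStateSketch {
  private final Set<Path> seen = new HashSet<>();

  @Override
  public boolean contains(Path path) {
    return seen.contains(path);
  }

  @Override
  public void add(Path path) {
    seen.add(path);
  }
}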
Change-Id: I9bceb5c5f30a555a63119ea6fdba24b023fc5458
* AncestorState explicitly exports its put/contains operations and makes inner field private.
* the AncestorState set isn't updated before entries are put() (some duplicate entries are still being logged)
* More log statements as needed to debug what is happening
* Improved javadocs on new RenameTracker classes.
* TestCommitOperations was failing a lot; eventually concluded that the DDB table was inconsistent. If you have a tombstone marker you can't delete its children, but if you do a GET on a child it will surface. This has not been fixed: tests simply clear up better. An FSCK is needed here.

TODO
* Tests to make sure that all directories are deleted in a rename; empty and implied.
* Review of ProgressiveRenameTracker's end-of-rename operation to see if it is adding too many delete calls
* See if I can think of more tests

Change-Id: Ic062406f19bc7d4bf1ab794bd26e85691027235c
Change-Id: I47aef155a0f60da10ec34d30e6e9bc4de88b7fcd
* partial failure of delete when handling fake directories
* Add a TODO comment where I need to do something
* Fix a typo on a method name in TestPartialDeleteFailures

Change-Id: Ic231d9dceb37ce4a726620e8d3343bf353fdf0c2
…e committing a job, so we can verify that job commits remove them Change-Id: I09805332a25bc8eb4158f9f162f43b97ff3a3c57
…ions is tracked ITestDynamoDBMetadataStore picks up the scale timeout Change-Id: I15428862ad5b8bf9e2713f9fd03f1645b5b4ef10
* getting the direct-to-DDB tests to scale better, especially ITestDynamoDBMetadataStore, mainly by breaking up the testBatchWrite test into five separate ones, with individual cleanup, and switching the cleanup from forgetMetadata to prune()...the latter does batch deletes in parallel and is a lot faster. As it was, on smaller stores, the throttling in teardown was the common source of failure. ThrottleTracker is pulled from the scale test for reuse in the non-scale option.
* Metastore.put(DirListingMetadata) now takes a BulkOperationState and passes it down. This is not actively in use nor exported all the way up, though everything is ready in the store. I'm just not convinced it is needed yet.

I've managed to change things enough for the ITestDynamoDBMetadataStore test to fail on the maven test runs, but not in the IDE. Also got a failure in ITestMagicCommitMRJob where there wasn't a cleanup of the underlying attempt dir.

Change-Id: I80169b12baacbecfb63970c0cf7ea061fe3ab117
…and generated inconsistency in the same operation. The DDB metastore now raises a PathIOE if the operation state lists a file at a path and another operation scans for a parent directory there. This is only going to happen in one of two cases: a deliberate test, or someone having created files under files. Change-Id: Ie22cfa6dafa3b9cbc4533f82b115af432a1c4dfb
…rState failure on a file found is now optionally downgradeable to a warn. Still unsure what to do there, as it's a symptom of a problem: there's a file in the store which is a parent of another file. For prune, we can downgrade this to a warn and then rely on FSCK to fix it. Change-Id: I2f2cf9b85bbd3acd8a606e669c1b7e27a9db16f4
Force-pushed from 8c65dad to 38d1f75
💔 -1 overall
* Add HADOOP-16364: translate timeout on destroy() into a new exception: TableDeleteTimeoutException.
* ITestDynamoDBMetadataStore performance: coalescing three tests into one, and inserting a new check for a version marker missing a version field. The tests which create and destroy tables are a key part of the bottleneck here, which is where the coalescing makes a big difference. FWIW, I think we could actually use the shared table for the version checking and just restore it after, to save 60s or so.
* Trying to recreate the failure I saw where prune failed because of an inconsistent ancestor write. I can't seem to replicate the failure (and I deleted the table with the problem). There's now scope in the metastore to decide what to do if there is an inconsistency (throw vs warn), and I'm wondering if we need to use the operation state to choose how to react. That is: on a duplicate, inconsistent prune, do "something" else. But what?

Note: if inconsistencies in the store can be amplified by prune, this is not a new problem. It's by tracking the ancestor state that it is being detected.

Change-Id: I2e83d986a1f072d98251443e0f9c65c7036ba8ce
@@ -1326,8 +1330,25 @@

<property>
  <name>fs.s3a.max.total.tasks</name>
  <value>5</value>
What do you think about sticking HADOOP-15729 in here? Essentially just providing 0 as the number of core threads to the executor, otherwise a lot of these threads never get cleaned up when idle. This manifests as a massive number of idle threads in long-running services like Impala and Hive Metastore when you've accessed many different buckets. I believe this patch would make the problem worse, so it's a good time to toss in the trivial fix and test it all together.
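For reference, a minimal sketch of the kind of fix being suggested (assumed shape, not the actual HADOOP-15729 patch): give the executor zero core threads, or allow core threads to time out, so idle threads are reclaimed rather than lingering for the life of a long-running service.

import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch only: a bounded pool whose threads all go away once idle.
public class IdleThreadSketch {
  public static ThreadPoolExecutor newBoundedPool(int maxThreads, long keepAliveSecs) {
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        0,                       // core threads: none kept alive when idle
        maxThreads,
        keepAliveSecs, TimeUnit.SECONDS,
        new SynchronousQueue<>());
    // equivalent alternative when core threads > 0:
    // pool.allowCoreThreadTimeOut(true);
    return pool;
  }
}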
 * After the operation has completed, it <i>MUST</i> be closed so
 * as to guarantee that all state is released.
 */
public class BulkOperationState implements Closeable {
I don't see when this ever gets associated with any actual state. What's the point of it?
 */
@RetryTranslated
void put(PathMetadata meta,
    @Nullable BulkOperationState BulkOperationState) throws IOException;
A variable with the same name as its type - is this more Java 8 craziness I haven't seen before? :)
checkArgument(!pathsToDelete.contains(sourcePath),
    "File being renamed is already processed %s", destPath);
// create the file metadata and update the local structures.
S3Guard.addMoveFile(
Hopefully by the time I'm done reviewing I'll understand the architecture here enough to answer this myself, but we've previously talked about an alternate rename strategy where we create everything, and then we delete everything, so there's always 1 copy of the complete data. How well does this change fit in with that idea? In the event that there's a failure we delete the incomplete copy on retry. If that's the source, we're now done. If that's the destination, we now start over. Nice and recoverable, and until it succeeds you haven't disturbed anything.
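A rough sketch of the strategy being proposed in this comment, purely illustrative: the ObjectStore interface and its method names are hypothetical, and this is not what the patch implements.

import java.io.IOException;
import java.util.List;

// Sketch only: copy everything first, then delete the source, so a complete
// copy of the data exists at every point in the rename.
public class CopyThenDeleteRenameSketch {

  interface ObjectStore {
    List<String> list(String prefix) throws IOException;
    void copy(String sourceKey, String destKey) throws IOException;
    void delete(List<String> keys) throws IOException;
  }

  static void rename(ObjectStore store, String src, String dest) throws IOException {
    List<String> sourceKeys = store.list(src);
    // Phase 1: copy everything. A failure here leaves the source intact;
    // a retry only needs to delete the partial destination and start over.
    for (String key : sourceKeys) {
      store.copy(key, dest + key.substring(src.length()));
    }
    // Phase 2: only once every copy has succeeded, delete the source.
    // A failure here leaves a complete destination; retry just finishes the delete.
    store.delete(sourceKeys);
  }
}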
// check if UNGUARDED_FLAG is passed and use NullMetadataStore in
// config to avoid side effects like creating the table if not exists
Configuration conf0 = getConf();
Rename this to something like unguardedConf please
Change-Id: I1185b21cb14d9c2238a27dc00f5eac930959f1a4
I've just pushed up a new version which addresses the checkstyle complaints and the various nits reported. Tested: S3 Ireland. All good except for the new test where I'm trying to create an inconsistency in the table to cause problems in prune. I think I'm going to put that to one side right now and worry about it in some tests for fsck in a subsequent patch. After this iteration I'm going to have to move to a new PR; Yetus is unhappy again.
Force-pushed from 47e9d94 to 115fb77
💔 -1 overall
💔 -1 overall
Closed to create a squashed PR and keep Yetus happy; rebased to trunk first.
The new PR is #951. Please review that one, as it is ~ready to go in and in need of criticism.
This is a major patch which has grown to become a "fix up the final issues with Auth mode" patch, built according to my proposed refactoring S3A design.

Multi-object delete: o.a.h.fs.s3a.impl.MultiObjectDeleteSupport handles partial delete failures by parsing the response and updating the metastore with those deleted entries.
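A hedged sketch of the idea behind that partial-failure handling (the method and type names here are illustrative, not the real class's API): given the keys we asked S3 to delete and the keys it reported as failed, work out which keys actually went away, so only those entries are removed from the S3Guard metastore.

import java.util.ArrayList;
import java.util.List;

// Sketch only: split a multi-object delete outcome into the keys that were
// really deleted, so the metastore is updated for those and only those.
public class PartialDeleteSketch {

  static List<String> successfullyDeleted(List<String> requestedKeys,
                                          List<String> failedKeys) {
    List<String> deleted = new ArrayList<>(requestedKeys);
    deleted.removeAll(failedKeys);   // anything not reported as failed was deleted
    return deleted;
  }

  // Usage: pass successfullyDeleted(...) to the metastore so its entries (and
  // only its entries) are removed, leaving the undeleted paths still listed.
}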
Rename (i.e. copy + delete), plus scale and perf of commits:

* Each metastore has the notion of a BulkOperationState; this can be used by a store to track entries which have been added/found during a set of operations (sequential or in parallel).
* A RenameTracker implements the algorithm for updating the store on rename. The DelayedUpdateTracker implements the classic "do it at the end" algorithm, while the ProgressiveRenameTracker does it incrementally (a rough sketch of these callbacks follows this list).
* The metastores have an addAncestors() call which can update the bulk state based on examining the store and the current state of the BulkOperationState. This has been pushed down from S3Guard.addAncestors() to allow them to do their own thing with the bulk state.
* S3AFilesystem.innerRename now copies in parallel; hard coded at 10 for now (having it be a factor of the delete page size makes sense). Each parallel copy operation notifies the current RenameTracker of its completion, so letting them choose what to do (DelayedUpdate: saves the change; ProgressiveRenameTracker: updates the store immediately).
* The RenameTracker also has the homework of updating the store state in the case of a partial rename failure. DelayedUpdateTracker: update the metastore; ProgressiveUpdate: no-op.
* Local and DDB metastores now only use the ProgressiveRenameTracker; this was done after I'd moved the classic design into its own tracker (i.e. out of S3AFileSystem). Oh, and there's a NullRenameTracker for the null metastore.
* The BulkOperationState ends up being passed all the way to the committers; to avoid them explicitly having to understand this, the CommitOperations class now has an inner (non-static) CommitContext class which contains this and exports the commit/abort/revert operations for the committers to call. Internally it then invokes WriteOperationHelper calls with the context, which then gets all the way through to S3AFileSystem.finishedWrite() for the updates there to decide what entries to put into the store.
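Illustrative sketch of the tracker callbacks described in the list above (simplified; the real RenameTracker API is richer). A progressive tracker would update the metastore inside fileCopied(); the delayed variant below just records the moves and applies them in one batch at completeRename().

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.Path;

// Sketch only: the callback surface a rename tracker needs.
abstract class RenameTrackerSketch {
  abstract void fileCopied(Path source, Path dest) throws IOException;
  abstract void completeRename() throws IOException;
  abstract void renameFailed(Exception cause) throws IOException;
}

// "Do it at the end" variant, in the spirit of the classic algorithm.
class DelayedUpdateSketch extends RenameTrackerSketch {
  private final List<Path[]> moves = new ArrayList<>();

  @Override
  void fileCopied(Path source, Path dest) {
    moves.add(new Path[] {source, dest});   // remember, don't touch the store yet
  }

  @Override
  void completeRename() throws IOException {
    // apply all recorded moves to the metastore in one batch (omitted)
  }

  @Override
  void renameFailed(Exception cause) throws IOException {
    // record into the metastore whatever was already copied (omitted)
  }
}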
With this design, note that for both the parallelized commit and copy ops, we can also shuffle the source data so that we aren't doing it all on the same lowest-level directory (what I'd expect today). That should reduce load on the S3 store. With a move to parallel copy operations that will make a difference: the list comes in sequentially, so the initial set of 10 renames will be adjacent. Actually, I could be even cleverer there and sort by size. Why so? It ensures that a large file doesn't become the straggler in the copy.
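The "sort by size" idea above amounts to something like the following (illustrative only, not part of the patch): order the copy list largest-first before submitting, so one big object is not left as the straggler at the end of the parallel rename.

import java.util.Comparator;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;

// Sketch only: largest files first in the copy queue.
final class LargestFirst {
  static void sort(List<FileStatus> toCopy) {
    toCopy.sort(Comparator.comparingLong(FileStatus::getLen).reversed());
  }
}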