Hive: Switch to RetryingHMSClient (allows configuration of retryDelays and retries) #3099
Conversation
Retriggering; looks like the past hang was due to: #3110
@pvary and @rdblue, here is the change as we discussed, if you want to take a look. After doing some tests with it, one unexpected side effect to call out for reviewers: due to https://issues.apache.org/jira/browse/HIVE-13014, lock() is now explicitly not allowed to retry (ref: https://github.com/apache/hive/blob/master/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java#L3413). On the one hand, correctness-wise it makes sense to avoid the retry (it's possible the server actually acquired the lock but failed to return to the client, leaving a dangling lock). On the other hand, it's a bit of a bummer, as lock() is called on the Iceberg side on every commit and is somewhat prone to random failures, e.g. https://issues.apache.org/jira/browse/HIVE-23052.
By the way, as an update: lock() is not particularly more susceptible to random failure. The lock issue we hit was actually a deterministic one: https://issues.apache.org/jira/browse/HIVE-25522 (a corrupted HMS instance due to a bad startup). So I am not sure how big of an issue it is anymore that RetryingMetaStoreClient does not retry lock() calls for safety reasons.
```java
private static final DynMethods.StaticMethod CLIENT_CTOR = DynMethods.builder("getProxy")
    .impl(RetryingMetaStoreClient.class, HiveConf.class)
    .impl(RetryingMetaStoreClient.class, HiveConf.class, Boolean.TYPE)
    .impl(RetryingMetaStoreClient.class, Configuration.class, Boolean.TYPE)
```
Order matters here. The first implementation found will be used, so this will use `getProxy(HiveConf)` before it will use `getProxy(HiveConf, Boolean)`. Is that intended?
It seems fine, newer versions don't even have the first constructor:
Yes, thanks again @jackye1995 for the pointers.
And the last link is the Hive 1 API (which gets pulled in by the Spark 2 tests): https://hive.apache.org/javadocs/r1.2.2/api/org/apache/hadoop/hive/metastore/RetryingMetaStoreClient.html
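As an aside, here is a minimal sketch of the ordering behavior discussed in this thread, using Iceberg's `DynMethods` (the impl list mirrors the snippet above; the wrapper class and `newClient` method are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.RetryingMetaStoreClient;
import org.apache.iceberg.common.DynMethods;

public class GetProxyExample {
  // DynMethods tries each impl in order and binds the first signature that
  // resolves on the current classpath, so the Hive 1/2/3 variants are listed
  // from oldest to newest here.
  private static final DynMethods.StaticMethod GET_CLIENT =
      DynMethods.builder("getProxy")
          .impl(RetryingMetaStoreClient.class, HiveConf.class)                     // Hive 1 and 2
          .impl(RetryingMetaStoreClient.class, HiveConf.class, Boolean.TYPE)       // Hive 2
          .impl(RetryingMetaStoreClient.class, Configuration.class, Boolean.TYPE)  // Hive 3
          .buildStatic();

  public static IMetaStoreClient newClient(HiveConf hiveConf) {
    // If I read DynMethods correctly, extra trailing arguments are trimmed to
    // fit the bound signature, so passing (conf, true) also works when the
    // single-argument Hive 1 variant binds.
    return GET_CLIENT.invoke(hiveConf, true);
  }
}
```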
```diff
  try {
-   return CLIENT_CTOR.newInstance(hiveConf);
+   return CLIENT_CTOR.invoke(hiveConf, true);
```
Can you please add a comment that documents the `true` here? Something like this:

```java
return GET_CLIENT.invoke(hiveConf, true /* use retries */);
```

I'm assuming that the `true` is retry related? If so, should this detect whether the Hive implementation will retry and turn retries back on?
It's not about retry, but to configure `allowEmbedded` for the `HiveMetaStoreClient` initialized internally in the `RetryingMetaStoreClient`:
Yes, thanks @jackye1995 for the pointer. From a quick glance, it seems the flag is just intended to determine whether to throw an exception if "hive.metastore.uris" is not set, so it is not terribly useful to document, unless you feel otherwise.
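For readers without the Hive source handy, the gist of that flag is roughly the following. This is a paraphrased sketch of the check, not the actual Hive code; the message text is illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.metastore.api.MetaException;

public class AllowEmbeddedSketch {
  // With allowEmbedded=false, an unset hive.metastore.uris fails fast instead
  // of silently starting an embedded metastore inside the client process.
  static void checkUris(Configuration conf, boolean allowEmbedded) throws MetaException {
    String uris = conf.get("hive.metastore.uris", "");
    if (uris.isEmpty() && !allowEmbedded) {
      throw new MetaException("Embedded metastore is not allowed; hive.metastore.uris is not set");
    }
  }
}
```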
```java
    .build();
// use appropriate ctor depending on whether we're working with Hive1, Hive2, or Hive3 dependencies
// we need to do this because there is a breaking API change between Hive1, Hive2, and Hive3
private static final DynMethods.StaticMethod CLIENT_CTOR = DynMethods.builder("getProxy")
```
This isn't really a client constructor anymore. Could you rename this `GET_CLIENT` or something?
Done
```diff
@@ -110,57 +103,4 @@ public void testGetTablesFailsForNonReconnectableException() throws Exception {
     AssertHelpers.assertThrows("Should throw exception", MetaException.class,
         "Another meta exception", () -> clients.run(client -> client.getTables("default", "t")));
   }
-
-  @Test
-  public void testConnectionFailureRestoreForMetaException() throws Exception {
```
These tests seem valuable if the client doesn't retry on its own. What about passing in an option to turn on retries and then using that to test retries here?
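Roughly, that suggestion amounts to the sketch below: a single retry gated by a caller-supplied flag, mirroring the pool's retry-on-reconnectable-exception behavior. All names here are hypothetical, not the PR's actual API:

```java
public class RetryFlagSketch {
  interface Action<R> {
    R run() throws Exception;
  }

  // Retry the action once on a "reconnectable" exception, but only when the
  // caller opted in via the retry flag.
  static <R> R run(Action<R> action, boolean retry, Class<? extends Exception> reconnectExc)
      throws Exception {
    try {
      return action.run();
    } catch (Exception e) {
      if (retry && reconnectExc.isInstance(e)) {
        // a real pool would reconnect the client here before retrying
        return action.run();
      }
      throw e;
    }
  }

  public static void main(String[] args) throws Exception {
    boolean[] failedOnce = {false};
    String result = run(() -> {
      if (!failedOnce[0]) {
        failedOnce[0] = true;
        throw new RuntimeException("connection reset"); // first attempt fails
      }
      return "ok";
    }, true, RuntimeException.class);
    System.out.println(result); // prints "ok": the flagged retry absorbed the failure
  }
}
```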
Good idea, done. Put the test back.

Side note: the flexibility of making it configurable is nice, and is a potential solution to the issue I mentioned; I am tempted to set it to true in HiveTableOperations::acquireLock on this line:

```java
LockResponse lockResponse = metaClients.run(client -> client.lock(lockRequest));
```

I can see from the Hive point of view why they don't want to auto-retry, but most exception scenarios will not hit the corner case of the lock having succeeded in the backend and being left behind. In all honesty, if the Iceberg job fails because lock() throws TTransportException, the end user will just try again, but it is wasted effort. Any thoughts on it?
cc @pvary
Having both Iceberg and Hive retry could also be problematic / surprising: we would just do NxM retries (2x2 in the default case).

I see 2 typical error scenarios in the case of `lock`:

- Cannot reach the HMS (the connection is broken, or something like that) - retry is a perfect solution.
- The locks are queued, but we have to wait and something happens in the meantime - the retry will be blocked forever (or until the lock timeout is reached for the previously queued locks) - retry is problematic / unnecessary.

How typical is it that the first request on the connection is the `lock`? If that happens often, then the 1st situation could happen often. If we usually get the metadata for the table first and then try to lock, then the 1st situation is unlikely to happen.

I would prefer to keep the retry logic in one place (`RetryingMetaStoreClient`) and simplify the Iceberg code by removing the retry altogether, but I can accept keeping the code as it is - we will most probably turn off the Iceberg retries in our deployment.
@pvary yea, this PR disables the retry for the Hive case, but leaves an option for a particular caller to turn it on, to address your concern.

About lock(): thanks for the reply; maybe you are right that it is not the typical case to hit TTransportException, especially if getting the metadata has already succeeded on the same connection.

The only thing bothering me is that there is no way for the client to distinguish case 1 from case 2 based on the exception type, and thus nothing they can really do except retry. There is a ShowLocks call, and potentially they could check whether there is a queued lock from the failed call, whether it is their own, and if so unlock it (see the sketch below). But I'm not sure how reliable this is. Anyway, that's maybe for another review then.
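For concreteness, a rough sketch of that ShowLocks-based cleanup, assuming the standard IMetaStoreClient API. Matching "our own" lock via agentInfo is a guess, which is exactly the reliability concern:

```java
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.ShowLocksRequest;
import org.apache.hadoop.hive.metastore.api.ShowLocksResponse;
import org.apache.hadoop.hive.metastore.api.ShowLocksResponseElement;

public class DanglingLockSketch {
  // After a failed lock() call, list the locks and release any that look like
  // the one we may have left behind on the server.
  static void releaseDanglingLock(IMetaStoreClient client, String db, String table,
                                  String ourAgentInfo) throws Exception {
    ShowLocksResponse locks = client.showLocks(new ShowLocksRequest());
    for (ShowLocksResponseElement e : locks.getLocks()) {
      if (db.equals(e.getDbname()) && table.equals(e.getTablename())
          && ourAgentInfo.equals(e.getAgentInfo())) { // heuristic identity check
        client.unlock(e.getLockid());
      }
    }
  }
}
```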
Maybe we could make Hive lock() SafeToRetry?

I have to think this through, but we could release the enqueued lock in case of an unexpected error inside the implementation, and also delete the previous version of the lock if someone has already created a lock for this transaction.
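For reference, the retries and delays mentioned in the PR title are driven by Hive configuration; to my understanding, these are the relevant properties read by RetryingMetaStoreClient (values below are illustrative):

```java
import org.apache.hadoop.hive.conf.HiveConf;

public class RetryTuningSketch {
  public static void main(String[] args) {
    HiveConf conf = new HiveConf();
    conf.set("hive.metastore.failure.retries", "3");              // number of retries
    conf.set("hive.metastore.client.connect.retry.delay", "5s");  // delay between attempts
    System.out.println(conf.get("hive.metastore.failure.retries"));
  }
}
```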
Force-pushed from a3b52e5 to c9b0b87.
Thanks for the review, guys. Addressed the feedback. For @jackye1995's questions, see the replies above.
@pvary thanks for the review, addressed comments, sorry for the delay.
@szehon-ho: This version is fine by me. Let's wait a little bit for the others to chime in, in case they have any more concerns. If there are no objections by next Monday, I will merge then, as I will be OOO at the end of this week. Please ping me if I forget to push. Thanks, Peter
Thanks a lot, Peter, for following up.
Merged.
Thanks for getting this in, @pvary and @szehon-ho!
Issue: #3087

The options are: implement retries in ClientPoolImpl, or use RetryingMetaStoreClient. From the initial discussion, we are leaning towards option 2. Some justifications below:

Implementation details: adding RetryingMetaStoreClient makes the ClientPoolImpl retry redundant, so it is disabled for this case, but not removed, to preserve it for other client pools.
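Concretely, option 2 looks roughly like the sketch below, consistent with the snippets in the discussion above; the exact wiring in the PR may differ:

```java
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.RetryingMetaStoreClient;

public class ApproachSketch {
  // Delegate retries to Hive's RetryingMetaStoreClient and leave the pool's own
  // retry off for the Hive case, so failures are not retried N x M times.
  static IMetaStoreClient newRetryingClient(HiveConf hiveConf) throws Exception {
    return RetryingMetaStoreClient.getProxy(hiveConf, true); // second arg: allowEmbedded
  }
}
```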