
Hive: Switch to RetryingHMSClient (allows configuration of retryDelays and retries) #3099

Merged: 4 commits, Oct 18, 2021

Conversation

szehon-ho (Collaborator) commented Sep 10, 2021:

Issue: #3087

There are two options: implement retry in ClientPoolImpl, or use RetryingMetaStoreClient. From the initial discussion, we are leaning towards option 2. Some justifications below:

  • RetryingMetaStoreClient is used today in Hive and Spark, and is more battle-tested. HiveClientPool would have to catch up with all the Hive exception types to retry; see Hive: Reconnect client if connection exception is wrapped in RuntimeE… #2844 for some missing exceptions
  • It handles UGI impersonation logic on reconnect, which is missing from HiveClientPool (needed in Kerberized environments)
  • ClientPoolImpl does not support any configuration of retry count or retry backoff; it hard-codes 2 attempts with 0 backoff (RetryingMetaStoreClient, for instance, defaults to 1s backoff and 3 retries)
  • Re-using RetryingMetaStoreClient unifies all Hive configs for the execution engine, instead of having per-catalog configs. For instance, in Spark, setting 'spark.hadoop.hive.metastore.client.connect.retry.delay' sets it for all Hive connections, for both Iceberg and non-Iceberg tables (see the config sketch after this list)

Implementation details: adding RetryingMetaStoreClient makes the ClientPoolImpl retry redundant, so it is disabled for this case, but kept (not removed) to preserve it for the other client pools.
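To make the configuration surface concrete, here is a minimal sketch (not part of the PR) of the Hive settings involved; hive.metastore.client.connect.retry.delay is named above, while hive.metastore.failure.retries is an assumed key that should be verified against your Hive version:

    import org.apache.hadoop.conf.Configuration;

    public class MetastoreRetryConfExample {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Delay between reconnect attempts (the config named in this description);
        // in Spark, the same key works with the spark.hadoop. prefix.
        conf.set("hive.metastore.client.connect.retry.delay", "3s");
        // Number of retries on Thrift failures (assumed key; verify for your Hive version).
        conf.set("hive.metastore.failure.retries", "5");
        System.out.println(conf.get("hive.metastore.client.connect.retry.delay"));
      }
    }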

szehon-ho (Collaborator, Author) commented Sep 13, 2021:

Retriggering; it looks like the past hang was due to #3110.

szehon-ho (Collaborator, Author) commented Sep 14, 2021:

@pvary and @rdblue, here is the change as we discussed, if you want to take a look.

After doing some tests with it, one unexpected side effect to call out for reviewers is that, due to https://issues.apache.org/jira/browse/HIVE-13014, lock() is now explicitly not allowed to retry (ref: https://github.com/apache/hive/blob/master/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java#L3413).

On the one hand, correctness-wise it makes sense to avoid the retry (it is possible the server actually acquired the lock but did not return it to the client, leaving a dangling lock). On the other hand, it is a bit of a bummer, as lock() is called on the Iceberg side on every commit and is somewhat prone to random failures, e.g. https://issues.apache.org/jira/browse/HIVE-23052.

szehon-ho (Collaborator, Author) commented Sep 16, 2021:

By the way, an update: lock() is not particularly more susceptible to random failure. The lock issue we hit was actually a deterministic one: https://issues.apache.org/jira/browse/HIVE-25522 (a corrupted HMS instance due to a bad startup). So I am not sure how big of an issue it still is that RetryingMetaStoreClient does not retry lock() calls for safety reasons.

    private static final DynMethods.StaticMethod CLIENT_CTOR = DynMethods.builder("getProxy")
        .impl(RetryingMetaStoreClient.class, HiveConf.class)
        .impl(RetryingMetaStoreClient.class, HiveConf.class, Boolean.TYPE)
        .impl(RetryingMetaStoreClient.class, Configuration.class, Boolean.TYPE)
Contributor:

Order matters here. The first implementation found will be used, so this will use getProxy(HiveConf) before it will use getProxy(HiveConf, Boolean). Is that intended?
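For illustration, a toy sketch of that first-match behavior, assuming Iceberg's DynMethods API (the target method here is just java.lang.Integer.valueOf):

    import org.apache.iceberg.common.DynMethods;

    public class DynMethodsOrderExample {
      public static void main(String[] args) {
        DynMethods.StaticMethod valueOf = DynMethods.builder("valueOf")
            .impl(Integer.class, String.class)            // resolves first, so it is bound
            .impl(Integer.class, String.class, int.class) // never consulted once the first resolves
            .buildStatic();
        Integer n = valueOf.invoke("42"); // always calls Integer.valueOf(String)
        System.out.println(n);
      }
    }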


szehon-ho (Collaborator, Author):

Yes, thanks again @jackye1995 for the pointers.

szehon-ho (Collaborator, Author) commented Sep 24, 2021:

And the last link is the Hive1 API (which gets pulled in by the spark2 tests): https://hive.apache.org/javadocs/r1.2.2/api/org/apache/hadoop/hive/metastore/RetryingMetaStoreClient.html
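For reference, the three overloads the builder targets appear to line up with the three Hive generations (paraphrased from the linked javadocs and code; parameter names are illustrative):

    // Hive 1.x: getProxy(HiveConf hiveConf)
    // Hive 2.x: getProxy(HiveConf hiveConf, boolean allowEmbedded)
    // Hive 3.x: getProxy(Configuration conf, boolean allowEmbedded)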

      try {
  -     return CLIENT_CTOR.newInstance(hiveConf);
  +     return CLIENT_CTOR.invoke(hiveConf, true);
Contributor:

Can you please add a comment that documents the true here? Something like this:

  return GET_CLIENT.invoke(hiveConf, true /* use retries */);

I'm assuming that the true is retry-related? If so, should this detect whether the Hive implementation will retry and turn retries back on?

jackye1995 (Contributor):

It's not about retry; it configures allowEmbedded for the HiveMetaStoreClient that is initialized internally by the RetryingMetaStoreClient:

https://github.com/apache/hive/blob/9f03e3c416834b18faf7a85338b130ed0742275f/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java#L165-L167

szehon-ho (Collaborator, Author):

Yes, thanks @jackye1995 for the pointer. From a quick glance, the flag seems intended only to determine whether to throw an exception if "hive.metastore.uris" is not set, so it is not terribly useful to document, unless you feel otherwise.
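For context, a paraphrased sketch (not the verbatim Hive source) of the guard that allowEmbedded controls in the linked HiveMetaStoreClient code:

    // Paraphrase: with allowEmbedded == false, a missing thrift URI fails fast
    // instead of silently starting an embedded, in-process metastore.
    if (metastoreUris.isEmpty() && !allowEmbedded) {
      throw new MetaException("Embedded metastore is not allowed here. "
          + "Please configure hive.metastore.uris");
    }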

        .build();
    // use appropriate ctor depending on whether we're working with Hive1, Hive2, or Hive3 dependencies
    // we need to do this because there is a breaking API change between Hive1, Hive2, and Hive3
    private static final DynMethods.StaticMethod CLIENT_CTOR = DynMethods.builder("getProxy")
Contributor:

This isn't really a client constructor anymore. Could you rename this GET_CLIENT or something?

szehon-ho (Collaborator, Author):

Done

    @@ -110,57 +103,4 @@ public void testGetTablesFailsForNonReconnectableException() throws Exception {
        AssertHelpers.assertThrows("Should throw exception", MetaException.class,
            "Another meta exception", () -> clients.run(client -> client.getTables("default", "t")));
      }

  -   @Test
  -   public void testConnectionFailureRestoreForMetaException() throws Exception {
Contributor:

These tests seem valuable if the client doesn't retry on its own. What about passing in an option to turn on retries and then using that to test retries here?

szehon-ho (Collaborator, Author) commented Sep 24, 2021:

Good idea, done. Put the test back.

Side note: the flexibility of making it configurable is nice, and it is a potential solution to the issue I mentioned; I am tempted to set it to true in HiveClientOperations::acquireLock on this line:

    LockResponse lockResponse = metaClients.run(client -> client.lock(lockRequest));

I can see from Hive's point of view why they don't want to auto-retry, but most exception scenarios are not going to be the corner case of having succeeded in the backend and leaving a dangling lock. In all honesty, if the Iceberg job fails because lock() throws a TTransportException, the end user will just try again, but it is wasted effort. Any thoughts on this?
cc @pvary
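For concreteness, a hypothetical sketch of that opt-in, assuming the retry-flag overload of run discussed in this PR:

    // Hypothetical: opt lock() back into pool-level retry via the new flag.
    LockResponse lockResponse = metaClients.run(client -> client.lock(lockRequest), true /* retry */);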

pvary (Contributor):

Having both Iceberg and Hive retry could also be problematic / surprising: we would just do NxM retries (2x2 in the default case).

I see 2 typical error scenarios in the case of lock:

  1. Cannot reach the HMS (the connection is broken, or something like this): retry is a perfect solution
  2. The locks are queued, but we have to wait and something happens in the meantime: retry will be blocked forever (or until the lock timeout is reached for the previously queued locks), so retry is problematic / unnecessary

How typical is it that the first request on the connection is the lock? If that happens often, then the 1st situation could happen often. If we usually get the metadata for the table first and then try to lock, then the 1st situation is unlikely to happen.

I would prefer to keep the retry logic in one place (RetryingMetaStoreClient) and simplify the Iceberg code by removing the retry altogether, but I can accept keeping the code as it is; we will most probably turn off the Iceberg retries in our deployment.

szehon-ho (Collaborator, Author):

@pvary yes, this PR disables the retry for the Hive case, but leaves an option for a particular caller to turn it on, to address your concern.

About lock(): thanks for the reply. Maybe you are right that a TTransportException is not the typical case, especially if getMetadata has already succeeded on the same connection.

The only thing bothering me is that there is no way for the client to distinguish case 1 from case 2 based on the exception type, so there is really nothing they can do except retry. There is a ShowLocks call, and potentially they could check whether there is a queued lock from the failed call, whether it is their own, and if so unlock it. But I'm not sure how reliable that is. Anyway, that is maybe for another review.

pvary (Contributor):

Maybe we could make Hive lock() SafeToRetry? I have to think this through, but we could perhaps release the enqueued lock in case of an unexpected error inside the implementation, and also delete the previous version of the lock if someone already created a lock for this transaction.

jackye1995 (Contributor) commented:

  1. The title does not include the fact that we are switching to use RetryingMetaStoreClient; can we add that information to the title?
  2. I have not checked yet, but you mentioned that some configs can be used for both Iceberg and non-Iceberg tables. Does that mean that on the Iceberg side some properties have to be re-mapped back to Hive configs after the switch?

szehon-ho force-pushed the retrying_hive_client_master branch from a3b52e5 to c9b0b87 on September 24, 2021
szehon-ho changed the title from "Implement retryDelays and retries in Iceberg Hive Client" to "Core: Switch to RetryingHMSClient (allows configuration of retryDelays and retries)" on Sep 24, 2021
szehon-ho (Collaborator, Author) commented:

Thanks for the review, guys. Addressed the feedback.

For @jackye1995's questions:

  1. Done
  2. I took a look, but there should not be any existing Iceberg configs that need to be remapped. Currently it already takes all the Hive client configurations from the HiveConf object, and anything else is Iceberg-specific. If I missed something, please do point it out.

szehon-ho changed the title from "Core: Switch to RetryingHMSClient (allows configuration of retryDelays and retries)" to "Hive: Switch to RetryingHMSClient (allows configuration of retryDelays and retries)" on Sep 24, 2021
szehon-ho (Collaborator, Author) commented:

@pvary thanks for the review; addressed the comments, sorry for the delay.

pvary (Contributor) commented Oct 12, 2021:

@szehon-ho: This version is fine by me. Let's wait a little bit for the others to chime in, in case they have any more concerns. If there are no objections by next Monday, I will merge then, as I will be OOO at the end of this week. Please ping me if I forget to push.

Thanks, Peter

szehon-ho (Collaborator, Author) commented:

Thanks a lot, Peter, for following up.

pvary merged commit 2f4b30d into apache:master on Oct 18, 2021
pvary (Contributor) commented Oct 18, 2021:

Merged.
Thanks for the PR @szehon-ho!

rdblue (Contributor) commented Oct 18, 2021:

Thanks for getting this in, @pvary and @szehon-ho!
