
Hive: Switch to RetryingHMSClient (allows configuration of retryDelays and retries) #3099

Merged: 4 commits, Oct 18, 2021

Conversation

szehon-ho (Collaborator) commented Sep 10, 2021:

Issue: #3087

There are two options: implement retry in ClientPoolImpl, or use RetryingMetaStoreClient. From the initial discussion, we are leaning towards option 2. Some justifications below:

  • RetryingMetaStoreClient is used today in Hive and Spark, and is more battle-tested. HiveClientPool would have to catch up with all the Hive exception types to retry; see Hive: Reconnect client if connection exception is wrapped in RuntimeE… #2844 for some missing exceptions
  • It handles UGI impersonation logic on reconnect, which is missing from HiveClientPool (needed in Kerberized environments)
  • ClientPoolImpl does not support any configuration of retry count or retry backoff; it hard-codes 2 attempts with 0 backoff (RetryingMetaStoreClient, for instance, defaults to 1s backoff and 3 retries)
  • Re-using RetryingMetaStoreClient unifies all Hive configs for the execution engine, instead of having per-catalog configs. For instance, in Spark, setting 'spark.hadoop.hive.metastore.client.connect.retry.delay' sets it for all Hive connections, for both Iceberg and non-Iceberg tables (see the config sketch after this list)

Implementation details: adding RetryingMetaStoreClient makes the ClientPoolImpl retry redundant, so it is disabled for this case, but kept (not removed) to preserve it for the other client pools.
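To make the configuration surface concrete, here is a minimal sketch (not part of the PR) of the Hive settings involved; hive.metastore.client.connect.retry.delay is named above, while hive.metastore.failure.retries is an assumed key that should be verified against your Hive version:

    import org.apache.hadoop.conf.Configuration;

    public class MetastoreRetryConfExample {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Delay between reconnect attempts (the config named in this description);
        // in Spark, the same key works with the spark.hadoop. prefix.
        conf.set("hive.metastore.client.connect.retry.delay", "3s");
        // Number of retries on Thrift failures (assumed key; verify for your Hive version).
        conf.set("hive.metastore.failure.retries", "5");
        System.out.println(conf.get("hive.metastore.client.connect.retry.delay"));
      }
    }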

szehon-ho (Collaborator, Author) commented Sep 13, 2021:

Retriggering; it looks like the past hang was due to #3110.

szehon-ho (Collaborator, Author) commented Sep 14, 2021:

@pvary and @rdblue, here is the change as we discussed, if you want to take a look.

After doing some tests with it, one unexpected side effect to call out for reviewers is that, due to https://issues.apache.org/jira/browse/HIVE-13014, lock() is now explicitly not allowed to retry (ref: https://github.com/apache/hive/blob/master/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java#L3413).

On the one hand, correctness-wise it makes sense to avoid the retry (it is possible the server actually acquired the lock but did not return it to the client, leaving a dangling lock). On the other hand, it is a bit of a bummer, as lock() is called on the Iceberg side on every commit and is somewhat prone to random failures, e.g. https://issues.apache.org/jira/browse/HIVE-23052.

szehon-ho (Collaborator, Author) commented Sep 16, 2021:

By the way, an update: lock() is not particularly more susceptible to random failure. The lock issue we hit was actually a deterministic one: https://issues.apache.org/jira/browse/HIVE-25522 (a corrupted HMS instance due to a bad startup). So I am not sure how big of an issue it still is that RetryingMetaStoreClient does not retry lock() calls for safety reasons.

    private static final DynMethods.StaticMethod CLIENT_CTOR = DynMethods.builder("getProxy")
        .impl(RetryingMetaStoreClient.class, HiveConf.class)
        .impl(RetryingMetaStoreClient.class, HiveConf.class, Boolean.TYPE)
        .impl(RetryingMetaStoreClient.class, Configuration.class, Boolean.TYPE)
Contributor:

Order matters here. The first implementation found will be used, so this will use getProxy(HiveConf) before it will use getProxy(HiveConf, Boolean). Is that intended?
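For illustration, a toy sketch of that first-match behavior, assuming Iceberg's DynMethods API (the target method here is just java.lang.Integer.valueOf):

    import org.apache.iceberg.common.DynMethods;

    public class DynMethodsOrderExample {
      public static void main(String[] args) {
        DynMethods.StaticMethod valueOf = DynMethods.builder("valueOf")
            .impl(Integer.class, String.class)            // resolves first, so it is bound
            .impl(Integer.class, String.class, int.class) // never consulted once the first resolves
            .buildStatic();
        Integer n = valueOf.invoke("42"); // always calls Integer.valueOf(String)
        System.out.println(n);
      }
    }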


szehon-ho (Collaborator, Author):

Yes, thanks again @jackye1995 for the pointers.

szehon-ho (Collaborator, Author) commented Sep 24, 2021:

And the last link is the Hive1 API (which gets pulled in by the spark2 tests): https://hive.apache.org/javadocs/r1.2.2/api/org/apache/hadoop/hive/metastore/RetryingMetaStoreClient.html
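For reference, the three overloads the builder targets appear to line up with the three Hive generations (paraphrased from the linked javadocs and code; parameter names are illustrative):

    // Hive 1.x: getProxy(HiveConf hiveConf)
    // Hive 2.x: getProxy(HiveConf hiveConf, boolean allowEmbedded)
    // Hive 3.x: getProxy(Configuration conf, boolean allowEmbedded)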

      try {
  -     return CLIENT_CTOR.newInstance(hiveConf);
  +     return CLIENT_CTOR.invoke(hiveConf, true);
Contributor:

Can you please add a comment that documents the true here? Something like this:

  return GET_CLIENT.invoke(hiveConf, true /* use retries */);

I'm assuming that the true is retry-related? If so, should this detect whether the Hive implementation will retry and turn retries back on?

jackye1995 (Contributor):

It's not about retry; it configures allowEmbedded for the HiveMetaStoreClient that is initialized internally by the RetryingMetaStoreClient:

https://github.com/apache/hive/blob/9f03e3c416834b18faf7a85338b130ed0742275f/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java#L165-L167

szehon-ho (Collaborator, Author):

Yes, thanks @jackye1995 for the pointer. From a quick glance, the flag seems intended only to determine whether to throw an exception if "hive.metastore.uris" is not set, so it is not terribly useful to document, unless you feel otherwise.
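For context, a paraphrased sketch (not the verbatim Hive source) of the guard that allowEmbedded controls in the linked HiveMetaStoreClient code:

    // Paraphrase: with allowEmbedded == false, a missing thrift URI fails fast
    // instead of silently starting an embedded, in-process metastore.
    if (metastoreUris.isEmpty() && !allowEmbedded) {
      throw new MetaException("Embedded metastore is not allowed here. "
          + "Please configure hive.metastore.uris");
    }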

        .build();
    // use appropriate ctor depending on whether we're working with Hive1, Hive2, or Hive3 dependencies
    // we need to do this because there is a breaking API change between Hive1, Hive2, and Hive3
    private static final DynMethods.StaticMethod CLIENT_CTOR = DynMethods.builder("getProxy")
Contributor:

This isn't really a client constructor anymore. Could you rename this GET_CLIENT or something?

szehon-ho (Collaborator, Author):

Done

    @@ -110,57 +103,4 @@ public void testGetTablesFailsForNonReconnectableException() throws Exception {
        AssertHelpers.assertThrows("Should throw exception", MetaException.class,
            "Another meta exception", () -> clients.run(client -> client.getTables("default", "t")));
      }

  -   @Test
  -   public void testConnectionFailureRestoreForMetaException() throws Exception {
Contributor:

These tests seem valuable if the client doesn't retry on its own. What about passing in an option to turn on retries and then using that to test retries here?

szehon-ho (Collaborator, Author) commented Sep 24, 2021:

Good idea, done. Put the test back.

Side note: the flexibility of making it configurable is nice, and it is a potential solution to the issue I mentioned; I am tempted to set it to true in HiveClientOperations::acquireLock on this line:

    LockResponse lockResponse = metaClients.run(client -> client.lock(lockRequest));

I can see from Hive's point of view why they don't want to auto-retry, but most exception scenarios are not going to be the corner case of having succeeded in the backend and leaving a dangling lock. In all honesty, if the Iceberg job fails because lock() throws a TTransportException, the end user will just try again, but it is wasted effort. Any thoughts on this?
cc @pvary
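For concreteness, a hypothetical sketch of that opt-in, assuming the retry-flag overload of run discussed in this PR:

    // Hypothetical: opt lock() back into pool-level retry via the new flag.
    LockResponse lockResponse = metaClients.run(client -> client.lock(lockRequest), true /* retry */);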

pvary (Contributor):

Having both Iceberg and Hive retry could also be problematic / surprising: we would just do NxM retries (2x2 in the default case).

I see 2 typical error scenarios in the case of lock:

  1. Cannot reach the HMS (the connection is broken, or something like this): retry is a perfect solution
  2. The locks are queued, but we have to wait and something happens in the meantime: retry will be blocked forever (or until the lock timeout is reached for the previously queued locks), so retry is problematic / unnecessary

How typical is it that the first request on the connection is the lock? If that happens often, then the 1st situation could happen often. If we usually get the metadata for the table first and then try to lock, then the 1st situation is unlikely to happen.

I would prefer to keep the retry logic in one place (RetryingMetaStoreClient) and simplify the Iceberg code by removing the retry altogether, but I can accept keeping the code as it is; we will most probably turn off the Iceberg retries in our deployment.

szehon-ho (Collaborator, Author):

@pvary yes, this PR disables the retry for the Hive case, but leaves an option for a particular caller to turn it on, to address your concern.

About lock(): thanks for the reply. Maybe you are right that a TTransportException is not the typical case, especially if getMetadata has already succeeded on the same connection.

The only thing bothering me is that there is no way for the client to distinguish case 1 from case 2 based on the exception type, so there is really nothing they can do except retry. There is a ShowLocks call, and potentially they could check whether there is a queued lock from the failed call, whether it is their own, and if so unlock it. But I'm not sure how reliable that is. Anyway, that is maybe for another review.

pvary (Contributor):

Maybe we could make Hive lock() SafeToRetry? I have to think this through, but we could perhaps release the enqueued lock in case of an unexpected error inside the implementation, and also delete the previous version of the lock if someone already created a lock for this transaction.

jackye1995 (Contributor) commented:

  1. The title does not include the fact that we are switching to use RetryingMetaStoreClient; can we add that information to the title?
  2. I have not checked yet, but you mentioned that some configs can be used for both Iceberg and non-Iceberg tables. Does that mean that on the Iceberg side some properties have to be re-mapped back to Hive configs after the switch?

szehon-ho force-pushed the retrying_hive_client_master branch from a3b52e5 to c9b0b87 on September 24, 2021
szehon-ho changed the title from "Implement retryDelays and retries in Iceberg Hive Client" to "Core: Switch to RetryingHMSClient (allows configuration of retryDelays and retries)" on Sep 24, 2021
szehon-ho (Collaborator, Author) commented:

Thanks for the review, guys. Addressed the feedback.

For @jackye1995's questions:

  1. Done
  2. I took a look, but there should not be any existing Iceberg configs that need to be remapped. Currently it already takes all the Hive client configurations from the HiveConf object, and anything else is Iceberg-specific. If I missed something, please do point it out.

szehon-ho changed the title from "Core: Switch to RetryingHMSClient (allows configuration of retryDelays and retries)" to "Hive: Switch to RetryingHMSClient (allows configuration of retryDelays and retries)" on Sep 24, 2021
szehon-ho (Collaborator, Author) commented:

@pvary thanks for the review; addressed the comments, sorry for the delay.

pvary (Contributor) commented Oct 12, 2021:

@szehon-ho: This version is fine by me. Let's wait a little bit for the others to chime in, in case they have any more concerns. If there are no objections by next Monday, I will merge then, as I will be OOO at the end of this week. Please ping me if I forget to push.

Thanks, Peter

szehon-ho (Collaborator, Author) commented:

Thanks a lot, Peter, for following up.

pvary merged commit 2f4b30d into apache:master on Oct 18, 2021
pvary (Contributor) commented Oct 18, 2021:

Merged.
Thanks for the PR @szehon-ho!

rdblue (Contributor) commented Oct 18, 2021:

Thanks for getting this in, @pvary and @szehon-ho!
