
Adding retry logic to services in google-cloud-python #2694

Closed · daspecster opened this issue Nov 5, 2016 · 23 comments
Labels: api: core · priority: p2 (Moderately-important priority. Fix may not be included in next release.)

@daspecster (Contributor) commented Nov 5, 2016

I will see if I can find it, but I think there was a thread somewhere that @tseaver and @dhermes were talking about having retry be an option that you passed into the call to the service.

I think some of the other libraries just retry by default but I'll try and dig a little bit.

Issues involving retry:

@flundermicha

That would be great, because retrying would also solve #2699 [logging]

@geigerj (Contributor) commented Nov 28, 2016

@dhermes @daspecster To follow up on our offline discussion:

  • The GAPIC API object constructor takes as an optional parameter a retry configuration object (example in JSON; use a Python dict in the constructor instead; see the sketch after this list).
  • This dict allows you to override the default retrying behavior for all GAPIC methods.
    • You can pass a partial configuration (i.e., configure some but not all of the methods in an API).
    • You can configure on which GRPC status codes to retry for which methods, as well as set the parameters for the exponential backoff retry algorithm.
  • Generally, by default, GAPIC methods retry automatically if the corresponding HTTP call uses a verb that is idempotent.
  • Every GAPIC package contains in the package data the default configuration (example).
  • You can change the retrying behavior on a per-method-call (as opposed to per-method) basis using the GAX CallOptions object. Every GAPIC method takes a CallOptions object as an optional parameter.
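
For concreteness, here is a minimal sketch of the kind of configuration dict described above, mirroring the layout of the published GAPIC JSON configs. The service name, method name, and parameter values are illustrative, not copied from a real config:

retry_config = {
    "interfaces": {
        "google.datastore.v1.Datastore": {
            # Which gRPC status codes count as retryable, by group name.
            "retry_codes": {
                "idempotent": ["DEADLINE_EXCEEDED", "UNAVAILABLE"],
                "non_idempotent": [],
            },
            # Exponential backoff parameters, by named profile.
            "retry_params": {
                "default": {
                    "initial_retry_delay_millis": 100,
                    "retry_delay_multiplier": 1.3,
                    "max_retry_delay_millis": 60000,
                    "initial_rpc_timeout_millis": 60000,
                    "rpc_timeout_multiplier": 1.0,
                    "max_rpc_timeout_millis": 60000,
                    "total_timeout_millis": 600000,
                },
            },
            # Per-method wiring of a code group to a backoff profile.
            "methods": {
                "Lookup": {
                    "timeout_millis": 60000,
                    "retry_codes_name": "idempotent",
                    "retry_params_name": "default",
                },
            },
        },
    },
}

A partial configuration would simply omit the interfaces or methods whose defaults you do not want to override.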

@daspecster (Contributor, Author)

Another consideration: we should also have something for our REST implementations. AFAICT GAPIC is only for gRPC, right?

@geigerj (Contributor) commented Nov 29, 2016

@daspecster Yes, GAPIC currently supports gRPC only. But for whatever you build for REST, it would be ideal to keep it compatible with the GAPIC retry model, so that if we "switch out the plumbing" at a later date and support gRPC for those APIs, you won't have to change the surface or duplicate GAPIC functionality.

@bendemaree

#2583 is also relevant to the retry discussion; as we still cannot properly determine which datastore requests to retry, perhaps it can be covered here.

@ghost commented Jan 17, 2017

#2896 affects me as well, end of the stack trace I get is attached to that ticket.

@lukesneeringer (Contributor)

@geigerj What bugs are retried by default? (Looking at an internal bug where Datastore connections time out after 30 minutes, and auto-retry does not work.)

@geigerj (Contributor) commented Jul 5, 2017

Assuming by "bugs" you mean status codes, by default we retry only on UNAVAILABLE and DEADLINE_EXCEEDED, and only for idempotent calls.

However, that configuration can be overridden at a per-service level (example: Pubsub).

It would be useful to know what status code Datastore is returning with the 30min timeout; we can add that to the GAPIC config in that case.

@theacodes (Contributor)

Now that tenacity is a dependency of google-cloud-core, it should be significantly easier to build in consistent retries for all libraries.
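
As a hypothetical sketch of what that could look like (this is not the actual google-cloud-core implementation, just tenacity's public API applied at the connection layer):

import tenacity
from google.api_core import exceptions

# Hypothetical: retry any 5xx-style ServerError with exponential backoff,
# up to five attempts, then re-raise the original exception.
@tenacity.retry(
    retry=tenacity.retry_if_exception_type(exceptions.ServerError),
    wait=tenacity.wait_exponential(multiplier=1, max=60),
    stop=tenacity.stop_after_attempt(5),
    reraise=True,
)
def api_request_with_retry(connection, *args, **kwargs):
    return connection.api_request(*args, **kwargs)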

@xmedeko commented Jul 20, 2017

From my point of view the retry logic should be simple and built into just the core, something like "retry on HTTP 5xx". Otherwise it will be hard to keep it consistent across Python, Go, PHP, Java, C#, ... or custom HTTP code. I understand that is not always possible, but the main logic should live in the core and the libraries should cover just the edge cases.
If you disagree, then I just hope all the language libraries (Python, Go, PHP, Java, C#, ...) keep this retry logic consistent 😁

We use Python (googleapiclient, google-cloud-python), C# (google-cloud-dotnet) and PHP (first custom HTTP, then google-cloud-php), and we have always had to spend time digging through the docs and code to find out how each library supports retrying.

@theacodes (Contributor)

@xmedeko I'm not sure what you mean. Our libraries do not share a common core across languages. We have to implement retry in each language's core library.

@xmedeko commented Jul 21, 2017

Maybe I misunderstood your post, "it should be significantly easier to build in consistent retries for all libraries". I would expect the retry logic to live only in google-cloud-core, so that libraries like google-cloud-storage and google-cloud-datastore do not need to implement it themselves, and I could just set some core settings (e.g. NUM_RETRIES=6) to influence all libraries.

I understand there is no shared core across the language libraries. My concern is that they should still share the same logic (rules, principles) across languages; I found different HTTP/JSON request retry logic when I investigated this half a year ago.
Also, I miss a general doc about retry logic for all Google APIs. E.g., I have found some for gsutil and storage, but in my experience these 5xx errors happen in other APIs too. I would expect a general "Google Cloud API Retry Handling Strategy" doc that the other API docs would simply link to. I know doc issues do not belong here; just to explain my previous post.

@lukesneeringer (Contributor)

I am closing this as @jonparrott is already working on retry as part of his google.api.core work.

@dojeda commented Feb 20, 2018

Hello,
I guess that there has been no update on this issue, since it has been closed and I could not find any other reference to this problem in another issue or PR.
I just wanted to leave a code example that I will be using to work around this problem, based on the backoff library for the retry logic (retryable can be used as well, but it seems easier to use backoff):

import backoff
import google.cloud.storage
from google.api_core import exceptions


class _RobustConnection(google.cloud.storage._http.Connection):
    @backoff.on_exception(backoff.expo, exceptions.ServerError, max_tries=5)
    def api_request(self, *args, **kwargs):
        return super(_RobustConnection, self).api_request(*args, **kwargs)


class RobustGoogleStorageClient(google.cloud.storage.client.Client):
    """A robust client for google.cloud.storage

    This will only work on regular requests/operations.
    It will *not* work on batch operations.
    """
    def __init__(self, *args, **kwargs):
        super(RobustGoogleStorageClient, self).__init__(*args, **kwargs)
        # Change directly the _base_connection member because changing the
        # _connection property will fail: the connection is already initialized
        # in the parent constructor
        self._base_connection = _RobustConnection(self)

I cannot thoroughly test this because I cannot generate 5xx errors on demand, but I think this could help.
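
If the above works, usage should be a drop-in replacement for the standard client (the bucket and object names here are illustrative):

client = RobustGoogleStorageClient()
bucket = client.get_bucket('my-bucket')  # transient 5xx errors are retried
blob = bucket.blob('some-object')
print(blob.exists())  # also goes through the patched api_request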

@chemelnucfin (Contributor)

Thank you for the work. We should also have a new retry framework in the api_core module in most (if not all) updated versions.

@dojeda commented Feb 21, 2018

Cool, I am looking forward to the retry framework.

Just an update: as I commented before, I've tried patching the client with relative success: it works for any request that goes through the self._connection member, but other operations can still fail. Nevertheless, I think this approach solved 90% of my current problems.

My next candidate is Blob.upload_from_file, which uses a multipart or resumable upload.
This one is failing with a requests.exceptions.ConnectionError: ('Connection aborted.', BrokenPipeError(32, 'Broken pipe')), although it does not happen often.

I found that it would be too much work to pin down and patch how requests are made internally by the Blob object, so I just wrapped the one place where I upload files in a function decorated with a retry library, as in my previous post.
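
A sketch of that wrapper, using the same backoff approach as before (the function name is illustrative; upload_from_filename reopens the file, so each retried attempt starts from a fresh handle):

import backoff
import requests.exceptions


@backoff.on_exception(backoff.expo,
                      requests.exceptions.ConnectionError,
                      max_tries=5)
def robust_upload(blob, path):
    # Retries the whole upload on connection errors; no stale
    # file handle is reused between attempts.
    blob.upload_from_filename(path)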

@speedplane

The retry framework is nice, but it would be nicer if it were a bit more configurable. Many of the defaults are less than ideal.

To change datastore retry options, I currently monkey patch as follows:

    import logging

    def _patch_datastore_retry_config():  # wrapper implied by the bare `return`
        try:
            from google.cloud.datastore_v1.gapic import datastore_client_config
        except ImportError:
            return
        logging.info("Overwriting Datastore Client Library Retry Config")
        datastore_client_config.config = { ... }

@theacodes (Contributor)

@speedplane unless you need to overwrite the defaults for all methods, I don't think you need to do that. Each method should allow you to pass in a retry argument.
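
A sketch of such a per-call override with google.api_core.retry (the client and method shown are illustrative; the set of methods accepting retry= varies by library and version):

from google.api_core import exceptions, retry

custom_retry = retry.Retry(
    predicate=retry.if_exception_type(
        exceptions.ServiceUnavailable,
        exceptions.DeadlineExceeded,
    ),
    initial=0.1,     # seconds before the first retry
    multiplier=1.3,  # exponential backoff factor
    maximum=60.0,    # cap on the delay between attempts
    deadline=600.0,  # total time budget across all attempts
)

# e.g., with a GAPIC-generated client that accepts a retry kwarg:
# response = datastore_client.lookup(project_id, keys, retry=custom_retry)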

@speedplane

Yes, but at least in my project this library is buried under multiple layers of middleware libraries, so touching all of the calls is not straightforward. A global setting or a config file would be nice. In the meantime I can just monkey-patch.

@speedplane

For instance, the datastore client doesn’t have a configurable retry parameter (https://github.com/GoogleCloudPlatform/google-cloud-python/blob/master/datastore/google/cloud/datastore/client.py) and I have further libraries that sit on top of that.

@xmedeko commented Apr 13, 2018

IMO it's nice to have the possibility to change the default value for all calls. I also increase the default retries. (For the old google-api-python-client I use my own subclass of googleapiclient.http.HttpRequest with an overridden execute method.)

@verasativa

After reading this thread I'm still confused about how to fix this problem. Is there a config/parameter where the "retry strategy" can be set, or another issue where this is discussed?

@philsnow commented Nov 17, 2020

I ended up doing the same thing as @xmedeko, passing a subclass of googleapiclient.http.HttpRequest via the requestBuilder kwarg of googleapiclient.discovery.build, something like this:

import googleapiclient.discovery
import googleapiclient.http

class MyHttpRequest(googleapiclient.http.HttpRequest):
    # Raise the default num_retries for every request built by this class.
    def execute(self, http=None, num_retries=20):
        return super().execute(http=http, num_retries=num_retries)

creds = ...
sheets = googleapiclient.discovery.build(
    'sheets', 'v4', credentials=creds, requestBuilder=MyHttpRequest)
