Retry connection on 'Remote end closed connection without response' #433

Open
septponf opened this issue Aug 29, 2024 · 7 comments

@septponf

septponf commented Aug 29, 2024

I initially reported this issue against dbt-databricks but was asked by a collaborator to file it here.

When running dbt against a serverless SQL warehouse, we get intermittent errors like the one below, and the DAG execution is aborted.
Runtime Error
('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

I believe it should be safe to retry the connection and then retry the execution.

databricks-sql-connector==2.9.5

@kravets-levko
Contributor

Hi @septponf! Thank you for reporting this issue. Yes, the retry logic is currently not perfect, and each report like this helps make it better. I need to explore your issue more, but in general retrying query execution is not safe, even on errors like a terminated connection. If you submitted a query but didn't get a response from the server (for any reason), you don't know at which stage query execution failed. Maybe the server didn't even start processing the request, or maybe the query was executed successfully but the server failed while sending the response to the library. It's relatively safe to re-execute read queries (e.g. SELECT or SHOW TABLES), but it's definitely not safe to re-execute queries that update data.

We do have a list of errors that are safe to retry, and this library still doesn't fully implement it. Right now I'm doing another iteration to improve the retry logic of this library, and I will check what I can do in your case. But keep in mind what I explained above: in some cases, only the user can decide what is safe to retry and what's not.
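
To illustrate the point that only the caller can decide what is safe to retry, here is a minimal caller-side sketch. It is not part of databricks-sql-connector: `run_with_retry`, `execute`, and `idempotent` are hypothetical names, and catching the built-in `ConnectionError` is an assumption. `http.client.RemoteDisconnected` is a subclass of it, but the error in this issue arrives wrapped by the HTTP layer, so real code would have to catch whatever exception the connector actually raises.

```python
import time
from http.client import RemoteDisconnected
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(
    execute: Callable[[], T],
    *,
    idempotent: bool,
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call execute(); re-run after a dropped connection only if the
    caller has declared the statement idempotent (hypothetical helper)."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    # A non-idempotent statement gets exactly one attempt.
    max_attempts = attempts if idempotent else 1
    for attempt in range(1, max_attempts + 1):
        try:
            return execute()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # Exponential backoff before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

With `idempotent=False` the wrapper never re-sends the statement, which matches the concern above: after a dropped connection the caller cannot tell whether the server already ran it.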

@septponf
Author

Thank you @kravets-levko for the swift response.
I understand the general predicament.
But general retries for statements such as select x, show x, alter x, and create or replace x would be nice.

I appreciate you looking into this
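
The statement types listed above could be recognized with something like the sketch below. This is only an illustration: `looks_retryable` is a hypothetical helper, and matching on the leading keyword is an assumption that breaks down for statements starting with a comment or a `WITH` clause. Whether every `ALTER` is really harmless to repeat is also an open question.

```python
import re

# Assumption: the leading keyword(s) indicate that re-running the statement
# is harmless. The list mirrors the statement types named in this comment.
_RETRYABLE = re.compile(
    r"^\s*(select|show|alter|create\s+or\s+replace)\b",
    re.IGNORECASE,
)


def looks_retryable(sql: str) -> bool:
    """Return True if the statement starts with one of the listed keywords."""
    return _RETRYABLE.match(sql) is not None
```

Anything not on the list, such as `INSERT` or a plain `CREATE TABLE`, is treated as not retryable by default.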

@NodeJSmith
Contributor

@kravets-levko I found this issue while looking into the same problem with dbt runs against serverless warehouses. I wanted to add a comment: on the issue in the dbt-databricks repo, @benc-db said the following:

Would you mind filing against https://github.com/databricks/databricks-sql-python? Basically explain that we don't retry when we get 'Remote end closed connection without response', but that it should be safe to do so? In that package we aim to retry safe commands, i.e. ones that either are idempotent or that we know the server didn't receive, but in this case we have evidence that getting this response means the server didn't receive or otherwise that no action was taken. I will also take into consideration some version of model retry, but do not have capacity to explore right now.

You stated in your response to @septponf that we shouldn't retry these because it is not safe. Does what @benc-db stated change that? It seems that the two of you have differing opinions on whether it is truly safe to retry here. If @benc-db is correct and we can be confident that this error means the query was never attempted, then we should be able to retry.

On the other hand, if you are correct and we cannot guarantee that it is safe to retry based on this error message then we can likely just close the issue on this repo, as it will by necessity need to be handled downstream (or so I would think).

@benc-db
Collaborator

benc-db commented Sep 19, 2024

@NodeJSmith since I made that comment, I have seen issues where the connection gets broken but the Thrift server does schedule the command for execution ;(

@NodeJSmith
Contributor

Damn, that's unfortunate. Would there be any way to query the Databricks API for the status of the query using the statement ID, and decide whether to retry based on that, like with the get_status call?

@benc-db
Collaborator

benc-db commented Sep 19, 2024

If we have a statement ID, does that mean it was scheduled? I think the core idea makes sense if we have the ID available in cases where we get disconnected: don't reissue, but just check whether the server knows about it. That might also fail, because in the scenarios I'm thinking of the server is so overloaded that we stop getting responses, but it's something to try. @kravets-levko thoughts?
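
The "don't reissue, just check" idea might look roughly like the sketch below. Everything in it is hypothetical: `lookup_status` stands in for whatever call can report a statement's state by ID (the `get_status` call mentioned above would be one candidate), a return value of `None` is assumed to mean the server has no record of the statement, and `Decision` is just an illustrative type.

```python
from enum import Enum
from typing import Callable, Optional


class Decision(Enum):
    REISSUE = "reissue"  # server has no record of the statement
    WAIT = "wait"        # server knows the statement; do not send it again
    UNKNOWN = "unknown"  # could not find out; leave the decision to the caller


def after_disconnect(
    statement_id: Optional[str],
    lookup_status: Callable[[str], Optional[str]],
) -> Decision:
    """Decide what to do after a dropped connection (hypothetical helper)."""
    if statement_id is None:
        # No ID was received, so there is nothing to look up. That does not
        # prove the server never scheduled the statement.
        return Decision.UNKNOWN
    try:
        state = lookup_status(statement_id)
    except ConnectionError:
        # The status check itself can fail if the server is overloaded.
        return Decision.UNKNOWN
    return Decision.REISSUE if state is None else Decision.WAIT
```

The `UNKNOWN` branch covers the overloaded-server scenario described above: when the status check also fails, the sketch does not guess and leaves the choice to the caller.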

@bolinzzz

Hi all, do you guys have any workaround for this? We are using databricks-sql-connector 2.9.5 and running into this issue quite frequently.

5 participants