
Retry failed models due to connection or server error #767

Closed
septponf opened this issue Aug 14, 2024 · 10 comments
Labels
enhancement New feature or request

Comments

@septponf

If a model fails due to an intermittent issue not related to the model itself, it would be nice to have an auto retry.
For example, over the summer we have had several scheduled model runs fail with "Remote end closed connection without response" or "Query could not be scheduled: HTTP Response code: 503. Please try again later. SQLSTATE: XX000".

For context, we are executing dbt as a Databricks job using the dbt task and a serverless SQL warehouse for compute.

For reference, I believe the BigQuery adapter (dbt-bigquery) already has such a feature.
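
For illustration, a minimal sketch (not the adapter's actual implementation) of the kind of bounded retry-with-backoff being asked for here; `run_model_sql` is a hypothetical stand-in for whatever callable submits a model's SQL:

```python
# Sketch only: retry on transient connection/server errors with exponential
# backoff, then surface the original error if all attempts fail.
import time
from http.client import RemoteDisconnected

# Errors treated as transient for this illustration.
TRANSIENT_ERRORS = (RemoteDisconnected, ConnectionError, TimeoutError)

def run_with_retry(run_model_sql, max_attempts=3, base_delay=5.0):
    """Call `run_model_sql` (a stand-in for submitting a model's SQL),
    retrying on transient errors up to `max_attempts` times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_model_sql()
        except TRANSIENT_ERRORS:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the failure
            # Exponential backoff: 5s, 10s, 20s, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
```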

@septponf added the enhancement label on Aug 14, 2024
@benc-db
Collaborator

benc-db commented Aug 14, 2024

In order to get that message, it generally has already retried for 15 minutes. What did you have in mind?

@septponf
Author

In order to get that message, it generally has already retried for 15 minutes. What did you have in mind?

OK, well, looking at the model execution timings it does not look like any retries were attempted.
See the example log below.


...
03:35:29  Running with dbt=1.7.17
03:35:31  Registered adapter: databricks=1.7.10
...
03:42:45  134 of 158 START sql table model xx  [RUN]
03:42:46  134 of 158 ERROR creating sql table model xx  [ERROR in 1.09s]
...
03:43:47  Finished running 99 view models, 59 table models, 1 hook in 0 hours 6 minutes and 47.45 seconds (407.45s).
03:43:47  
03:43:47  Completed with 1 error and 0 warnings:
03:43:47  
03:43:47    Runtime Error in model xx (models/path/to/xx.sql)
  Runtime Error
    ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
03:43:47 

@benc-db
Collaborator

benc-db commented Aug 15, 2024

Hmm, this is very strange, as it indicates the connection was actively broken; I don't think we retry in that circumstance, but we can file a bug against databricks-sql-connector to remedy that. Does this happen often? If so, I would file a ticket with Databricks to understand why you're getting disconnected.

@septponf
Author

It happened a couple of times in July, and tonight, actually.
There is no entry in the query history, so I am assuming the query does not even get provisioned in the SQL engine.

It has been the same model each time, which is strange. It has very simple logic, so I am thinking it could be due to execution timing.
Maybe it is due to parallelism: I see in the dbt log that 7 models start at the same time (same second). We run dbt with 7 threads on a 6 DBU serverless cluster.

Could it be that the request sometimes is not queued properly, or that we are exceeding some API rate limit?
I will file a support ticket to investigate this further.

Nevertheless, it would be nice to have an auto retry :-).
I am considering adding a dbt retry task after dbt run in case it fails.
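
As a sketch of that wrapper idea, assuming dbt-core >= 1.6 (where `dbt retry` was introduced) and that the script is invoked from the dbt project directory:

```python
# Illustration only: run `dbt run`, and if it fails, follow up with
# `dbt retry`, which re-runs only the nodes that errored or were skipped
# in the previous invocation (based on run_results.json).
import subprocess
import sys

def main() -> None:
    result = subprocess.run(["dbt", "run"])
    if result.returncode != 0:
        result = subprocess.run(["dbt", "retry"])
    sys.exit(result.returncode)

if __name__ == "__main__":
    main()
```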

@benc-db
Collaborator

benc-db commented Aug 16, 2024

Would you mind filing against https://github.com/databricks/databricks-sql-python? Basically, explain that we don't retry when we get 'Remote end closed connection without response', but that it should be safe to do so. In that package we aim to retry safe commands, i.e. ones that either are idempotent or that we know the server didn't receive; in this case we have evidence that getting this response means the server didn't receive the request, or otherwise that no action was taken. I will also take some version of model retry into consideration, but I do not have capacity to explore it right now.
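
A rough sketch of that retry-safety criterion in code form; this is not the actual databricks-sql-connector logic, just the rule described above:

```python
# Sketch: an error is safe to retry when we know the server never acted on
# the request, or when the command is harmless to run twice.
from http.client import RemoteDisconnected

def is_safe_to_retry(exc: BaseException, command_is_idempotent: bool) -> bool:
    # "Remote end closed connection without response": per the discussion
    # above, the server never received the request (or took no action),
    # so resubmitting cannot duplicate work.
    if isinstance(exc, RemoteDisconnected):
        return True
    # Otherwise, only retry commands that are idempotent.
    return command_is_idempotent
```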

@septponf
Author

OK, I filed a new issue:
databricks/databricks-sql-python#433

@septponf
Author

I created a ticket with Microsoft to check whether anything was going on server-side that would cut the connection.
They consulted with the Databricks engineering team, who found that the error occurs when connections are reused after being idle for more than 180 seconds.

A max-idle change that addresses the problem is included in dbt-databricks version 1.7.14. We were using 1.7.10 at the time.
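
Purely as an illustration of the failure mode described (and not how dbt-databricks implements its max-idle fix), a hypothetical idle-aware wrapper that reconnects instead of reusing a stale connection:

```python
# Sketch only: names and structure are made up for illustration.
import time

IDLE_LIMIT_SECONDS = 180  # threshold reported in the finding above

class IdleAwareConnection:
    """Reconnect rather than reuse a connection that has sat idle too long."""

    def __init__(self, connect):
        # `connect` is any zero-argument callable returning a fresh
        # DB-API-style connection object.
        self._connect = connect
        self._conn = connect()
        self._last_used = time.monotonic()

    def cursor(self):
        # If the connection has been idle past the limit, drop it and
        # open a new one instead of reusing the stale connection.
        if time.monotonic() - self._last_used > IDLE_LIMIT_SECONDS:
            self._conn.close()
            self._conn = self._connect()
        self._last_used = time.monotonic()
        return self._conn.cursor()
```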

@NodeJSmith

I can confirm that we have not seen this issue since pinning to dbt-databricks == 1.8.5.

@benc-db benc-db closed this as completed Oct 3, 2024
@cyberjar09

cyberjar09 commented Oct 4, 2024

Also, I have been using 1.7.16 and it has been better 👍. The situation improved, but we are still seeing the issue crop up from time to time 😞.

@benc-db you may need to reopen this.

@bolinzzz

We bumped dbt-databricks to 1.7.16 and it did not get rid of this issue.
