Skip to content

feat: Retry Clickhouse queries in Dagster jobs if they fail due to cluster-wide memory limits #31015

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 9, 2025

Conversation

tkaemming
Copy link
Contributor

Problem

Sometimes administrative queries fail because other queries on the server are eating all the memory, e.g. https://dagster.prod-eu.posthog.dev/runs/57d97122-b477-43e1-a4db-02570906aea7

Changes

Retries queries if they are killed due to server-wide memory limits. Queries are not retried if they exceed single query memory limits.

Does this work well for both Cloud and self-hosted?

Yes

How did you test this code?

🙃

@tkaemming tkaemming requested a review from a team April 9, 2025 17:57
@tkaemming tkaemming marked this pull request as ready for review April 9, 2025 17:57
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Summary

The changes implement a retry mechanism in Dagster jobs to handle Clickhouse queries failing due to cluster-wide memory limits. This distinction ensures that only transient cluster memory issues trigger a retry.

  • Modified /dags/common.py to include a lambda function detecting "Memory limit (total) exceeded" errors.
  • Distinguishes retries for cluster-wide memory limits versus single query limits.
  • Improves system resiliency by automatically retrying affected queries.

1 file(s) reviewed, no comment(s)
Edit PR Review Bot Settings | Greptile

@tkaemming tkaemming merged commit 8afe134 into master Apr 9, 2025
97 checks passed
@tkaemming tkaemming deleted the retry-on-total-memory-limit branch April 9, 2025 22:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants