
DAG for running and storing results of long queries #4595

Open

Description

Problem

While Grafana (and other tools like it) provide good interfaces for visualising data, they are not well suited to long-running queries. Those tools rely on reasonable timeouts configured for their data sources to prevent inexpert (or unaware) users from taking down a database or Elasticsearch cluster (for example) with a big query.

Description

Instead, let's create a DAG for running these long queries. The DAG should start by being able to run queries against the catalog database, and it should be restricted to non-destructive queries. To that end, we can execute the query against the database as a separate user with read-only permissions.

The DAG should accept a string for the query and a note explaining the purpose of the query, and under what circumstances it can or cannot be terminated.

Basically, this would just be the `select_query` part of the batched update DAG.
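To illustrate, the "non-destructive only" restriction could be enforced with a small validation step before the DAG ever touches the database. This is only a rough sketch with hypothetical names (`validate_query` and the keyword list are not part of the catalog codebase); the real safety guarantee would still come from the read-only database user, with this check acting as an early, friendlier failure:

```python
import re

# Illustrative keyword list: statements we never want this DAG to run.
# The read-only DB user is the actual enforcement mechanism; this check
# just fails fast with a clear error message.
DESTRUCTIVE_KEYWORDS = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|create|grant|revoke)\b",
    re.IGNORECASE,
)


def validate_query(query: str) -> str:
    """Raise if the query is not a plain SELECT; return it stripped."""
    stripped = query.strip().rstrip(";")
    if not stripped.lower().startswith("select"):
        raise ValueError("Only SELECT queries are allowed by this DAG")
    if DESTRUCTIVE_KEYWORDS.search(stripped):
        raise ValueError("Query contains a potentially destructive keyword")
    return stripped
```

A naive keyword scan like this will reject some legitimate queries (e.g. a column literally named `update`), which is acceptable for a conservative first pass since the read-only user is the backstop.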

Alternatives

Maybe there are better approaches for long-running queries? I see some mention of using materialized views for them, but I can't find anything definitive on the matter.

@AetherUnbound @stacimc y'all's advice here would be helpful. The main thing I want to solve is not needing to open direct connections to the databases for long queries, with the understanding that some long queries are necessary and unavoidable.


Metadata


    Labels

    🌟 goal: addition (Addition of new feature)
    🟨 priority: medium (Not blocking but should be addressed soon)
    🤖 aspect: dx (Concerns developers' experience with the codebase)
    🧱 stack: catalog (Related to the catalog and Airflow DAGs)


    Projects

    • Status

      📋 Backlog
