@gopidesupavan gopidesupavan commented May 30, 2024

Adds support for the AWS Glue Data Quality service: docs, hook, operators, sensor, trigger, waiter, unit tests, and system test.

GlueDataQualityOperator: Creates a ruleset, or updates an existing one.

GlueDataQualityRuleSetEvaluationRunOperator: Runs an evaluation against one or more rulesets.

Sample DAG that creates a ruleset and runs an evaluation:

from datetime import datetime
from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import (
    GlueDataQualityOperator,
    GlueDataQualityRuleSetEvaluationRunOperator,
)

with DAG(
    dag_id="example_glue_data_quality",
    schedule="@once",
    start_date=datetime(2021, 1, 1),
    tags=["glue data quality ruleset evaluation"],
    catchup=False,
) as dag:

    rule_set_name = "test_rule_set"

    create_rule_set = GlueDataQualityOperator(
        task_id="create_rule_set",
        name=rule_set_name,
        ruleset='Rules = [ColumnLength "name" between 3 and 14]',
        data_quality_ruleset_kwargs={
            "TargetTable": {
                "TableName": "test_table",
                "DatabaseName": "test_default",
            }
        }
    )

    start_evaluation_run = GlueDataQualityRuleSetEvaluationRunOperator(
        task_id="start_evaluation_run",
        datasource={
            "GlueTable": {
                "TableName": "test_table",
                "DatabaseName": "test_default",
            }
        },
        role="arn:aws:iam::{ACCOUNT_ID}:role/GlueDataQuality",
        rule_set_names=[rule_set_name],
    )

    # The ruleset must exist before it can be evaluated.
    create_rule_set >> start_evaluation_run
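
For orientation, the two tasks above roughly correspond to the AWS Glue API calls sketched below. This is a minimal sketch and not the operators' actual implementation: the helper names `create_ruleset` and `start_evaluation_run` are hypothetical, and the `glue` argument is assumed to be a boto3 Glue client (for example `boto3.client("glue")`). The real operators also handle updates, waiting, and deferral, which are left out here.

```python
from typing import Any, List


def create_ruleset(glue: Any, name: str, ruleset: str, table: str, database: str) -> None:
    """Create a Glue Data Quality ruleset attached to a Data Catalog table."""
    glue.create_data_quality_ruleset(
        Name=name,
        Ruleset=ruleset,  # DQDL text, e.g. 'Rules = [ColumnLength "name" between 3 and 14]'
        TargetTable={"TableName": table, "DatabaseName": database},
    )


def start_evaluation_run(
    glue: Any, rule_set_names: List[str], table: str, database: str, role: str
) -> str:
    """Start an evaluation run for the given rulesets and return its run ID."""
    response = glue.start_data_quality_ruleset_evaluation_run(
        DataSource={"GlueTable": {"TableName": table, "DatabaseName": database}},
        Role=role,  # IAM role ARN that Glue assumes for the run
        RulesetNames=rule_set_names,
    )
    return response["RunId"]
```

Passing the client in as an argument keeps the sketch free of credentials handling and makes it easy to exercise with a stub.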
@vincbeck vincbeck left a comment

Some comments but it is solid overall!

@gopidesupavan

@vincbeck Thank you for all the suggestions. I've made all the changes, please review. Is the docstring for log_results sufficient? Please let me know, happy to refine. 😄

@vincbeck

Yep, it looks good, it definitely helps to understand the function!

@vincbeck vincbeck merged commit 78523fd into apache:main May 31, 2024
fdemiane pushed a commit to fdemiane/airflow that referenced this pull request Jun 6, 2024
@gopidesupavan gopidesupavan deleted the glue-data-quality branch July 5, 2024 12:29
romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Jul 26, 2024
o-nikolas added a commit to aws-mwaa/upstream-to-airflow that referenced this pull request Nov 27, 2025
Pandas is used only when the user opts into advanced output
processing by passing `show_results=True` (the default is False) to
GlueDataQualityRuleSetEvaluationRunOperator and GlueDataQualityRuleSetEvaluationRunSensor.

However, the original PR (apache#39923) that added these operators and sensors did not
include Pandas as a dependency of the Amazon Provider Package. I assume
this is because Pandas is quite a heavy dependency that we don't want
all users to have to install just for this very small use case.
So this commit catches the exception and logs a message to the user rather than
failing catastrophically as it does now.
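
The pattern described in this commit message, importing an optional dependency lazily and degrading to a log message when it is missing, can be sketched as follows. This is an illustrative sketch only and not the provider's code: the function name `format_results` and the `module_name` parameter are hypothetical, the latter added so the fallback path can be exercised without uninstalling Pandas.

```python
from __future__ import annotations

import importlib
import logging

log = logging.getLogger(__name__)


def format_results(rows: list[dict], module_name: str = "pandas") -> str | None:
    """Render rule results as a text table, or log and return None if Pandas is missing."""
    try:
        # Import lazily so that Pandas is only required when this output is requested.
        pd = importlib.import_module(module_name)
    except ImportError:
        log.warning(
            "Pandas is not installed, so detailed results cannot be shown. "
            "Install pandas to enable this output."
        )
        return None
    return pd.DataFrame(rows).to_string(index=False)
```

With this shape, a missing optional dependency costs the user the detailed output but does not fail the task.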
vincbeck pushed a commit that referenced this pull request Dec 1, 2025
RoyLee1224 pushed a commit to RoyLee1224/airflow that referenced this pull request Dec 3, 2025
Copilot AI pushed a commit to jason810496/airflow that referenced this pull request Dec 5, 2025
itayweb pushed a commit to itayweb/airflow that referenced this pull request Dec 6, 2025