Conversation

@ginone (Contributor) commented Jan 14, 2025

Summary

Allow triggering dbt Cloud jobs using project_name, environment_name, and job_name instead of job_id.

This is not a breaking change; it adds an optional, alternative way of triggering a job, and both options will work:

from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator

# regular job run by id
trigger_job_run1 = DbtCloudRunJobOperator(
    task_id="trigger_job_run1",
    job_id=48617,
    check_interval=10,
    timeout=300,
)

# equivalent job run by name
trigger_job_run3 = DbtCloudRunJobOperator(
    task_id="trigger_job_run3",
    project_name="my_dbt_project",
    environment_name="prod",
    job_name="my_dbt_job",
    check_interval=10,
    timeout=300,
)

This is beneficial in dynamically configured environments (e.g. managed by Infrastructure as Code) when job_id is not known upfront (or may change over time) and therefore hardcoding it is not convenient.



boring-cyborg bot commented Jan 14, 2025

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst).
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally; it's a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or the final approval from Committers.
  • Please follow the ASF Code of Conduct for all communication, including (but not limited to) comments on Pull Requests, the mailing list, and Slack.
  • Be sure to read the Airflow coding style.
  • Always keep your Pull Requests rebased; otherwise your build might fail due to changes not related to your commits.

Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: dev@airflow.apache.org
Slack: https://s.apache.org/airflow-slack

jaklan commented Jan 14, 2025

@joellabes another MR related to the dbt Cloud operator from our side. Could you have a look and tell us if you are fine with the approach?

@ginone force-pushed the optional-dbt-cloud-job-operator-params branch from 358ba15 to c93f3fe on January 15, 2025
@pikachuev commented Jan 16, 2025

Hi @ginone, great proposal and change. In our Airflow environment we use our own helper to retrieve job_ids by name. We found that in our case the payload from the dbt Cloud REST API is quite big to retrieve on every dbt Cloud job trigger. To reduce the payload size, we additionally filter the REST API calls by project name, for example:

project_url = (
    f"{base_url_v3}accounts/{dbt_cloud_account_id}/projects/"
    f"?account_id={dbt_cloud_account_id}"
    f"&name__icontains={project_name}"
)

or
job_url = (
    f"{base_url_v2}accounts/{dbt_cloud_account_id}/jobs/"
    f"?account_id={dbt_cloud_account_id}"
    f"&project_id={project_id}"
    f"&name__icontains={job_name}"
)
What do you think about such an improvement?
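
For illustration, a minimal, self-contained sketch of such a filtered lookup. The account ID, token, and base URL are placeholders and find_project_id is a hypothetical helper; the endpoint and the name__icontains filter come from the snippets above:

import requests

BASE_URL_V3 = "https://cloud.getdbt.com/api/v3/"  # placeholder host
ACCOUNT_ID = 12345                                # placeholder account id
TOKEN = "<dbt-cloud-api-token>"                   # placeholder token

def find_project_id(project_name: str) -> int:
    """Return the id of the first project whose name contains project_name."""
    response = requests.get(
        f"{BASE_URL_V3}accounts/{ACCOUNT_ID}/projects/",
        params={"name__icontains": project_name},  # server-side filter keeps the payload small
        headers={"Authorization": f"Token {TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()
    projects = response.json()["data"]
    if not projects:
        raise ValueError(f"no dbt Cloud project matching {project_name!r}")
    return projects[0]["id"]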

@ginone (Contributor, Author) commented Jan 16, 2025

@pikachuev thanks for the feedback!

If you look at the code, the get_job_by_name function specifically, you will notice there is already an optimization that works similarly to your suggestion. To minimize response sizes, the lookup is done in 3 steps (sketched below):

  1. list_projects using project_name as a filter (name__icontains={project_name})
  2. list_environments using project_id as a filter
  3. list_jobs using both project_id & environment_id as filters

I decided to use name__icontains only for the first step to limit the response size, as retrieving all projects would not make sense. For the 2nd and 3rd steps, my assumption was that such filtering should not be necessary, because there should be no more than a few environments in each project and a dozen or so jobs in each environment.

Now that you mention it, I think it would make sense to also use name__icontains at least for retrieving jobs, as there may be use cases with 100+ jobs in a single environment, so I'll submit that change shortly.
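
A minimal sketch of that three-step narrowing. The helper names and keyword arguments here (resolve_job_id, name_contains, single_match) are assumptions for illustration; the merged logic lives in DbtCloudHook.get_job_by_name:

def resolve_job_id(hook, project_name: str, environment_name: str, job_name: str) -> int:
    # Step 1: server-side name filter keeps the projects response small.
    projects = hook.list_projects(name_contains=project_name)     # assumed signature
    project_id = single_match(projects, project_name)["id"]

    # Step 2: projects hold only a few environments, so filtering by project suffices.
    environments = hook.list_environments(project_id=project_id)  # assumed signature
    environment_id = single_match(environments, environment_name)["id"]

    # Step 3: narrow jobs by project, environment, and (following this thread) name.
    jobs = hook.list_jobs(
        project_id=project_id,
        environment_id=environment_id,
        name_contains=job_name,                                   # assumed signature
    )
    return single_match(jobs, job_name)["id"]

def single_match(resources: list[dict], name: str) -> dict:
    """Pick exactly one resource with an exact name match, else raise."""
    matches = [r for r in resources if r["name"] == name]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match for {name!r}, got {len(matches)}")
    return matches[0]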

@ginone force-pushed the optional-dbt-cloud-job-operator-params branch from c93f3fe to 4ac4e96 on January 16, 2025
@jaklan left a comment

Apart from the very minor comments below, I'm thinking about one more approach.

Replace:

        self.job_id = job_id
        self.project_name = project_name
        self.environment_name = environment_name
        self.job_name = job_name

with*:

        self.job = job
        self.project = project
        self.environment = environment

where all these parameters can be either string or int.

  • if job is an int - that's all we need
  • if it's a string - we need project and environment
  • if any of these two is an int - it's passed directly to get_job_by_name as a resource ID
  • if any of these two is a string - the resource ID is retrieved by calling get_<project | environment>_by_name

This way we allow for a more flexible setup (e.g. job by name, environment and project by id). Of course, instead of checking the type, we could keep separate explicit parameters (i.e. have <resource>_id and <resource>_name params for all three resources), but then we would need to either:

  • add logic for mutually exclusive params
  • document that <resource>_id always has higher priority than <resource>_name

*for backward compatibility we would need to keep the job_id parameter, but we could mark it as deprecated
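
A rough sketch of that str-or-int dispatch (hypothetical function and helper names, mirroring the get_<resource>_by_name naming proposed above; Python 3.10+ union syntax):

def resolve_job(hook, job: str | int, project: str | int | None = None,
                environment: str | int | None = None) -> int:
    # An int job is already the job ID; nothing else is needed.
    if isinstance(job, int):
        return job
    if project is None or environment is None:
        raise ValueError("project and environment are required when job is a name")
    # Ints pass through as IDs; strings are resolved name -> ID first.
    project_id = project if isinstance(project, int) else hook.get_project_by_name(project)
    environment_id = (
        environment if isinstance(environment, int)
        else hook.get_environment_by_name(project_id, environment)
    )
    return hook.get_job_by_name(project_id, environment_id, job)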

@ginone force-pushed the optional-dbt-cloud-job-operator-params branch from 4ac4e96 to ea9230c on January 17, 2025
@joellabes (Contributor) commented

> could you have a look and tell us if you are fine with the approach?

I'm sorry that it's necessary (and have linked this internally for us to work on a better strategy!), but I think it's a sensible workaround for now, and it's a pattern we've seen customers use before 👍

@ginone force-pushed the optional-dbt-cloud-job-operator-params branch from ea9230c to 167523d on January 27, 2025
@ginone (Contributor, Author) commented Jan 27, 2025

@josh-fell @Lee-W thank you for the code review, I've addressed all feedback. Could you please look again?

BTW, what do you think about @jaklan's suggestion (#45634 (review))? My thoughts:

  1. This (implicit argument types) feels wrong/confusing:

     self.job: str | int,
     self.project: str | int,
     self.environment: str | int,

     I think the params should be explicit, so I would go with:

     self.job_id: int,
     self.project_id: int,
     self.environment_id: int,
     self.job_name: str,
     self.project_name: str,
     self.environment_name: str,

  2. I see this as an enhancement that could allow more flexible setups, but I would prefer to release this PR "as is" and unblock using project_name, environment_name, and job_name ASAP. I would create a new PR with the suggested enhancements.
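
For illustration, a minimal sketch of that explicit-parameter variant, including the mutual-exclusivity check mentioned in the review above (a hypothetical constructor, not the merged code):

def __init__(
    self,
    *,
    job_id: int | None = None,
    job_name: str | None = None,
    project_name: str | None = None,
    environment_name: str | None = None,
    **kwargs,
):
    # Exactly one of job_id / job_name must be given.
    if (job_id is None) == (job_name is None):
        raise ValueError("provide exactly one of job_id or job_name")
    # A job name is only unique within a project and environment.
    if job_name is not None and (project_name is None or environment_name is None):
        raise ValueError("project_name and environment_name are required with job_name")
    super().__init__(**kwargs)
    self.job_id = job_id
    self.job_name = job_name
    self.project_name = project_name
    self.environment_name = environment_name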

@ginone requested review from Lee-W and josh-fell on January 27, 2025
@ginone force-pushed the optional-dbt-cloud-job-operator-params branch 2 times, most recently from e77dff0 to 2457ee9 on January 28, 2025
@ginone (Contributor, Author) commented Jan 28, 2025


New PR that includes the changes described above: #46184

Feel free to select just one of my PRs and close the other one.

@ginone force-pushed the optional-dbt-cloud-job-operator-params branch from 2457ee9 to f41e5b3 on January 30, 2025
@Lee-W (Member) commented Feb 3, 2025

> @josh-fell @Lee-W thank you for the code review, I've addressed all feedback. Could you please look again?
>
> BTW, what do you think about @jaklan's suggestion (#45634 (review))? My thoughts:
>
> 1. This (implicit argument types) feels wrong/confusing:
>
>    self.job: str | int,
>    self.project: str | int,
>    self.environment: str | int,
>
>    I think the params should be explicit, so I would go with:
>
>    self.job_id: int,
>    self.project_id: int,
>    self.environment_id: int,
>    self.job_name: str,
>    self.project_name: str,
>    self.environment_name: str,
>
> 2. I see this as an enhancement that could allow more flexible setups, but I would prefer to release this PR "as is" and unblock using `project_name`, `environment_name`, and `job_name` ASAP. I would create a new PR with the suggested enhancements.

Hi, sorry for the late reply. I prefer this idea more. It would be great if you could rebase from the main branch and resolve the conflict, then I'll review it again. Thanks a lot!

@ginone force-pushed the optional-dbt-cloud-job-operator-params branch from f41e5b3 to c4d8752 on February 4, 2025
@ginone (Contributor, Author) commented Feb 4, 2025

@Lee-W rebasing is done, please have a look again. Thanks!

@Lee-W (Member) left a comment

Looks good to me. Thanks for your prompt reply. I'll keep it open for one to two days so others can take a look.

@ginone force-pushed the optional-dbt-cloud-job-operator-params branch from c4d8752 to 670aa4f on February 4, 2025
@ginone (Contributor, Author) commented Feb 4, 2025

> Looks good to me. Thanks for your prompt reply. I'll keep it open for one to two days so others can take a look.

Fixed one incorrect "include" path in the docs. Checks should be all green now.

Thanks!

@potiuk merged commit ca77120 into apache:main on Feb 4, 2025 (61 checks passed)
boring-cyborg bot commented Feb 4, 2025

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

niklasr22 pushed a commit to niklasr22/airflow that referenced this pull request on Feb 8, 2025:
* Update DbtCloudRunJobOperator and corresponding tests

* update documentation

* Fixes after manual tests

* add unit test for duplicate names

* Enhance DbtCloudHook to support name filtering in list_environments and list_jobs methods

* Improve clarity by renaming variables and enhancing comments

* Refactor dbt Cloud hook and operator error handling & update method signatures to use keyword-only arguments

* Correct get_job_by_name docstring

* Fix include path for dbt Cloud documentation

---------

Co-authored-by: Marcin Sitnik <marcin.sitnik@roche.com>
ambika-garg pushed a commit to ambika-garg/airflow that referenced this pull request on Feb 17, 2025, with the same commit message as above.
@ginone deleted the optional-dbt-cloud-job-operator-params branch on July 30, 2025