Conversation

@ginone (Contributor) commented Jan 14, 2025

Summary

Allow triggering dbt Cloud jobs using project_name, environment_name, and job_name instead of job_id.

This is not a breaking change; it adds an optional, alternative way of triggering a job, and both options will work:

from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator

# regular job run by id
trigger_job_run1 = DbtCloudRunJobOperator(
    task_id="trigger_job_run1",
    job_id=48617,
    check_interval=10,
    timeout=300,
)

# equivalent job run by name
trigger_job_run3 = DbtCloudRunJobOperator(
    task_id="trigger_job_run3",
    project_name="my_dbt_project",
    environment_name="prod",
    job_name="my_dbt_job",
    check_interval=10,
    timeout=300,
)

This is beneficial in dynamically configured environments (e.g. managed by Infrastructure as Code) when job_id is not known upfront (or may change over time) and therefore hardcoding it is not convenient.



boring-cyborg bot commented Jan 14, 2025

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst).
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally; it's a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or the final approval from Committers.
  • Please follow the ASF Code of Conduct for all communication, including (but not limited to) comments on Pull Requests, the mailing list, and Slack.
  • Be sure to read the Airflow coding style.
  • Always keep your Pull Requests rebased; otherwise your build might fail due to changes not related to your commits.

Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: dev@airflow.apache.org
Slack: https://s.apache.org/airflow-slack

jaklan commented Jan 14, 2025

@joellabes another MR related to the dbt Cloud operator from our side. Could you have a look and tell us if you are fine with the approach?

@ginone force-pushed the optional-dbt-cloud-job-operator-params branch from 358ba15 to c93f3fe on January 15, 2025
@pikachuev commented Jan 16, 2025

Hi @ginone, great proposal and change. In our Airflow environment we use our own helper to retrieve job_ids by name. We found that in our case the payload from the dbt Cloud REST API is quite big to retrieve on every dbt Cloud job trigger. To reduce the payload size, we additionally filter the REST API calls by project name, for example:

project_url = (
    f"{base_url_v3}accounts/{dbt_cloud_account_id}/projects/"
    f"?account_id={dbt_cloud_account_id}"
    f"&name__icontains={project_name}"
)

or
job_url = (
    f"{base_url_v2}accounts/{dbt_cloud_account_id}/jobs/"
    f"?account_id={dbt_cloud_account_id}"
    f"&project_id={project_id}"
    f"&name__icontains={job_name}"
)
What do you think about such an improvement?
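
For illustration, a minimal, self-contained sketch of such a filtered lookup. The account ID, token, and base URL are placeholders and find_project_id is a hypothetical helper; the endpoint and the name__icontains filter come from the snippets above:

import requests

BASE_URL_V3 = "https://cloud.getdbt.com/api/v3/"  # placeholder host
ACCOUNT_ID = 12345                                # placeholder account id
TOKEN = "<dbt-cloud-api-token>"                   # placeholder token

def find_project_id(project_name: str) -> int:
    """Return the id of the first project whose name contains project_name."""
    response = requests.get(
        f"{BASE_URL_V3}accounts/{ACCOUNT_ID}/projects/",
        params={"name__icontains": project_name},  # server-side filter keeps the payload small
        headers={"Authorization": f"Token {TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()
    projects = response.json()["data"]
    if not projects:
        raise ValueError(f"no dbt Cloud project matching {project_name!r}")
    return projects[0]["id"]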

@ginone (Contributor, Author) commented Jan 16, 2025

@pikachuev thanks for the feedback!

If you look at the code, the get_job_by_name function specifically, you will notice there is already an optimization that works similarly to your suggestion. To minimize response sizes, the lookup is done in 3 steps (sketched below):

  1. list_projects using project_name as a filter (name__icontains={project_name})
  2. list_environments using project_id as a filter
  3. list_jobs using both project_id & environment_id as filters

I decided to use name__icontains only for the first step to limit the response size, as retrieving all projects would not make sense. For the 2nd and 3rd steps, my assumption was that such filtering should not be necessary, because there should be no more than a few environments in each project and a dozen or so jobs in each environment.

Now that you mention it, I think it would make sense to also use name__icontains at least for retrieving jobs, as there may be use cases with 100+ jobs in a single environment, so I'll submit that change shortly.
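
A minimal sketch of that three-step narrowing. The helper names and keyword arguments here (resolve_job_id, name_contains, single_match) are assumptions for illustration; the merged logic lives in DbtCloudHook.get_job_by_name:

def resolve_job_id(hook, project_name: str, environment_name: str, job_name: str) -> int:
    # Step 1: server-side name filter keeps the projects response small.
    projects = hook.list_projects(name_contains=project_name)     # assumed signature
    project_id = single_match(projects, project_name)["id"]

    # Step 2: projects hold only a few environments, so filtering by project suffices.
    environments = hook.list_environments(project_id=project_id)  # assumed signature
    environment_id = single_match(environments, environment_name)["id"]

    # Step 3: narrow jobs by project, environment, and (following this thread) name.
    jobs = hook.list_jobs(
        project_id=project_id,
        environment_id=environment_id,
        name_contains=job_name,                                   # assumed signature
    )
    return single_match(jobs, job_name)["id"]

def single_match(resources: list[dict], name: str) -> dict:
    """Pick exactly one resource with an exact name match, else raise."""
    matches = [r for r in resources if r["name"] == name]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match for {name!r}, got {len(matches)}")
    return matches[0]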

@ginone force-pushed the optional-dbt-cloud-job-operator-params branch from c93f3fe to 4ac4e96 on January 16, 2025
@jaklan left a comment

Apart from the very minor comments below, I'm thinking about one more approach.

Replace:

        self.job_id = job_id
        self.project_name = project_name
        self.environment_name = environment_name
        self.job_name = job_name

with*:

        self.job = job
        self.project = project
        self.environment = environment

where all these parameters can be either string or int.

  • if job is an int - that's all we need
  • if it's a string - we need project and environment
  • if any of these two is an int - it's passed directly to get_job_by_name as a resource ID
  • if any of these two is a string - the resource ID is retrieved by calling get_<project | environment>_by_name

This way we allow for a more flexible setup (e.g. job by name, environment and project by id). Of course, instead of checking the type, we could keep separate explicit parameters (i.e. have <resource>_id and <resource>_name params for all three resources), but then we would need to either:

  • add logic for mutually exclusive params
  • document that <resource>_id always has higher priority than <resource>_name

*for backward compatibility we would need to keep the job_id parameter, but we could mark it as deprecated
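
A rough sketch of that str-or-int dispatch (hypothetical function and helper names, mirroring the get_<resource>_by_name naming proposed above; Python 3.10+ union syntax):

def resolve_job(hook, job: str | int, project: str | int | None = None,
                environment: str | int | None = None) -> int:
    # An int job is already the job ID; nothing else is needed.
    if isinstance(job, int):
        return job
    if project is None or environment is None:
        raise ValueError("project and environment are required when job is a name")
    # Ints pass through as IDs; strings are resolved name -> ID first.
    project_id = project if isinstance(project, int) else hook.get_project_by_name(project)
    environment_id = (
        environment if isinstance(environment, int)
        else hook.get_environment_by_name(project_id, environment)
    )
    return hook.get_job_by_name(project_id, environment_id, job)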

@ginone force-pushed the optional-dbt-cloud-job-operator-params branch from 4ac4e96 to ea9230c on January 17, 2025
@joellabes (Contributor) commented

> could you have a look and tell us if you are fine with the approach?

I'm sorry that it's necessary (and have linked this internally for us to work on a better strategy!), but I think it's a sensible workaround for now, and it's a pattern we've seen customers use before 👍

@ginone force-pushed the optional-dbt-cloud-job-operator-params branch from ea9230c to 167523d on January 27, 2025
@ginone (Contributor, Author) commented Jan 27, 2025

@josh-fell @Lee-W thank you for the code review, I've addressed all feedback. Could you please look again?

BTW, what do you think about @jaklan's suggestion (#45634 (review))? My thoughts:

  1. This (implicit argument types) feels wrong/confusing:

     self.job: str | int,
     self.project: str | int,
     self.environment: str | int,

     I think the params should be explicit, so I would go with:

     self.job_id: int,
     self.project_id: int,
     self.environment_id: int,
     self.job_name: str,
     self.project_name: str,
     self.environment_name: str,

  2. I see this as an enhancement that could allow more flexible setups, but I would prefer to release this PR "as is" and unblock using project_name, environment_name, and job_name ASAP. I would create a new PR with the suggested enhancements.
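
For illustration, a minimal sketch of that explicit-parameter variant, including the mutual-exclusivity check mentioned in the review above (a hypothetical constructor, not the merged code):

def __init__(
    self,
    *,
    job_id: int | None = None,
    job_name: str | None = None,
    project_name: str | None = None,
    environment_name: str | None = None,
    **kwargs,
):
    # Exactly one of job_id / job_name must be given.
    if (job_id is None) == (job_name is None):
        raise ValueError("provide exactly one of job_id or job_name")
    # A job name is only unique within a project and environment.
    if job_name is not None and (project_name is None or environment_name is None):
        raise ValueError("project_name and environment_name are required with job_name")
    super().__init__(**kwargs)
    self.job_id = job_id
    self.job_name = job_name
    self.project_name = project_name
    self.environment_name = environment_name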

@ginone requested review from Lee-W and josh-fell on January 27, 2025
@ginone force-pushed the optional-dbt-cloud-job-operator-params branch 2 times, most recently from e77dff0 to 2457ee9 on January 28, 2025
@ginone (Contributor, Author) commented Jan 28, 2025


New PR that includes the changes described above: #46184

Feel free to select just one of my PRs and close the other one.

@ginone force-pushed the optional-dbt-cloud-job-operator-params branch from 2457ee9 to f41e5b3 on January 30, 2025
@Lee-W (Member) commented Feb 3, 2025

> @josh-fell @Lee-W thank you for the code review, I've addressed all feedback. Could you please look again?
>
> BTW, what do you think about @jaklan's suggestion (#45634 (review))? My thoughts:
>
> 1. This (implicit argument types) feels wrong/confusing:
>
>    self.job: str | int,
>    self.project: str | int,
>    self.environment: str | int,
>
>    I think the params should be explicit, so I would go with:
>
>    self.job_id: int,
>    self.project_id: int,
>    self.environment_id: int,
>    self.job_name: str,
>    self.project_name: str,
>    self.environment_name: str,
>
> 2. I see this as an enhancement that could allow more flexible setups, but I would prefer to release this PR "as is" and unblock using `project_name`, `environment_name`, and `job_name` ASAP. I would create a new PR with the suggested enhancements.

Hi, sorry for the late reply. I prefer this idea more. It would be great if you could rebase from the main branch and resolve the conflict, then I'll review it again. Thanks a lot!

@ginone force-pushed the optional-dbt-cloud-job-operator-params branch from f41e5b3 to c4d8752 on February 4, 2025
@ginone (Contributor, Author) commented Feb 4, 2025

@Lee-W rebasing is done, please have a look again. Thanks!

@Lee-W (Member) left a comment

Looks good to me. Thanks for your prompt reply. I'll keep it open for one to two days so others can take a look.

@ginone force-pushed the optional-dbt-cloud-job-operator-params branch from c4d8752 to 670aa4f on February 4, 2025
@ginone (Contributor, Author) commented Feb 4, 2025

> Looks good to me. Thanks for your prompt reply. I'll keep it open for one to two days so others can take a look.

Fixed one incorrect "include" path in the docs. Checks should be all green now.

Thanks!

@potiuk merged commit ca77120 into apache:main on Feb 4, 2025 (61 checks passed)
boring-cyborg bot commented Feb 4, 2025

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

niklasr22 pushed a commit to niklasr22/airflow that referenced this pull request on Feb 8, 2025:
* Update DbtCloudRunJobOperator and corresponding tests

* update documentation

* Fixes after manual tests

* add unit test for duplicate names

* Enhance DbtCloudHook to support name filtering in list_environments and list_jobs methods

* Improve clarity by renaming variables and enhancing comments

* Refactor dbt Cloud hook and operator error handling & update method signatures to use keyword-only arguments

* Correct get_job_by_name docstring

* Fix include path for dbt Cloud documentation

---------

Co-authored-by: Marcin Sitnik <marcin.sitnik@roche.com>
ambika-garg pushed a commit to ambika-garg/airflow that referenced this pull request on Feb 17, 2025, with the same commit message as above.
@ginone deleted the optional-dbt-cloud-job-operator-params branch on July 30, 2025