Skip to content

Conversation

@MaksYermak
Copy link
Contributor

In this PR I have added retry logic for KubernetesCreateResourceOperator and KubernetesJobOperator.

This logic is needed for preventing 'No agent available' error. The error appears time to time when users try to create a Resource or Job. This issue is inside Kubernetes and in the current moment has no solution. Like a temporary solution we decided to retry Job or Resource creation request each time when this error appears.

Link for the same issue for cert-manager service: cert-manager/cert-manager#6457


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels Apr 23, 2024
@raphaelauv
Copy link
Contributor

why don't you use the internal retry parameter of airflow ?

@MaksYermak
Copy link
Contributor Author

why don't you use the internal retry parameter of airflow ?

I use the same approach what we use for retry Pod creation: https://github.com/apache/airflow/blob/main/airflow/providers/cncf/kubernetes/utils/pod_manager.py#L347C1-L356C1
Also, I didn't see any other approaches in the cncf package for retry functionality in this case I decided to use the same. Could you please share with me the example of Airflow's code for this internal retry parameter logic?

@raphaelauv
Copy link
Contributor

raphaelauv commented Apr 23, 2024

I was thinking about the BaseOperator argument retries

    PythonOperator(
        task_id="aa",
        retries=3,
        python_callable=toto,
    )

@raphaelauv
Copy link
Contributor

related : #15137

@vincbeck
Copy link
Contributor

I was thinking about the BasOperator argument retries

    PythonOperator(
        task_id="aa",
        retries=3,
        python_callable=toto,
    )

This is an option for users. If a user wants to retry a specific task, they can use this parameter. Here, if I understand correctly, @MaksYermak wants to retry without the user being aware or needing to do something.

Copy link
Contributor

@vincbeck vincbeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add tests to cover these retries?

@MaksYermak MaksYermak force-pushed the add-retry-k8s-create-resource branch from f265b2a to 798a60d Compare May 7, 2024 10:30
@MaksYermak
Copy link
Contributor Author

Could you add tests to cover these retries?

Sure, I have added a unit tests.

@MaksYermak MaksYermak force-pushed the add-retry-k8s-create-resource branch from 798a60d to 1941e94 Compare May 7, 2024 11:17
@MaksYermak MaksYermak force-pushed the add-retry-k8s-create-resource branch from 1941e94 to 8e07ee0 Compare May 7, 2024 12:33
@VladaZakharova
Copy link
Contributor

Hi @raphaelauv @dirrao @vincbeck !
Can you please check the changes here again? Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants