Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Airflow Kafka Provider "commit_cadence" Not Working as Expected #34213

Open
1 of 2 tasks
ahipp13 opened this issue Sep 8, 2023 · 7 comments
Open
1 of 2 tasks

Airflow Kafka Provider "commit_cadence" Not Working as Expected #34213

ahipp13 opened this issue Sep 8, 2023 · 7 comments

Comments

@ahipp13
Copy link

ahipp13 commented Sep 8, 2023

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

When running the Airflow Kafka Provider Operator "ConsumeFromTopicOperator", I had one of my runs fail. Naturally since I have the "commit_cadence" option set to "end_of_operator", I was expecting to have duplicate records since it should have not commit the offset because the operator failed. Well the day ended and my counts were off, and when I looked in my DB I found that during the time it failed is when it missed the messages. So when the DAG run failed the offset was for some reason still committed even though I had set it to "end_of_operator".

What you think should happen instead

Based on your description, the offset should not get committed until the operator has completed successfully. If the DAG fails, it should go back to the offset the operator started on.

How to reproduce

Run the Kafka Provider on a topic and mid DAG run fail it, and see if it goes back and gets the messages it missed. The connection information I used is:

{
"bootstrap.servers": SERVERS,
"group.id": GROUPID,
"auto.offset.reset": "earliest",
"security.protocol": "SSL",
"ssl.ca.location": "CA",
"ssl.certificate.location": "CERT",
"ssl.key.location": "KEY",
"ssl.key.password": "PW"
}

Operating System

PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"

Versions of Apache Airflow Providers

apache-airflow-providers-apache-kafka==1.1.2

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else

Looking through the Confluent Kafka Documentation, I suspect what is happening here is because for Confluent's consumers they have an option "enable.auto.commit" that defaults to true, and it commits the offset every 5 seconds (https://docs.confluent.io/platform/current/clients/consumer.html#id1). When I turned this option to false, it worked as expected and I was getting duplicate messages on fails.

I don't really know what the expected behavior here is, but either 1) the code should be changed to turn this option off in the source code or 2) the documentation should specifically say that you need to turn this option to false in order for the commit_cadence option to work.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@ahipp13 ahipp13 added area:core kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Sep 8, 2023
@Taragolis
Copy link
Contributor

Taragolis commented Sep 8, 2023

I don't really know what the expected behavior here is, but either

  1. the code should be changed to turn this option off in the source code

Do you know which part of apache-airflow-providers-apache-kafka should be changed?

or 2) the documentation should specifically say that you need to turn this option to false in order for the commit_cadence option to work.

Seems like you found solution and exactly know what should be done in case of usage Confluence Kafka. So maybe you could contribute this part into provider documentation? It could be easily done by click on Suggest a change on this page in https://airflow.apache.org/docs/apache-airflow-providers-apache-kafka/stable/operators/index.html ?

@Taragolis Taragolis added kind:documentation good first issue and removed needs-triage label for new issues that we didn't triage yet labels Sep 8, 2023
@AmirAflak
Copy link

would like to take this :)

@ahipp13
Copy link
Author

ahipp13 commented Sep 8, 2023

@Taragolis I do not know what part should be changed, but I feel as if something should. Basically while the "enable.auto.commit" option is on, it doesn't really matter what you put in the "commit_cadence" option in the operator, because its going to commit the offset every 5 seconds by default.
To me it feels like there needs to be another option for the operator to specify whether you want to auto commit or not and the interval, and if you don't want to auto commit then you use the "commit_cadence" option.
I would help but am too busy currently, I just wanted to alert this to everybody incase somebody else had the same discovery. I am sure the smart people on here can come up with a good solution :)

@Taragolis
Copy link
Contributor

Or maybe solution already exists and you could provide required parameters to Consumer thought connection?
https://airflow.apache.org/docs/apache-airflow-providers-apache-kafka/stable/connections/kafka.html#configuring-the-connection

@AmirAflak
Copy link

Or maybe solution already exists and you could provide required parameters to Consumer thought connection? https://airflow.apache.org/docs/apache-airflow-providers-apache-kafka/stable/connections/kafka.html#configuring-the-connection
@ahipp13
Agreed, in this case you have to specify "enable.auto.commit": False in the extra field of Connection.
have a look for potential examples :
https://github.com/search?q=repo%3Aapache%2Fairflow+enable.auto.commit&type=code

@AmirAflak
Copy link

@Taragolis but if user did not specify that, "enable.auto.commit" would be on by default and in this case commit_cadence selection would be redundant.

@wheelsapk
Copy link

Visit the SASSA website and log into your account if you have Sassa status. You can usually check the status of your benefits and claims online. https://sassacheckup.co.za/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants