
Conversation

@sjyangkevin
Contributor

Motivation

As described in #49651, when the access control for a DAG is set to a non-existent role, the DAG import error shows up in the Airflow UI for a while and then disappears. This change fixes the issue so that the import error persists in the metadata DB until the DAG is updated with a correct access control setting.

Close #49651

What is the issue

`_serialize_dag_capturing_errors` in `airflow-core/src/airflow/dag_processing/collection.py` (https://github.com/apache/airflow/blob/56fbe90a8d0b56558c02d75f4ac5852e041cb058/airflow-core/src/airflow/dag_processing/collection.py#L177)

When the DAG's access control is set to a non-existent role, the process above raises an exception "Failed to write serialized DAG dag_id=...". So how is this exception triggered?

  1. `dag_was_updated` is `True` the first time `SerializedDagModel.write_dag` writes the serialized DAG to the database.
  2. When `dag_was_updated` is `True`, `_sync_dag_perms` is triggered to sync DAG-specific permissions. At that point it detects that the role doesn't exist and raises an error, resulting in the exception.
  3. This exception is captured and logged as an import error in the DB (temporarily), and shows up in the UI.

From my understanding, this sync process runs every `MIN_SERIALIZED_DAG_UPDATE_INTERVAL`. So what happens in the second run?

  1. `dag_was_updated` is `False` since the DAG code has not changed.
  2. In this case, `_sync_dag_perms` will NOT BE TRIGGERED, even though the access control is still set incorrectly in the DAG code.
  3. Therefore, no exception is raised and no import error is logged, so the import error is removed from the DB, as well as from the UI (see the toy sketch below).
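
To make the two lists above concrete, here is a small, self-contained toy model of this gating (illustrative only, not Airflow code; `write_dag` and `sync_dag_perms` stand in for `SerializedDagModel.write_dag` and `_sync_dag_perms`):

```python
# Toy model of the pre-fix gating described above -- not the Airflow source.
import_errors = {}

def write_dag(dag, already_serialized):
    # Stand-in for SerializedDagModel.write_dag: True only when the DAG changed.
    return not already_serialized

def sync_dag_perms(dag):
    # Stand-in for _sync_dag_perms: fails when the configured role does not exist.
    raise RuntimeError(f"role {dag['access_control']} does not exist")

def serialize_dag_capturing_errors(dag, already_serialized):
    try:
        if write_dag(dag, already_serialized):
            # Only reached on the first run or after a code change.
            sync_dag_perms(dag)
        # Successful pass: any previously recorded import error is cleared.
        import_errors.pop(dag["fileloc"], None)
    except Exception as err:
        import_errors[dag["fileloc"]] = f"Failed to write serialized DAG: {err}"

dag = {"fileloc": "dags/example.py", "access_control": "non_existing_role"}
serialize_dag_capturing_errors(dag, already_serialized=False)
print(import_errors)  # first pass: the error is recorded
serialize_dag_capturing_errors(dag, already_serialized=True)
print(import_errors)  # second pass: the error is cleared, although the DAG is still broken
```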

What is the fix

In the current state, `_sync_dag_perms` runs only when the DAG is updated (i.e., `dag_was_updated` is `True`). That is more performant because it doesn't run for every DAG, but it cannot properly handle the permission sync. Therefore, the current fix makes `_sync_dag_perms` run for all DAGs during the DAG sync process (see the sketch below). I understand it might not be an ideal fix, but I wasn't able to find a better solution due to my limited understanding of the code. I would really appreciate it if anyone could suggest ideas to improve it further.
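
In terms of the toy model above, the fix corresponds to calling the permission sync unconditionally, so the error keeps being recorded until the DAG code is corrected (again an illustrative sketch rather than the exact diff of this PR):

```python
# Post-fix version of the toy function above: the permission sync is no longer
# gated on "was the serialized DAG rewritten".
def serialize_dag_capturing_errors(dag, already_serialized):
    try:
        write_dag(dag, already_serialized)
        sync_dag_perms(dag)  # now runs for every DAG on every sync pass
        import_errors.pop(dag["fileloc"], None)
    except Exception as err:
        import_errors[dag["fileloc"]] = f"Failed to write serialized DAG: {err}"
```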



@eladkal eladkal added this to the Airflow 3.0.3 milestone Jun 8, 2025
@eladkal eladkal added the type:bug-fix and backport-to-v3-1-test labels Jun 8, 2025
@sjyangkevin sjyangkevin changed the title from “Solve Certain DAG import errors ("role does not exist") don't persist in Airflow” to “Fix Certain DAG import errors ("role does not exist") don't persist in Airflow” Jun 9, 2025
@uranusjr uranusjr self-requested a review June 11, 2025 06:35
Member

@jason810496 jason810496 left a comment

Thanks for the fix! Could you add the corresponding unit test?

@sjyangkevin
Contributor Author

> Thanks for the fix! Could you add the corresponding unit test?

Thanks for the feedback. I am wondering if it’s a proper fix but will add unit tests for it.

@jason810496
Member

>> Thanks for the fix! Could you add the corresponding unit test?
>
> Thanks for the feedback. I am wondering if it’s a proper fix but will add unit tests for it.

The investigation seems reasonable to me, but it still needs a second pair of eyes to confirm.

@kaxil kaxil modified the milestones: Airflow 3.0.3, Airflow 3.0.4 Jul 2, 2025
@sjyangkevin
Contributor Author

Hi @jason810496 , I am wondering if you know how to enable FabAuthManager in the breeze environment. I found that the test `test_sync_perms_syncs_dag_specific_perms_on_update` is skipped because the auth_manager is set to SimpleAuthManager while running the test. I tried to include an environment variable in `files/airflow-breeze-config/environment_variables.env`, but it didn't work out. I also tried to run `breeze start-airflow --auth-manager FabAuthManager`. However, in the breeze environment, Airflow DAGs and `airflow config get-value core auth_manager` return the Fab auth manager, but in the test the auth manager is still SimpleAuthManager when I log it in another test that doesn't have the skip condition. Thanks.

Screenshot from 2025-07-03 00-07-10

@jason810496
Member

jason810496 commented Jul 3, 2025

> Hi @jason810496 , I am wondering if you know how to enable FabAuthManager in the breeze environment. I found that the test `test_sync_perms_syncs_dag_specific_perms_on_update` is skipped because the auth_manager is set to SimpleAuthManager while running the test. I tried to include an environment variable in `files/airflow-breeze-config/environment_variables.env`, but it didn't work out. I also tried to run `breeze start-airflow --auth-manager FabAuthManager`. However, in the breeze environment, Airflow DAGs and `airflow config get-value core auth_manager` return the Fab auth manager, but in the test the auth manager is still SimpleAuthManager when I log it in another test that doesn't have the skip condition. Thanks.
>
> Screenshot from 2025-07-03 00-07-10

How about configuring auth manager as FABAuthManager ( AIRFLOW__CORE__AUTH_MANAGER=airflow.providers.fab.auth_manager.fab_auth_manager.FabAuthManager ) in files/airflow-breeze-config/environment_variables.env ?

Customizing Breeze startup
--------------------------
When you enter the Breeze environment, automatically an environment file is sourced from
``files/airflow-breeze-config/environment_variables.env``.

@sjyangkevin
Contributor Author

> How about configuring auth manager as FABAuthManager ( AIRFLOW__CORE__AUTH_MANAGER=airflow.providers.fab.auth_manager.fab_auth_manager.FabAuthManager ) in files/airflow-breeze-config/environment_variables.env ?
>
> Customizing Breeze startup
> --------------------------
> When you enter the Breeze environment, automatically an environment file is sourced from
> ``files/airflow-breeze-config/environment_variables.env``.

Yeah, I’ve tried that, but it seems like it’s not picked up. I can double-check whether there is any typo.

@sjyangkevin
Contributor Author

I think the test still does not recognize the FabAuthManager, or we need to do something different to set it up. I will raise this question in Slack.
Screenshot from 2025-07-04 17-11-38

@potiuk
Member

potiuk commented Jul 5, 2025

answered on slack :), --keep-env-variables is the key.
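
For context on why the environment variable matters at test time: FAB-specific tests like this one are skipped based on which auth manager is configured, so the `AIRFLOW__CORE__AUTH_MANAGER` setting has to survive into the test process. A minimal, hypothetical sketch of such a guard (not the actual skip logic in the Airflow test suite):

```python
# Hypothetical sketch of an auth-manager guard for FAB-specific tests;
# the real skip logic in the Airflow test suite may be implemented differently.
import pytest

from airflow.configuration import conf

FAB_AUTH_MANAGER = "airflow.providers.fab.auth_manager.fab_auth_manager.FabAuthManager"

requires_fab_auth_manager = pytest.mark.skipif(
    conf.get("core", "auth_manager", fallback="") != FAB_AUTH_MANAGER,
    reason="requires FabAuthManager to be the configured auth manager",
)


@requires_fab_auth_manager
def test_sync_perms_syncs_dag_specific_perms_on_update():
    ...  # the actual test body lives in the Airflow test suite
```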

@sjyangkevin sjyangkevin force-pushed the issues/49651/dag-import-error-missing branch from 5cdbca5 to c8d4b27 on July 6, 2025 21:55
@sjyangkevin
Contributor Author

Hi all, I've added/updated the related test cases for this change. I also attached a DAG which can be used to test the issue. In the DAG, `non_existing_role` is an invalid (i.e., non-existing) role. It would be great if the "full tests needed" and "all versions" labels could be added to this PR as a safeguard.

The import error is shown until the erroneous access control configuration is fixed.
Screenshot from 2025-07-06 18-06-51

In the metadata store, the import_error table contains only one record for this import error.
Screenshot from 2025-07-06 18-06-27

The import error is also tracked in the dag table.
Screenshot from 2025-07-06 18-06-36

DAG Code

import pendulum

from airflow.sdk import DAG
from airflow.providers.standard.operators.empty import EmptyOperator

with DAG(
    dag_id="example_fine_grained_access",
    start_date=pendulum.datetime(2021, 2, 1, tz="UTC"),
    access_control={
        "non_existing_role": {"can_edit", "can_read", "can_delete"},
    },
):
    task = EmptyOperator(task_id="task")
    task2 = EmptyOperator(task_id="task2")

@sjyangkevin sjyangkevin requested a review from jason810496 July 6, 2025 22:14
@jason810496 jason810496 added the full tests needed and all versions labels Jul 7, 2025
@jason810496 jason810496 closed this Jul 7, 2025
@jason810496 jason810496 reopened this Jul 7, 2025
Member

@jason810496 jason810496 left a comment

Nice! Thanks for the update.

> It would be great if the "full tests needed" and "all versions" labels could be added to this PR as a safeguard.

I just added the labels and triggered the CI again by closing and reopening the PR.

@sjyangkevin sjyangkevin force-pushed the issues/49651/dag-import-error-missing branch from c8d4b27 to 7c42b56 on July 18, 2025 02:53
@sjyangkevin sjyangkevin requested a review from jason810496 July 18, 2025 03:15
Member

@jason810496 jason810496 left a comment

Nice! Thanks for the fix.

@sjyangkevin
Contributor Author

> Nice! Thanks for the fix.

Thank you for taking the time to review!

@eladkal eladkal force-pushed the issues/49651/dag-import-error-missing branch from 7c42b56 to 613017a on August 2, 2025 07:17
@eladkal
Contributor

eladkal commented Aug 2, 2025

@jason810496 is it ready to merge?

@sjyangkevin
Contributor Author

> @jason810496 is it ready to merge?

Hi @eladkal , thanks for looking at it. I think it would be great if we could have a second pair of eyes on it, as Jason mentioned in the previous discussion.

The caveat of this fix is that it runs the permission sync for ALL DAGs, regardless of whether they were updated or not. If we don’t check in this way, the current implementation will miss those permission errors when the dag processor runs again, as discussed here: #51511 (comment)

@Lee-W Lee-W self-requested a review August 6, 2025 03:20
Member

@Lee-W Lee-W left a comment

Just want to make sure the description is still consistent with the latest code; after a second read, it seems to be 🤔

Member

@Lee-W Lee-W left a comment

If we are concerned about performance issues, should we conduct some experiments to see how it affects performance? If it doesn't affect much, I think we're good to merge

@sjyangkevin
Contributor Author

> Just want to make sure the description is still consistent with the latest code; after a second read, it seems to be 🤔

Yes, I think it’s still consistent.

@sjyangkevin
Contributor Author

> If we are concerned about performance issues, should we conduct some experiments to see how it affects performance? If it doesn't affect much, I think we're good to merge

Right. I’m wondering if there is any way in breeze to profile the DAG processor’s performance. I will look into it and probably create a certain number of DAGs to see how the performance is impacted.

@sjyangkevin
Contributor Author

sjyangkevin commented Aug 8, 2025

> If we are concerned about performance issues, should we conduct some experiments to see how it affects performance? If it doesn't affect much, I think we're good to merge

I attempted to create 100 DAGs (physically created DAG files) with their access control configured with a non-existing role. Then I enabled the statsd integration when running Airflow in breeze and collected the following metrics.

I think the data isn't cleaned up between breeze down and breeze start-airflow, so I use a red line to distinguish the two periods.

image

The time period before the red line

The metrics for this period were collected when _sync_dag_perms was executed, regardless of whether a DAG was updated (after applying the change introduced in this PR). The number of times we scanned the filesystem and queued all existing DAGs increased by roughly 1,000. However, there was no significant change in either the “Number of DAG files to be considered for the next scan” (after the first scan, but frequency seems to increase) or the “Total Parse Time.”

I'm not sure whether these metrics accurately describe the performance considerations. I can continue to learn more about it, but I hope it somehow helps us understand the performance impact (or should I create more DAGs, like 1,000?).
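
For anyone who wants to reproduce the experiment, a small generator along the following lines produces N DAG files that all reference a non-existing role (hypothetical helper; the output directory and DAG count are just examples):

```python
# Hypothetical helper to generate DAG files with an invalid access_control role,
# for load-testing the dag processor. Adjust DAGS_DIR to your actual dags folder.
from pathlib import Path

DAGS_DIR = Path("files/dags")  # example path; use your environment's dags folder
N_DAGS = 100

TEMPLATE = '''\
import pendulum

from airflow.sdk import DAG
from airflow.providers.standard.operators.empty import EmptyOperator

with DAG(
    dag_id="perf_test_dag_{i}",
    start_date=pendulum.datetime(2021, 2, 1, tz="UTC"),
    access_control={{"non_existing_role": {{"can_edit", "can_read", "can_delete"}}}},
):
    EmptyOperator(task_id="task")
'''

DAGS_DIR.mkdir(parents=True, exist_ok=True)
for i in range(N_DAGS):
    (DAGS_DIR / f"perf_test_dag_{i}.py").write_text(TEMPLATE.format(i=i))
```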

@sjyangkevin sjyangkevin force-pushed the issues/49651/dag-import-error-missing branch from 613017a to 83c821b on August 8, 2025 03:09
@sjyangkevin sjyangkevin force-pushed the issues/49651/dag-import-error-missing branch from 83c821b to b0ecc2e on August 8, 2025 15:21
@eladkal eladkal modified the milestones: Airflow 3.0.4, Airflow 3.0.5 Aug 8, 2025
@jason810496 jason810496 merged commit cdab6d7 into apache:main Aug 12, 2025
180 checks passed
@github-actions

Backport failed to create: v3-0-test (see the failure log in the run details).

You can attempt to backport this manually by running:

cherry_picker cdab6d7 v3-0-test

This should apply the commit to the v3-0-test branch and leave the commit in conflict state marking
the files that need manual conflict resolution.

After you have resolved the conflicts, you can continue the backport process by running:

cherry_picker --continue

@sjyangkevin
Contributor Author

Interested in this backport issue, probably because the current main branch is targeting 3.1.0 🤔. Please let me know if I can help with anything to resolve it.

@Lee-W
Member

Lee-W commented Aug 12, 2025

> Interested in this backport issue, probably because the current main branch is targeting 3.1.0 🤔. Please let me know if I can help with anything to resolve it.

Just follow #51511 (comment) and try to resolve the conflict. Feel free to tag me if you encounter any issue :)

@sjyangkevin
Contributor Author

>> Interested in this backport issue, probably because the current main branch is targeting 3.1.0 🤔. Please let me know if I can help with anything to resolve it.
>
> Just follow #51511 (comment) and try to resolve the conflict. Feel free to tag me if you encounter any issue :)

Thanks! Will do by the end of today or tomorrow. I think I should also rebase first before following the instructions?

@Lee-W
Member

Lee-W commented Aug 12, 2025

>>> Interested in this backport issue, probably because the current main branch is targeting 3.1.0 🤔. Please let me know if I can help with anything to resolve it.
>>
>> Just follow #51511 (comment) and try to resolve the conflict. Feel free to tag me if you encounter any issue :)
>
> Thanks! Will do by the end of today or tomorrow. I think I should also rebase first before following the instructions?

git checkout origin/v3-0-test first

sjyangkevin added a commit to sjyangkevin/airflow that referenced this pull request Aug 13, 2025
…'t persist in Airflow (apache#51511)

* update _sync_dag_perms to run for all dags

* add and update test cases

* precommit fix

* refactor mark to variable for reusability
(cherry picked from commit cdab6d7)

Co-authored-by: Kevin Yang <85313829+sjyangkevin@users.noreply.github.com>
kaxil pushed a commit that referenced this pull request Aug 13, 2025
…'t persist in Airflow (#51511) (#54432)


Labels

all versions, area:DAG-processing, backport-to-v3-1-test, full tests needed, type:bug-fix

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Certain DAG import errors ("role does not exist") don't persist in Airflow

6 participants