Replies: 3 comments 7 replies
-
As far as I look at it, manual and "scheduled" runs are fundamentally different. They even have different branches in the custom timetable definition (when custom timetables are defined). I am not sure what @uranusjr, @ashb, and @malthe think about it, but while the current semantics is not 100% accurate, there is simply no semantics that is, and this one is "good enough". And maybe that is a sign we should actually split them completely and make the distinction obvious in the interface. The question about mixing "scheduled" and "manual" runs has been raised a few times in the past, and maybe we could improve it in a way that is less confusing. But unless there is a concrete proposal for how this could be improved, this is mostly an academic discussion, I am afraid.
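For context, a minimal sketch of that split, assuming Airflow 2.2+'s `Timetable` protocol (the class name and the 24-hour window are illustrative):

```python
# Sketch only: manual and scheduled runs already take different code paths
# when their data interval is computed.
from datetime import timedelta

from airflow.timetables.base import DataInterval, Timetable


class ExampleTimetable(Timetable):
    def infer_manual_data_interval(self, *, run_after):
        # Called only for manually-triggered runs: the interval is inferred
        # from the trigger time, here as the preceding 24 hours.
        return DataInterval(start=run_after - timedelta(hours=24), end=run_after)

    def next_dagrun_info(self, *, last_automated_data_interval, restriction):
        # Called only by the scheduler to plan "scheduled" runs;
        # the scheduling logic itself is omitted from this sketch.
        return None
```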
-
My perspective is that the problem is with how data_interval is just set to the most recent interval. The behavior I really want is the ability to kick off a DAG run manually and specify the start and end date. It seems like a fairly common use case with intervals: say you run daily, but then need to go back and rerun a 7-day period. That could be addressed with the backfill feature, but that isn't as flexible, e.g. if you're delivering quarterly but then need to re-deliver a one-day period.
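One workaround sketch (not a built-in feature): pass the desired window through `dag_run.conf` on a manual trigger and fall back to the scheduled data interval otherwise. The `window_start`/`window_end` keys and the `deliver.sh` script are made up for illustration:

```python
from airflow.operators.bash import BashOperator

deliver = BashOperator(
    task_id="deliver",
    bash_command=(
        # An explicit window from conf wins; scheduled runs fall back to
        # the data interval.
        "deliver.sh "
        "--start {{ (dag_run.conf or {}).get('window_start', data_interval_start.strftime('%Y-%m-%d')) }} "
        "--end {{ (dag_run.conf or {}).get('window_end', data_interval_end.strftime('%Y-%m-%d')) }}"
    ),
)
```

A manual rerun of a 7-day period would then be something like `airflow dags trigger my_dag --conf '{"window_start": "2022-01-01", "window_end": "2022-01-08"}'`.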
-
I was using `bash test.py --start {{ data_interval_start }}`, so my solution is `--date {{ (data_interval_start + macros.timedelta(hours=8)).strftime("%Y-%m-%d") }}`. But the inconsistency between "backfill" and "daily scheduling" is still not resolved. Why is it so complicated...
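In context, that template would sit in an operator's command; a sketch (the 8-hour shift matches the comment above; task and script names are illustrative):

```python
from airflow.operators.bash import BashOperator

run_test = BashOperator(
    task_id="run_test",
    bash_command=(
        # Shift data_interval_start by 8 hours (e.g. to a UTC+8 local
        # date) before formatting it for the script.
        "bash test.py --date "
        '{{ (data_interval_start + macros.timedelta(hours=8)).strftime("%Y-%m-%d") }}'
    ),
)
```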
-
Apache Airflow version
2.2.3
What happened
When triggering a DAG manually (via the web or via `airflow dags trigger`), some template params like `ds`, `ts`, and others derived from `dag_run.logical_date` will be set to the specified execution timestamp. This is inconsistent with automated runs, where those fields are set to `data_interval_start`. This behavior contradicts the documentation in a few places, and can cause tasks that depend on those template params to behave unintuitively.
What you expected to happen
I expected `ds` to always equal `data_interval_start`, quoting the docs in a few different places (emphasis mine): "DAG Runs: Data Interval" and "FAQ: What does `execution_date` mean?". However, it's worth noting that "DAGs: Running DAGs" does seem to explain this edge case.
How to reproduce
Example DAG:
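A minimal sketch of such a DAG, assuming a daily schedule (the exact DAG from the report is not preserved here, so the ids are illustrative):

```python
# Prints the template params in question so the manual-vs-scheduled
# difference shows up in the task logs.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="test_dag",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    BashOperator(
        task_id="print_dates",
        bash_command=(
            "echo ds={{ ds }} ts={{ ts }} "
            "logical_date={{ logical_date }} "
            "data_interval_start={{ data_interval_start }} "
            "data_interval_end={{ data_interval_end }}"
        ),
    )
```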
Trigger this DAG via the web or via `airflow dags trigger test_dag -e <some timestamp>`, then look at the output in the logs.
Example output for an automated run:
Example output for a manually-triggered run:
Operating System
CentOS 7.4
Versions of Apache Airflow Providers
Only the defaults.
Deployment
Other
Deployment details
Just running processes locally.
Anything else
I'm not convinced that this is just a documentation issue; the fact that `logical_date` and all derived fields can have contextually different meanings seems fundamentally broken to me. To keep my users from running into issues, I feel like I am forced to teach them either "never use `ds`/`ts`/etc." or "never trigger DAGs manually", neither of which feels great.
As far as I can tell, there is no way to manually trigger a DAG and have it behave exactly like a "normal" automated run, since `ds` will always fall outside of the data interval. Which begs the question: what does it even mean to manually trigger a DAG run when data intervals are involved? It shouldn't be able to affect the existing schedule, so the current behavior of "snapping" to the latest complete data interval makes sense to me. But for consistency, I think all `dag_run` fields (except for things like `run_id`) should follow that same behavior.
Alternatively, maybe there are two classes of DAGs: ones that operate on data intervals, and ones that operate on a single instant in time (e.g. `schedule_interval=None`). And perhaps the former should never be manually triggered and should only ever use something like `airflow dags backfill` to run specific intervals (an example invocation is sketched below). And ideally the web UI and CLI would reflect this to prevent running a DAG "the wrong way".
Admittedly I am new to Airflow, so maybe my intuitions are not correct. And I recognize that there are almost certainly some users that depend on the current behavior, so it would definitely be a pain to change. But I'm curious to hear if other people have thoughts about this or specific examples of why the current behavior is desirable.