Skip to content

Conversation

@turbaszek
Copy link
Member

@turbaszek turbaszek commented Feb 5, 2020

The PR changes numerous single selects / updates in base, scheduler, and backfill jobs to bulk operations.

Measure

To measure the number of SQL queries I required two things:

  • SQLAlchemy callbacks that log query information. This callback can be turned on / off by specifying sqlalchemy_stats in debug section of Airflow config.
@SQLALCHEMY 0.028488636016845703 |$ 'create_session':/opt/airflow/airflow/utils/session.py:32 |$ cli_action_loggers.py:default_action_log>contextlib.py:__exit__>session.py:create_session |$  INSERT INTO log (dttm,...
  • Script to run successive SchedulerJob and to generate .csv with all queries per job. The script is located in scripts/perf/sql_queries.py and it works out of the box in the breeze environment.

I am using SequentialExecutor and scripts/perf/dags/sql_perf_dag.py and local Postgres.

Result

Screenshot 2020-02-07 at 15 08 17
Screenshot 2020-02-07 at 15 08 31

On the average, the number of queries was reduced by ~14% and time fo queries by ~10.9%.

Thanks for helping to @evgenyshulman from Databand!


Issue link: AIRFLOW-6590

Make sure to mark the boxes below before creating PR: [x]

  • Description above provides context of the change
  • Commit message/PR title starts with [AIRFLOW-NNNN]. AIRFLOW-NNNN = JIRA ID*
  • Unit tests coverage for changes (not needed for documentation changes)
  • Commits follow "How to write a good git commit message"
  • Relevant documentation is updated including usage instructions.
  • I will engage committers as explained in Contribution Workflow Example.

* For document-only changes commit message can start with [AIRFLOW-XXXX].


In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.
Read the Pull Request Guidelines for more information.

@boring-cyborg boring-cyborg bot added area:CLI area:dev-tools area:Scheduler including HA (high availability) scheduler labels Feb 5, 2020
The PR changes numerous single selects / updates in base,
scheduler, and backfill jobs to bulk operations.
Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one comment to Exception raised.

@turbaszek turbaszek marked this pull request as ready for review February 7, 2020 14:33
@turbaszek turbaszek force-pushed the AIRFLOW-6590-batch-updates branch from 38f1feb to a05147f Compare February 7, 2020 15:44
@codecov-io
Copy link

codecov-io commented Feb 7, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@f3eea3e). Click here to learn what that means.
The diff coverage is 16.66%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #7370   +/-   ##
=========================================
  Coverage          ?   86.31%           
=========================================
  Files             ?      873           
  Lines             ?    40769           
  Branches          ?        0           
=========================================
  Hits              ?    35188           
  Misses            ?     5581           
  Partials          ?        0
Impacted Files Coverage Δ
airflow/cli/commands/task_command.py 72.12% <ø> (ø)
airflow/jobs/scheduler_job.py 89.35% <0%> (ø)
airflow/models/dagrun.py 96.47% <10%> (ø)
airflow/jobs/backfill_job.py 92.09% <13.33%> (ø)
airflow/jobs/base_job.py 91.6% <20%> (ø)
airflow/models/taskinstance.py 94.24% <28.57%> (ø)
airflow/utils/sqlalchemy.py 84.93% <30.76%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f3eea3e...543a91a. Read the comment docs.

@turbaszek turbaszek requested review from ashb and potiuk February 7, 2020 18:45
@ashb
Copy link
Member

ashb commented Feb 10, 2020

Nice work - taking a look now.

@potiuk
Copy link
Member

potiuk commented Feb 10, 2020

Please rebase to latest master @nuclearpinguin - we had a six drama on Travis but it's solved now.

Copy link
Member

@ashb ashb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice change!

A couple of style points etc, and one question about the change to the tests.

@turbaszek turbaszek force-pushed the AIRFLOW-6590-batch-updates branch from f831a0e to fb8500b Compare February 11, 2020 15:13
@turbaszek turbaszek force-pushed the AIRFLOW-6590-batch-updates branch from fb8500b to 543a91a Compare February 11, 2020 15:33
@turbaszek turbaszek requested a review from ashb February 11, 2020 15:34
@turbaszek turbaszek merged commit fb00c68 into apache:master Feb 14, 2020
# them in the executor.
simple_task_instances = [SimpleTaskInstance(ti) for ti in
tis_to_set_to_queued]
simple_task_instances = [SimpleTaskInstance(ti) for ti in tis_to_set_to_queued]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nuclearpinguin I believe that this should be before session.commit().

Session.commit() should flush the session, and then sqlalchemy will need to requery each TI individually to reconstruct the STI

galuszkak pushed a commit to FlyrInc/apache-airflow that referenced this pull request Mar 5, 2020
* [AIRFLOW-6590] Use batch db operations in jobs

The PR changes numerous single selects / updates in base,
scheduler, and backfill jobs to bulk operations.

* fixup! [AIRFLOW-6590] Use batch db operations in jobs

* fixup! fixup! [AIRFLOW-6590] Use batch db operations in jobs
@turbaszek turbaszek deleted the AIRFLOW-6590-batch-updates branch March 14, 2020 23:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:CLI area:dev-tools area:performance area:Scheduler including HA (high availability) scheduler

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants