Commit cfec563
Avoid scheduler/parser manager deadlock by using non-blocking IO
There have been long-standing issues where the scheduler would "stop
responding" that we haven't been able to track down.
Someone was able to catch the scheduler in this state in 2.0.1 and
inspect it with py-spy (thanks, MatthewRBruce!).
The stack traces (slightly shortened) were:
```
Process 6: /usr/local/bin/python /usr/local/bin/airflow scheduler
Python v3.8.7 (/usr/local/bin/python3.8)
Thread 0x7FF5C09C8740 (active): "MainThread"
_send (multiprocessing/connection.py:368)
_send_bytes (multiprocessing/connection.py:411)
send (multiprocessing/connection.py:206)
send_callback_to_execute (airflow/utils/dag_processing.py:283)
_send_dag_callbacks_to_processor (airflow/jobs/scheduler_job.py:1795)
_schedule_dag_run (airflow/jobs/scheduler_job.py:1762)
Process 77: airflow scheduler -- DagFileProcessorManager
Python v3.8.7 (/usr/local/bin/python3.8)
Thread 0x7FF5C09C8740 (active): "MainThread"
_send (multiprocessing/connection.py:368)
_send_bytes (multiprocessing/connection.py:405)
send (multiprocessing/connection.py:206)
_run_parsing_loop (airflow/utils/dag_processing.py:698)
start (airflow/utils/dag_processing.py:596)
```
What this shows is that both processes are stuck trying to send data to
each other. Neither can proceed because both send buffers are full, and
since both are blocked in send, neither side is going to read and make
more space in the buffer. A classic deadlock!
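The "buffer is full" condition can be reproduced in isolation. The following is a minimal sketch, not Airflow code: it assumes a POSIX system, and the helper name `count_sends_until_full` is made up for illustration. It puts a `multiprocessing` connection into non-blocking mode so that a full buffer surfaces as `BlockingIOError` rather than as a hang, which is where a blocking `send()` would sit forever if the peer never reads.

```python
import os
from multiprocessing import Pipe


def count_sends_until_full(conn, payload: bytes = b"x" * 1024,
                           limit: int = 1_000_000) -> int:
    """Return how many payloads fit before the send buffer fills up.

    Hypothetical helper for illustration only. The connection is switched
    to non-blocking mode so a full buffer raises BlockingIOError instead
    of hanging.
    """
    os.set_blocking(conn.fileno(), False)
    sent = 0
    while sent < limit:
        try:
            conn.send_bytes(payload)
        except BlockingIOError:
            break  # buffer full: a blocking send would be stuck here
        sent += 1
    return sent


if __name__ == "__main__":
    # Neither end ever calls recv(), mimicking two peers that only send.
    left, right = Pipe()
    print("left buffer full after", count_sends_until_full(left), "sends")
    print("right buffer full after", count_sends_until_full(right), "sends")
    left.close()
    right.close()
```

The exact counts depend on the operating system's socket buffer sizes, so no particular number should be expected.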
The fix for this is twofold:
1) Enable non-blocking IO on the DagFileProcessorManager side.
The only thing the Manager sends back up the pipe (as of 2.0) is
the DagParsingStat object, and the scheduler will happily continue
without receiving these, so when a send would block, it is
simply better to ignore the error, continue the loop, and try sending
one again later.
2) Reduce the size of DagParsingStat.
In the case of a large number of dag files we included the path for
each and every one (in full) in _each_ parsing stat. Not only did the
scheduler do nothing with this field, meaning the object was larger than
it needed to be, but making it such a large object also increased the
likelihood of hitting this send-buffer-full deadlock case!
1 parent 6e99ae0 commit cfec563
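Both parts of the fix can be sketched outside Airflow. The example below is illustrative only and assumes a POSIX system. `FullStat`, `SlimStat`, `try_send`, and the sample paths are hypothetical names invented here, not the actual `DagParsingStat` definition or the code in this commit. It compares the pickled size of a stat that carries every file path against one that carries only a count, then sends the small one over a non-blocking connection and drops it if the buffer is full. The sketch does not deal with a message that was only partially written before the error.

```python
import os
import pickle
from multiprocessing import Pipe
from typing import List, NamedTuple


class FullStat(NamedTuple):
    """Hypothetical 'before' shape: carries every DAG file path."""
    file_paths: List[str]
    done: bool
    all_files_processed: bool


class SlimStat(NamedTuple):
    """Hypothetical 'after' shape: only a count of paths."""
    num_file_paths: int
    done: bool
    all_files_processed: bool


def try_send(conn, obj) -> bool:
    """Send obj on a non-blocking connection; return False if it would block."""
    try:
        conn.send(obj)
        return True
    except BlockingIOError:
        return False  # buffer full: skip this stat and retry on a later loop


def demo() -> None:
    paths = [f"/opt/airflow/dags/team_{i}/pipeline_{i}.py" for i in range(1000)]
    full = FullStat(paths, False, False)
    slim = SlimStat(len(paths), False, False)
    print("pickled size with paths:", len(pickle.dumps(full)))
    print("pickled size with count:", len(pickle.dumps(slim)))

    manager_conn, scheduler_conn = Pipe()
    os.set_blocking(manager_conn.fileno(), False)
    print("sent:", try_send(manager_conn, slim))
    print("received:", scheduler_conn.recv())
    manager_conn.close()
    scheduler_conn.close()


if __name__ == "__main__":
    demo()
```

Dropping a stat is acceptable here only because, as the commit message notes, the scheduler carries on without receiving these objects. The same approach would not suit messages that must be delivered.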
File tree
3 files changed
+107 -6 lines changed
- airflow/utils
- tests/utils