Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scheduler uses [scheduler]job_heartbeat_sec property for heartbeats instead of [scheduler]scheduler_heartbeat_sec #37971

Closed
1 of 2 tasks
kosteev opened this issue Mar 7, 2024 · 16 comments · Fixed by #37992
Closed
1 of 2 tasks
Labels
affected_version:2.8 Issues Reported for 2.8 area:core good first issue kind:bug This is a clearly a bug

Comments

@kosteev
Copy link
Contributor

kosteev commented Mar 7, 2024

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.6.3

What happened?

Scheduler uses [scheduler]job_heartbeat_sec property for heartbeats instead of [scheduler]scheduler_heartbeat_sec.
In Airflow 2.5.3 [scheduler]scheduler_heartbeat_sec is used correctly.

The bug was introduced after refactoring of scheduler_job -> scheduler_job_runner.
This field is never used in the code:

heartrate: int = conf.getint("scheduler", "SCHEDULER_HEARTBEAT_SEC")

What you think should happen instead?

No response

How to reproduce

  • run Airflow
  • set [scheduler]job_heartbeat_sec with the value greater than [scheduler]scheduler_health_check_threshold
  • see warning message in Airflow UI that scheduler is not alive

Operating System

Ubuntu

Versions of Apache Airflow Providers

No response

Deployment

Google Cloud Composer

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@kosteev kosteev added area:core kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Mar 7, 2024
@Taragolis Taragolis added affected_version:2.8 Issues Reported for 2.8 and removed needs-triage label for new issues that we didn't triage yet labels Mar 7, 2024
@potiuk
Copy link
Member

potiuk commented Mar 7, 2024

Are you sure @kosteev ?

From what I see it is used n Job's perform_heartbeat method::

def perform_heartbeat(
    job: Job, heartbeat_callback: Callable[[Session], None], only_if_necessary: bool
) -> None:
    """
    Perform heartbeat for the Job passed to it,optionally checking if it is necessary.

    :param job: job to perform heartbeat for
    :param heartbeat_callback: callback to run by the heartbeat
    :param only_if_necessary: only heartbeat if it is necessary (i.e. if there are things to run for
        triggerer for example)
    """
    seconds_remaining: float = 0.0
    if job.latest_heartbeat and job.heartrate: # <--- HERE THE HEARTRATE IS USED FROM JOB (and one line down)
        seconds_remaining = job.heartrate - (timezone.utcnow() - job.latest_heartbeat).total_seconds()
    if seconds_remaining > 0 and only_if_necessary:
        return
    job.heartbeat(heartbeat_callback=heartbeat_callback)

I believe there were some fixes after 2.6.3 that fixed bad threshold calculations for some jobs (It was confusing whether the job used 2.1* multiplier or whether you had fixed value from the configuration) but I'd expect heartrate was not affected - maybe threshold is (or rather was) calculated wrongly.

Also this scenario:

1. run Airflow
2. set [scheduler]job_heartbeat_sec with the value greater than [scheduler]scheduler_health_check_threshold
3. see warning message in Airflow UI that scheduler is not alive

is expected. If heartbeat sec is greater than threshold, then yes you'd expect to see the warning (the warning is issued when the heartbeat does not arrive in the threshold time - which would be exactly what is happening if those values are set as described).

Mybe the reproduction scenario is wrong? Could you please clarify ?

@kosteev
Copy link
Contributor Author

kosteev commented Mar 8, 2024

In perform_heartbeat, job.heartrate is used.
The issue is that for scheduler it is defined in SchedulerJobRunner (previously was SchedulerJob):

heartrate: int = conf.getint("scheduler", "SCHEDULER_HEARTBEAT_SEC")

so, now this heartrate: int = conf.getint("scheduler", "SCHEDULER_HEARTBEAT_SEC") from SchedulerJobRunner is not used anywhere

@kosteev
Copy link
Contributor Author

kosteev commented Mar 8, 2024

About reproduction scenario:
"If heartbeat sec is greater than threshold, then yes you'd expect to see the warning" - yes, but in the scenario you modify [scheduler]job_heartbeat_sec with property which shouldn't have to do anything with scheduler, as scheduler heartbeat should be controlled by [scheduler]scheduler_heartbeat_sec

@potiuk
Copy link
Member

potiuk commented Mar 8, 2024

so, now this heartrate: int = conf.getint("scheduler", "SCHEDULER_HEARTBEAT_SEC") from SchedulerJobRunner is not used anywhere

Indeed.

"If heartbeat sec is greater than threshold, then yes you'd expect to see the warning" - yes, but in the scenario you modify [scheduler]job_heartbeat_sec with property which shouldn't have to do anything with scheduler, as scheduler heartbeat should be controlled by [scheduler]scheduler_heartbeat_sec

Ah 🤦 -> indeed missed that it is about the other config.

Would you like to fix it maybe @kosteev ? That seems like a good contribution and simple fix to make ?

@kosteev
Copy link
Contributor Author

kosteev commented Mar 8, 2024

I will try to prepare a fix, but I do not want to commit to this, so that if anyone wants to fix it.
I will try to work on it today, early next week.

potiuk added a commit to potiuk/airflow that referenced this issue Mar 8, 2024
Sinc apache#34026 scheduler heartrate has not been properly calculated.
We missed the check for SchedulerJob type and setting heartrate
value from `scheduler_health_check_threshold`.

This PR fixes it.

Fix: apache#37971
@potiuk
Copy link
Member

potiuk commented Mar 8, 2024

I actualy took a quick look and it turned out that it was removed after adding SchedulerJobRunner. The SchedulerJob check has been removed by #34026 a bit later.

I prepared a fix with test in #37992

@potiuk
Copy link
Member

potiuk commented Mar 8, 2024

Actually not even there

UPDATE: Nope - it was there in #34026 - it just missed the SchedulerJob if with _heartrate when the if was moved from _is_alive ... I missed it during review :(

@potiuk
Copy link
Member

potiuk commented Mar 8, 2024

@kosteev - are you sure it's affecting 2.6.3 Airflow ? from what I traced #34026 has been only part of v2-8 branch. Maybe there was another reason before in 2.6.3 - but quite certainly this one was only introduced in 2.8.0

@potiuk
Copy link
Member

potiuk commented Mar 8, 2024

The 2.6.3 version is:

    def is_alive(self, grace_multiplier=2.1):
        """
        Is this job currently alive.

        We define alive as in a state of RUNNING, and having sent a heartbeat
        within a multiple of the heartrate (default of 2.1)

        :param grace_multiplier: multiplier of heartrate to require heart beat
            within
        """
        if self.job_type == "SchedulerJob":
            health_check_threshold: int = conf.getint("scheduler", "scheduler_health_check_threshold")
        else:
            health_check_threshold: int = self.heartrate * grace_multiplier
        return (
            self.state == State.RUNNING
            and (timezone.utcnow() - self.latest_heartbeat).total_seconds() < health_check_threshold
        )

So almost for sure it wouldn't have caused the problem.

(BTW. I am trying to see if we can add the fix to 2.8.3 that we vote on currently - because otherwise it would affect the whole 2.8 line)

@potiuk
Copy link
Member

potiuk commented Mar 8, 2024

Ahh.. those parameters are confusing and I am mixing them ... it is indeed way older sorry (threshold mixed with secs)

potiuk added a commit to potiuk/airflow that referenced this issue Mar 8, 2024
Sinc apache#30255 scheduler heartrate has not been properly calculated.
We missed the check for SchedulerJob type and setting heartrate
value from `scheduler_health_check_threshold`.

This PR fixes it.

Fix: apache#37971
@potiuk
Copy link
Member

potiuk commented Mar 8, 2024

Tracked it down (there were quite a few refactors) and indeed it was #30255 that introduced it

@potiuk
Copy link
Member

potiuk commented Mar 8, 2024

So it's there since 2.6.0

@potiuk
Copy link
Member

potiuk commented Mar 8, 2024

(just wanted to make sure how far back it goes)

@kosteev
Copy link
Contributor Author

kosteev commented Mar 8, 2024

I prepared a fix like this #37993

But I see you have already fix (and implemented differently). Should I close my PR?

potiuk added a commit to potiuk/airflow that referenced this issue Mar 8, 2024
Sinc apache#30255 scheduler heartrate has not been properly calculated.
We missed the check for SchedulerJob type and setting heartrate
value from `scheduler_health_check_threshold`.

This PR fixes it.

Fix: apache#37971
@potiuk
Copy link
Member

potiuk commented Mar 8, 2024

Yeah - I think the way where it is closer to heartrate property is a bit better and easier to reason about.

@potiuk
Copy link
Member

potiuk commented Mar 8, 2024

We have a bit of a mess there - because the JOB (scheduler/job_heartrabeat_sec) has been used and abused by different jobs (not only scheduler one), so likely better to keep it in JOB.

potiuk added a commit that referenced this issue Mar 8, 2024
Sinc #30255 scheduler heartrate has not been properly calculated.
We missed the check for SchedulerJob type and setting heartrate
value from `scheduler_health_check_threshold`.

This PR fixes it.

Fix: #37971
howardyoo pushed a commit to howardyoo/airflow that referenced this issue Mar 18, 2024
…7992)

Sinc apache#30255 scheduler heartrate has not been properly calculated.
We missed the check for SchedulerJob type and setting heartrate
value from `scheduler_health_check_threshold`.

This PR fixes it.

Fix: apache#37971
jedcunningham pushed a commit that referenced this issue Mar 18, 2024
Sinc #30255 scheduler heartrate has not been properly calculated.
We missed the check for SchedulerJob type and setting heartrate
value from `scheduler_health_check_threshold`.

This PR fixes it.

Fix: #37971
(cherry picked from commit 01e40ab)
howardyoo pushed a commit to howardyoo/airflow that referenced this issue Mar 31, 2024
…7992)

Sinc apache#30255 scheduler heartrate has not been properly calculated.
We missed the check for SchedulerJob type and setting heartrate
value from `scheduler_health_check_threshold`.

This PR fixes it.

Fix: apache#37971
utkarsharma2 pushed a commit to astronomer/airflow that referenced this issue Apr 22, 2024
…7992)

Sinc apache#30255 scheduler heartrate has not been properly calculated.
We missed the check for SchedulerJob type and setting heartrate
value from `scheduler_health_check_threshold`.

This PR fixes it.

Fix: apache#37971
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affected_version:2.8 Issues Reported for 2.8 area:core good first issue kind:bug This is a clearly a bug
Projects
None yet
3 participants