Skip to content

Magento discards most New Relic cron transactions #34108

Closed
@convenient

Description

@convenient

Summary

Magento is discarding all new relic transaction logging/runtime information/errors for every cron job after the first one executed per process.

For example if you had three pending jobs job1, job2, and job3 you would only have a transaction for job1 and no info at all in new relic for job2 and job3.

This is a critical issue IMO as it reduces the effectiveness of our ability to support the Magento instance. I assumed I was getting all relevant data in new relic but I am actually missing performance analytics as well as possible errors that are not being tracked in new relic. This affected me debugging downtime issues affecting a website.

Cause

This issue was introduced in 2.4.0 by 5819ded

In that commit there were attempts to

  1. stop logging the cron:run command as a command (the isCommandSkipped logic)
  2. set the transaction name for each pending job processed in the queue.

The current implementation works a bit like so, starting at Magento\Cron\Observer\ProcessCronQueueObserver::execute

$this->processPendingJobs($groupId, $jobsRoot, $currentTime);

calls

$this->tryRunJob($scheduledTime, $currentTime, $jobConfig, $schedule, $groupId);

calls

$this->_runJob($scheduledTime, $currentTime, $jobConfig, $schedule, $groupId);

This is the part of the process that actually executes the job, it does so like this

$this->startProfiling($jobCode);
$this->eventManager->dispatch('cron_job_run', ['job_name' => 'cron/' . $groupId . '/' . $jobCode]);
try {
$this->logger->info(sprintf('Cron Job %s is run', $jobCode));
//phpcs:ignore Magento2.Functions.DiscouragedFunction
call_user_func_array($callback, [$schedule]);
} catch (\Throwable $e) {
$schedule->setStatus(Schedule::STATUS_ERROR);
$this->logger->error(
sprintf(
'Cron Job %s has an error: %s. Statistics: %s',
$jobCode,
$e->getMessage(),
$this->getProfilingStat($jobCode)
)
);
if (!$e instanceof \Exception) {
$e = new \RuntimeException(
'Error when running a cron job',
0,
$e
);
}
throw $e;
} finally {
$this->stopProfiling($jobCode);
}

You can see that each execution of a job is wrapped with startProfiling and stopProfiling.

These functions have plugins in them to set newrelic_name_transaction and newrelic_end_transaction

public function beforeStart(Stat $schedule, ...$args): array
{
$timerName = current($args);
if ($this->isCronJob($timerName)) {
$this->newRelicWrapper->setTransactionName(
sprintf('Cron %s', $timerName)
);
}
return $args;
}

public function beforeStop(Stat $schedule, ...$args): array
{
$timerName = current($args);
if ($this->isCronJob($timerName)) {
$this->newRelicWrapper->endTransaction();
}
return $args;
}

The issue is that we call newrelic_end_transaction but nowhere in the codebase is there another call to newrelic_start_transaction, this is required to track subsequent jobs.

For task queue managers, you can use newrelic_end_transaction() and newrelic_start_transaction() together to manually separate transactions

This is the major bug, without this call to restart the transaction all subsequent data is not tracked and reported to New Relic.

There is a second quality of life issue. Due to the location where we set the initial transaction name we end up with two traces per cron execution, one of them is flagged as unknown which encompasses everything up until we call newrelic_name_transaction for the first time, so basically when you see unknown on the CLI New Relic data you can think of it as "the time it takes to boot up cron:run and get to the point where you run the first actual job".

In effect what we have is the following

<?php
sleep(10); // cron:run loader
foreach ($i=1; $i<=3; $i++) {
    newrelic_name_transaction("job$i");
    sleep(10);
    newrelic_end_transaction();
}

Results

  • unknown transaction containing the booting of cron:run
  • job1 transaction only

When we need something like

<?php
newrelic_name_transaction("cron:run loader");
sleep(10); // cron:run loader
newrelic_end_transaction();
newrelic_start_transaction(ini_get("newrelic.appname"));
foreach ($i=1; $i<=3; $i++) {
    newrelic_name_transaction("job$i");
    sleep(10);
    newrelic_end_transaction();
    newrelic_start_transaction(ini_get("newrelic.appname"));
}

Results

  • cron:run loader transaction containing the booting of cron:run
  • job1 transaction
  • job2 transaction
  • job3 transaction

Hotfix

For anyone looking for an urgent patch to get their crons all properly logged in New Relic the following seems to be working for me. This ensures we always have a transaction running until this process terminates.

--- Model/NewRelicWrapper.php	2021-01-07 21:56:58.000000000 +0000
+++ Model/NewRelicWrapper.php	2021-09-17 09:11:31.000000000 +0100
@@ -79,6 +79,7 @@
     {
         if ($this->isExtensionInstalled()) {
             newrelic_end_transaction($ignore);
+            newrelic_start_transaction(ini_get("newrelic.appname"));
         }
     }

Proper solution

I have some rough ideas but I'll leave this for someone in Magento to pick up. I cant see any trivial way of doing what I propose as we'd need to couple the Cron module to New Relic reporting module a bit tighter.

Firstly we can give the cron:run its old transaction name back by removing this.

<type name="Magento\NewRelicReporting\Plugin\CommandPlugin">
<arguments>
<argument name="skipCommands" xsi:type="array">
<item xsi:type="boolean" name="cron:run">true</item>
</argument>
</arguments>
</type>

Then before we call startProfiling in ProcessCronQueueObserver.php we need to end the current new relic transaction, then start a new one.

It would also feel weird for NewRelicWrapper::endTransaction to have the responsibility of newrelic_start_transaction, but I wonder whether adding a new NewRelicWrapper::startTransaction would break backwards compatibility if its a new public method 🤔 But the breakage here is so severe that it makes the cron data for new relic practically without use so something needs to be addressed here.


Please provide Severity assessment for the Issue as Reporter. This information will help during Confirmation and Issue triage processes.

  • Severity: S0 - Affects critical data or functionality and leaves users without workaround.
  • Severity: S1 - Affects critical data or functionality and forces users to employ a workaround.
  • Severity: S2 - Affects non-critical data or functionality and forces users to employ a workaround.
  • Severity: S3 - Affects non-critical data or functionality and does not force users to employ a workaround.
  • Severity: S4 - Affects aesthetics, professional look and feel, “quality” or “usability”.

Activity

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Labels

Area: FrameworkIssue: ConfirmedGate 3 Passed. Manual verification of the issue completed. Issue is confirmedPriority: P2A defect with this priority could have functionality issues which are not to expectations.Progress: doneReproduced on 2.4.xThe issue has been reproduced on latest 2.4-develop branchTriage: Dev.ExperienceIssue related to Developer Experience and needs help with Triage to Confirm or Reject it

Type

No type

Projects

  • Status

    Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions