Skip to content

NPE while collecting worker metrics #18545

@pettermahlen

Description

@pettermahlen

Description of the bug:

We occasionally see this error in CI:

java.lang.NullPointerException: Null lastCallTime
      at com.google.devtools.build.lib.worker.AutoValue_WorkerMetric_WorkerStat.<init>(AutoValue_WorkerMetric_WorkerStat.java:21)
      at com.google.devtools.build.lib.worker.WorkerMetric$WorkerStat.create(WorkerMetric.java:51)
      at com.google.devtools.build.lib.worker.WorkerMetricsCollector.collectMetrics(WorkerMetricsCollector.java:224)
      at com.google.devtools.build.lib.profiler.CollectLocalResourceUsage.run(CollectLocalResourceUsage.java:179)

It appears to be caused by a race between this method:

  public void registerWorker(WorkerMetric.WorkerProperties properties) {
    int workerId = properties.getWorkerId();

    workerIdToWorkerProperties.putIfAbsent(workerId, properties);
    workerLastCallTime.put(workerId, Instant.ofEpochMilli(clock.currentTimeMillis()));
  }

And the iteration over the keys in the workerIdToWorkerProperties map in collectMetrics, where it's possible for the workerLastCallTime map to not yet have an entry for a worker ID that is present in the workerIdToWorkerProperties map.

Since d31dd09, this should no longer be a problem on master, but we're running an older version of Bazel.

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

No easy repro.

Which operating system are you running Bazel on?

linux, windows, macos

What is the output of bazel info release?

release 6.1.0-ec97d6a

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

We apply some local patches, they should not affect this behaviour.

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

d233c89 introduced the race.

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

This patch probably fixes the issue by ensuring that the workerLastCallTime map gets populated before the workerIdToWorkerProperties map:

diff --git a/src/main/java/com/google/devtools/build/lib/worker/WorkerMetricsCollector.java b/src/main/java/com/google/devtools/build/lib/worker/WorkerMetricsCollector.java
index b301f6db93..ecee7b4965 100644
--- a/src/main/java/com/google/devtools/build/lib/worker/WorkerMetricsCollector.java
+++ b/src/main/java/com/google/devtools/build/lib/worker/WorkerMetricsCollector.java
@@ -263,8 +263,8 @@ public class WorkerMetricsCollector {
   public void registerWorker(WorkerMetric.WorkerProperties properties) {
     int workerId = properties.getWorkerId();

-    workerIdToWorkerProperties.putIfAbsent(workerId, properties);
     workerLastCallTime.put(workerId, Instant.ofEpochMilli(clock.currentTimeMillis()));
+    workerIdToWorkerProperties.putIfAbsent(workerId, properties);
   }

   private synchronized MetricsWithTime updateLastCollectMetrics(

We're going include that in our local patches, and should be able to confirm whether that fixes the issue

Metadata

Metadata

Assignees

No one assigned

    Labels

    P3We're not considering working on this, but happy to review a PR. (No assignee)help wantedSomeone outside the Bazel team could own thisteam-Local-ExecIssues and PRs for the Execution (Local) teamtype: bug

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions