
Update Cloud Run Job task_id to avoid high cardinality? #874

Open

alethenorio opened this issue Aug 1, 2024 · 4 comments

Labels: enhancement (New feature or request), priority: p2

@alethenorio commented Aug 1, 2024
I recently set up OpenTelemetry in a Go application running inside a Cloud Run Job and noticed that the monitored-resource task_id label changed on every single write. In this specific case the job runs fairly frequently (several times an hour), which immediately made me think that, long term, this might cause high cardinality on every metric written by that job.

After some debugging, I can see that gcp.NewDetector() is configured to return the instance ID from the metadata server (which comes out as a long ID like 0087244a809d22283efa2....), which in turn is used by OTel as the FaaSInstanceKey and is eventually exported as the task_id of a generic_task.

Reading the definition of the label, I am left wondering whether it might make more sense to default to something like the Cloud Run Job task index to avoid long-term high-cardinality issues.

Or is this nothing to be concerned about? (Not a monitoring expert here.)
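
For reference, this is roughly how the detector is wired into the SDK resource in my setup (a minimal sketch; the attribute dump at the end is only there to show what ends up as resource labels):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/GoogleCloudPlatform/opentelemetry-operations-go/detectors/gcp"
	"go.opentelemetry.io/otel/sdk/resource"
)

func main() {
	ctx := context.Background()

	// On a Cloud Run Job, gcp.NewDetector() fills in faas.instance with the
	// container instance ID from the metadata server; the Cloud Monitoring
	// exporter later maps that attribute to the generic_task task_id label.
	res, err := resource.New(ctx,
		resource.WithDetectors(gcp.NewDetector()),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Print the detected attributes to see what will become resource labels.
	for _, kv := range res.Attributes() {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value.Emit())
	}
}
```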

@dashpole self-assigned this Aug 1, 2024
@dashpole added the question (Further information is requested) label Aug 1, 2024
@dashpole (Contributor) commented Aug 1, 2024

Do you know if only a single instance can exist for a given task index? We just need to make sure there aren't collisions.

@alethenorio (Author) commented

What kind of collision are you thinking of? If two tasks were to share an instance ID and write the exact same metric at roughly the same time?

According to the Cloud Run Jobs documentation, each task gets its own instance:

> Each task runs one container instance and can be configured to retry in case of failure.

@dashpole added the bug (Something isn't working), priority: p2, and enhancement (New feature or request) labels and removed the question (Further information is requested) and bug (Something isn't working) labels Sep 4, 2024
@dashpole (Contributor) commented Sep 4, 2024

xref: #465

@dashpole assigned damemi and unassigned dashpole Sep 4, 2024
@damemi (Member) commented Sep 4, 2024

Do you mean the TASK_INDEX? As I understand it, that is a number assigned to each task within a job (i.e., 0, 1, ...). This means different invocations of the same job would end up with the same job + task_id labels. I don't think that would be a problem with Cloud Monitoring.

But using the instance ID matches the OTel FaaS conventions more closely than the task ID does (https://opentelemetry.io/docs/specs/semconv/attributes-registry/faas/#function-as-a-service-attributes).

We did add the gcp.cloud_run.job.task_index attribute precisely because the instance ID was already taken as the task_id resource label. This actually prompted a lengthy discussion about how Cloud Run Jobs don't fit well into the OTel conventions (open-telemetry/opentelemetry-specification#3378, if you're interested); they're quite different even from Cloud Run Services because of this.

From what I understand, the Cloud Monitoring backend scales well enough that the instance ID shouldn't cause cardinality issues there. But if you are using the GCP resource detector and exporting the data somewhere else, the task index should already be available to you via the gcp.cloud_run.job.task_index label.
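
If the long instance ID does turn out to be a problem for another backend, one workaround (a sketch, not something the detector does for you; CLOUD_RUN_TASK_INDEX is the documented Cloud Run Jobs environment variable) would be to override faas.instance yourself when building the resource:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"github.com/GoogleCloudPlatform/opentelemetry-operations-go/detectors/gcp"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/sdk/resource"
)

// jobResource detects the usual GCP attributes, then overrides faas.instance
// (the key behind FaaSInstanceKey) with the Cloud Run Job task index.
// Caveat from the discussion above: this reuses the same task_id across
// executions, so different runs of the job will share resource labels.
func jobResource(ctx context.Context) (*resource.Resource, error) {
	detected, err := resource.New(ctx, resource.WithDetectors(gcp.NewDetector()))
	if err != nil {
		return nil, err
	}

	override := resource.NewSchemaless(
		attribute.String("faas.instance", os.Getenv("CLOUD_RUN_TASK_INDEX")),
	)

	// On key conflicts, the second resource passed to Merge wins.
	return resource.Merge(detected, override)
}

func main() {
	res, err := jobResource(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)
}
```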
