
Update Cloud Run Job task_id to avoid high cardinality? #874

Open

alethenorio opened this issue Aug 1, 2024 · 4 comments

Labels: enhancement (New feature or request), priority: p2

@alethenorio commented Aug 1, 2024
I recently set up OpenTelemetry in a Go application running inside a Cloud Run Job and noticed that the monitored-resource task_id label changed on every single write. In this specific case the job runs fairly frequently (several times an hour), which immediately made me think that, long term, this might cause high cardinality on every metric written by that job.

After some debugging, I can see that gcp.NewDetector() is configured to return the instance ID from the metadata server (which comes out as a long ID like 0087244a809d22283efa2....), which in turn is used by OTel as the FaaSInstanceKey and is eventually exported as the task_id of a generic_task.

Reading the definition of the label, I am left wondering whether it might make more sense to default to something like the Cloud Run Job task index to avoid long-term high-cardinality issues.

Or is this nothing to be concerned about? (Not a monitoring expert here.)
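
For reference, this is roughly how the detector is wired into the SDK resource in my setup (a minimal sketch; the attribute dump at the end is only there to show what ends up as resource labels):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/GoogleCloudPlatform/opentelemetry-operations-go/detectors/gcp"
	"go.opentelemetry.io/otel/sdk/resource"
)

func main() {
	ctx := context.Background()

	// On a Cloud Run Job, gcp.NewDetector() fills in faas.instance with the
	// container instance ID from the metadata server; the Cloud Monitoring
	// exporter later maps that attribute to the generic_task task_id label.
	res, err := resource.New(ctx,
		resource.WithDetectors(gcp.NewDetector()),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Print the detected attributes to see what will become resource labels.
	for _, kv := range res.Attributes() {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value.Emit())
	}
}
```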

@dashpole self-assigned this Aug 1, 2024
@dashpole added the question (Further information is requested) label Aug 1, 2024
@dashpole (Contributor) commented Aug 1, 2024

Do you know if only a single instance can exist for a given task index? We just need to make sure there aren't collisions.

@alethenorio (Author) commented

What kind of collision are you thinking of? If two tasks were to share an instance ID and write the exact same metric at roughly the same time?

According to the Cloud Run Jobs documentation, each task gets its own instance:

> Each task runs one container instance and can be configured to retry in case of failure.

@dashpole added the bug (Something isn't working), priority: p2, and enhancement (New feature or request) labels and removed the question (Further information is requested) and bug (Something isn't working) labels Sep 4, 2024
@dashpole (Contributor) commented Sep 4, 2024

xref: #465

@dashpole assigned damemi and unassigned dashpole Sep 4, 2024
@damemi (Member) commented Sep 4, 2024

Do you mean the TASK_INDEX? As I understand it, that is a number assigned to each task within a job (i.e., 0, 1, ...). This means different invocations of the same job would end up with the same job + task_id labels. I don't think that would be a problem with Cloud Monitoring.

But using the instance ID matches the OTel FaaS conventions more closely than the task ID does (https://opentelemetry.io/docs/specs/semconv/attributes-registry/faas/#function-as-a-service-attributes).

We did add the gcp.cloud_run.job.task_index attribute precisely because the instance ID was already taken as the task_id resource label. This actually prompted a lengthy discussion about how Cloud Run Jobs don't fit well into the OTel conventions (open-telemetry/opentelemetry-specification#3378, if you're interested); they're quite different even from Cloud Run Services because of this.

From what I understand, the Cloud Monitoring backend scales well enough that the instance ID shouldn't cause cardinality issues there. But if you are using the GCP resource detector and exporting the data somewhere else, the task index should already be available to you via the gcp.cloud_run.job.task_index label.
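
If the long instance ID does turn out to be a problem for another backend, one workaround (a sketch, not something the detector does for you; CLOUD_RUN_TASK_INDEX is the documented Cloud Run Jobs environment variable) would be to override faas.instance yourself when building the resource:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"github.com/GoogleCloudPlatform/opentelemetry-operations-go/detectors/gcp"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/sdk/resource"
)

// jobResource detects the usual GCP attributes, then overrides faas.instance
// (the key behind FaaSInstanceKey) with the Cloud Run Job task index.
// Caveat from the discussion above: this reuses the same task_id across
// executions, so different runs of the job will share resource labels.
func jobResource(ctx context.Context) (*resource.Resource, error) {
	detected, err := resource.New(ctx, resource.WithDetectors(gcp.NewDetector()))
	if err != nil {
		return nil, err
	}

	override := resource.NewSchemaless(
		attribute.String("faas.instance", os.Getenv("CLOUD_RUN_TASK_INDEX")),
	)

	// On key conflicts, the second resource passed to Merge wins.
	return resource.Merge(detected, override)
}

func main() {
	res, err := jobResource(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)
}
```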
