Change metric.Producer to be an Option on Reader #4346

Merged · 3 commits into open-telemetry:main on Aug 11, 2023

Conversation

@dashpole (Contributor) commented Jul 20, 2023

Updates the MetricProducer implementation to comply with open-telemetry/opentelemetry-specification#3613
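For context, the change moves external Producer registration from a post-construction call to an option supplied when the Reader is built. Below is a minimal usage sketch, assuming the option is named WithProducer (the exact option name is not spelled out in this thread) and using a hypothetical bridgeProducer type:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/metric/metricdata"
)

// bridgeProducer is a hypothetical external producer (e.g. an OpenCensus
// bridge) whose metrics should be appended to every collection.
type bridgeProducer struct{}

func (bridgeProducer) Produce(context.Context) ([]metricdata.ScopeMetrics, error) {
	return nil, nil // a real bridge would return externally produced metrics here
}

func main() {
	// After this change, producers are passed when the reader is constructed
	// instead of being registered on the reader afterwards.
	reader := metric.NewManualReader(metric.WithProducer(bridgeProducer{}))
	provider := metric.NewMeterProvider(metric.WithReader(reader))
	defer func() { _ = provider.Shutdown(context.Background()) }()
}
```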

codecov bot commented Jul 20, 2023

Codecov Report

Merging #4346 (d204195) into main (7b9fb7a) will decrease coverage by 0.1%.
The diff coverage is 100.0%.

Additional details and impacted files


@@           Coverage Diff           @@
##            main   #4346     +/-   ##
=======================================
- Coverage   78.8%   78.8%   -0.1%     
=======================================
  Files        253     253             
  Lines      20644   20630     -14     
=======================================
- Hits       16281   16267     -14     
  Misses      4014    4014             
  Partials     349     349             
Files Changed                    Coverage Δ
sdk/metric/manual_reader.go      74.1% <100.0%> (-2.4%) ⬇️
sdk/metric/periodic_reader.go    85.0% <100.0%> (+0.1%) ⬆️
sdk/metric/reader.go            100.0% <100.0%> (ø)

... and 1 file with indirect coverage changes

@dashpole force-pushed the prototype_metricreader_args branch from 4cae38d to 1740d9c on July 20, 2023 15:53
@dashpole (Contributor, Author) commented:

@MrAlias The register pattern came from your comment here: open-telemetry/opentelemetry-specification#2722 (comment)

@dashpole closed this Jul 20, 2023
@dashpole reopened this Aug 10, 2023
@dashpole force-pushed the prototype_metricreader_args branch 3 times, most recently from 3f19cae to 60473b7 on August 10, 2023 19:38
@dashpole changed the title from "Prototype for metric.Producer as an argument to Reader" to "Change metric.Producer to be an Option on Reader" on Aug 10, 2023
@dashpole marked this pull request as ready for review August 10, 2023 19:39
@pellared (Member) commented Aug 11, 2023

I think there is a potential race condition (which existed before) when Shutdown is invoked during Collect.

The reader is initialized with some externalProducers.

  1. Goroutine 1, calling Collect, reaches https://github.com/dashpole/opentelemetry-go/blob/60473b75286b9b2e87d6021db1d9056565f577cf/sdk/metric/manual_reader.go#L136 (r.externalProducers contains some elements).
  2. Goroutine 2 calls Shutdown and finishes -> r.externalProducers is set to nil (inside a lock).
  3. Goroutine 1 continues and ranges over mr.externalProducers without any synchronization.

The same problem could occur if one manually invokes Collect on PeriodicReader.
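To make the sequence concrete, here is a stripped-down, illustrative stand-in for the reader (not the actual SDK code) showing where the unsynchronized read happens:

```go
package readersketch

import (
	"context"
	"errors"
	"sync"

	"go.opentelemetry.io/otel/sdk/metric/metricdata"
)

var errReaderShutdown = errors.New("reader is shutdown")

// Producer mirrors the SDK's external producer interface.
type Producer interface {
	Produce(context.Context) ([]metricdata.ScopeMetrics, error)
}

// reader is a simplified stand-in for manualReader.
type reader struct {
	mu                sync.Mutex
	isShutdown        bool
	externalProducers []Producer
}

func (r *reader) Shutdown(context.Context) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.isShutdown = true
	r.externalProducers = nil // step 2: cleared under the lock
	return nil
}

func (r *reader) Collect(ctx context.Context) error {
	r.mu.Lock()
	if r.isShutdown {
		r.mu.Unlock()
		return errReaderShutdown
	}
	r.mu.Unlock()
	// Steps 1 and 3: the slice is read without holding the lock, so a
	// concurrent Shutdown can overwrite it mid-collection -- a data race.
	for _, p := range r.externalProducers {
		_, _ = p.Produce(ctx)
	}
	return nil
}
```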

How to fix it? I am not sure 😉

My initial thought is to change mu sync.Mutex to mu sync.RWMutex and use mu.RLock in Collect. Shutdown would then only clear the state when no Collect is running. We could also use the ctx passed to Shutdown (which is currently not used) to make sure the caller controls how long they wait for Shutdown to complete. This would be a "graceful shutdown".
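A rough sketch of this first option, reusing the simplified reader from the sketch above with mu changed to sync.RWMutex (illustrative only):

```go
// Collect holds a read lock for the whole collection, so Shutdown (which
// needs the write lock) waits for in-flight collections before clearing state.
func (r *reader) Collect(ctx context.Context) error {
	r.mu.RLock()
	defer r.mu.RUnlock()
	if r.isShutdown {
		return errReaderShutdown
	}
	for _, p := range r.externalProducers {
		_, _ = p.Produce(ctx)
	}
	return nil
}

// Shutdown acquires the write lock in a goroutine so the caller's ctx can
// bound how long the "graceful" shutdown is allowed to wait. If ctx expires
// first, the goroutine still clears the state once the lock becomes free.
func (r *reader) Shutdown(ctx context.Context) error {
	done := make(chan struct{})
	go func() {
		r.mu.Lock() // blocks until no Collect holds the read lock
		defer r.mu.Unlock()
		r.isShutdown = true
		r.externalProducers = nil
		close(done)
	}()
	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```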

My second idea is to remove mu sync.Mutex and isShutdown bool, and replace externalProducers []Producer with externalProducers atomic.Pointer[[]Producer]. Then we would have a lock-free implementation. This would be a kind of "force shutdown", and it would be more performant (consume fewer resources).

The only drawback I can think of is that Shutdown is not as well synchronized: when Shutdown finishes, there may still be some processing in flight. When metricProvider.Shutdown() returns, there may still be some manual Collects running. However, the code that would still be running is code called by the user; the SDK's periodic reader collects would already be done thanks to https://github.com/dashpole/opentelemetry-go/blob/60473b75286b9b2e87d6021db1d9056565f577cf/sdk/metric/periodic_reader.go#L330. If the caller invokes Collect in a goroutine, they should make sure it finishes themselves. Personally, I would lean towards this solution: while in "business" software I would say this is too complex, for us I think minimizing resource consumption is very important.
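And a sketch of this second, lock-free variant, again on the simplified reader but with the two fields replaced by a single generic atomic.Pointer from sync/atomic (Go 1.19+); a constructor would Store the initial producer slice, and the details are illustrative:

```go
type reader struct {
	// A nil pointer doubles as the "shutdown" marker, replacing mu and isShutdown.
	externalProducers atomic.Pointer[[]Producer]
}

func (r *reader) Shutdown(context.Context) error {
	// "Force" shutdown: in-flight Collects that already loaded the slice may still finish.
	r.externalProducers.Store(nil)
	return nil
}

func (r *reader) Collect(ctx context.Context) error {
	ps := r.externalProducers.Load()
	if ps == nil {
		return errReaderShutdown
	}
	for _, p := range *ps {
		_, _ = p.Produce(ctx)
	}
	return nil
}
```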

This comment is not blocking this PR as this PR does not introduce this problem. I just noticed it when reviewing.

@dashpole (Contributor, Author) commented:

I'm surprised this isn't caught by our concurrency tests...

@dashpole (Contributor, Author) commented Aug 11, 2023

I "fixed" our concurrency tests to expose the race condition.

@dashpole (Contributor, Author) commented:

I believe switching back to atomic.Value to hold producers fixed the race condition.
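The conversation indicates the merged fix holds the producer slice in an atomic.Value again. A sketch of that shape on the simplified reader from earlier (what exactly Shutdown stores is illustrative, not taken from the diff):

```go
type reader struct {
	// Holds a []Producer; Collect reads a consistent snapshot and Shutdown
	// swaps in a new slice without blocking on in-flight collections.
	externalProducers atomic.Value
}

func newReader(producers ...Producer) *reader {
	r := &reader{}
	if producers == nil {
		producers = []Producer{}
	}
	r.externalProducers.Store(producers)
	return r
}

func (r *reader) Shutdown(context.Context) error {
	// atomic.Value cannot store an untyped nil, so store an empty slice instead.
	r.externalProducers.Store([]Producer{})
	return nil
}

func (r *reader) Collect(ctx context.Context) error {
	for _, p := range r.externalProducers.Load().([]Producer) {
		_, _ = p.Produce(ctx)
	}
	return nil
}
```

In the real reader the shutdown check is separate; the point here is only that every read and write of the slice goes through atomic.Value, so overlapping Collect and Shutdown no longer race.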

@MrAlias (Contributor) left a review comment:


🥇

@MrAlias merged commit fe51391 into open-telemetry:main on Aug 11, 2023
21 checks passed
@pellared (Member) commented:

> I believe switching back to atomic.Value to hold producers fixed the race condition.

👍 PS. I had not noticed that it was atomic.Value before 😬 Anyway, I am happy that I was able to find the issue 😄
