Fix race condition in otelcol component wrappers #2027
Thanks for working on this, I think that's a step in the right direction.
The scheduler logic was not clear to me at first. As I understand it, the components are created in the paused state, so they are not paused on the first run, but they are always resumed after being started, because they are paused whenever the scheduler stops running. Is that right? If so, maybe we could put this explanation in the scheduler code.
// called before. See Pause for more details.
func (c *Consumer) Resume() {
	c.pauseMut.Unlock()
}
If Pause is called by accident on a paused consumer, or Resume on a running consumer, the result will be dramatic. Should the consumer hold its state (something like "isPaused") so that it avoids locking if it is already paused and avoids unlocking if it is already running, for extra safety?
Yeah, I was thinking about this too, I wouldn't want to have any dramatic panics either. I've changed the implementation to handle multiple calls to Resume and Pause - it's a bit more complex, but I also wrote a lot of tests.
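To make the idea of repeatable Pause and Resume calls concrete, here is a minimal sketch. It is not the implementation from this PR: the `PausableConsumer` type, its `paused` flag, the `resume` channel, and the `Consume`, `IsPaused`, and `Received` methods are all hypothetical names chosen for illustration. Only `New`, `NewPaused`, `Pause`, and `Resume` echo names mentioned in the review.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// PausableConsumer is a hypothetical stand-in for the consumer wrapper
// discussed in the review. It tracks its own state so that repeated
// Pause or Resume calls are harmless no-ops.
type PausableConsumer struct {
	mut      sync.Mutex
	paused   bool
	resume   chan struct{} // closed on Resume; replaced on each new Pause
	received []string
}

// New returns a consumer that accepts data immediately.
func New() *PausableConsumer { return &PausableConsumer{} }

// NewPaused returns a consumer that blocks Consume until Resume is called.
func NewPaused() *PausableConsumer {
	return &PausableConsumer{paused: true, resume: make(chan struct{})}
}

// Pause makes later Consume calls block. Calling it while paused does nothing.
func (c *PausableConsumer) Pause() {
	c.mut.Lock()
	defer c.mut.Unlock()
	if c.paused {
		return
	}
	c.paused = true
	c.resume = make(chan struct{})
}

// Resume releases blocked Consume calls. Calling it while running does nothing.
func (c *PausableConsumer) Resume() {
	c.mut.Lock()
	defer c.mut.Unlock()
	if !c.paused {
		return
	}
	c.paused = false
	close(c.resume)
}

// IsPaused reports the current state.
func (c *PausableConsumer) IsPaused() bool {
	c.mut.Lock()
	defer c.mut.Unlock()
	return c.paused
}

// Consume stores the item, waiting first if the consumer is paused.
// The loop re-checks the state in case Pause is called again right after a Resume.
func (c *PausableConsumer) Consume(item string) {
	for {
		c.mut.Lock()
		if !c.paused {
			c.received = append(c.received, item)
			c.mut.Unlock()
			return
		}
		wait := c.resume
		c.mut.Unlock()
		<-wait
	}
}

// Received returns a copy of everything consumed so far.
func (c *PausableConsumer) Received() []string {
	c.mut.Lock()
	defer c.mut.Unlock()
	return append([]string(nil), c.received...)
}

func main() {
	c := NewPaused()
	done := make(chan struct{})
	go func() {
		c.Consume("span-1")
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("unexpected: consumed while paused")
	case <-time.After(50 * time.Millisecond):
		fmt.Println("still blocked while paused")
	}

	c.Resume()
	c.Resume() // second call is a no-op rather than a fatal unlock
	<-done
	fmt.Println("received:", c.Received())
}
```

Keeping the state in a flag guarded by a mutex, instead of using the mutex itself as the pause signal, is what makes the extra calls safe in this sketch. The actual change in the PR may take a different approach.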
@@ -100,6 +129,7 @@ func (cs *Scheduler) Run(ctx context.Context) error {

level.Debug(cs.log).Log("msg", "scheduling components", "count", len(components))
components = cs.startComponents(ctx, host, components...)
cs.onResume()
If the component was created with New() instead of NewPaused(), then this will panic, because on the first run you will try to unlock a mutex that is not locked.
Now it will no longer panic as calls to Resume can be repeated.
When writing code we still need to take care to use the correct paused or resumed consumer, but the consequences of a mistake shouldn't be catastrophic.
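The ordering discussed in this thread (start every component, then resume the consumers, and pause them again when the scheduler stops) can be sketched as follows. This is a simplified illustration under stated assumptions, not the Alloy scheduler: the `component` interface, `scheduler` struct, `runOnce` method, `eventLog`, and `fakeComponent` are hypothetical, and the reverse shutdown order is an arbitrary choice for the sketch. Only the `onResume` callback name appears in the diff above; `onPause` is assumed to be its counterpart.

```go
package main

import "fmt"

// component is a hypothetical, simplified OTel-style component:
// Start returns once setup is done instead of blocking.
type component interface {
	Start() error
	Shutdown() error
}

// scheduler holds the callbacks that pause and resume the consumers.
type scheduler struct {
	onPause  func()
	onResume func()
}

// runOnce starts all components, resumes the consumers only after every
// Start has returned, and pauses them again before shutting down.
func (s *scheduler) runOnce(stop <-chan struct{}, components ...component) {
	started := make([]component, 0, len(components))
	for _, c := range components {
		if err := c.Start(); err != nil {
			fmt.Println("failed to start component:", err)
			continue
		}
		started = append(started, c)
	}

	// Data may flow only from this point on.
	s.onResume()

	<-stop

	// Block new data before tearing components down.
	s.onPause()
	for i := len(started) - 1; i >= 0; i-- {
		if err := started[i].Shutdown(); err != nil {
			fmt.Println("failed to shut down component:", err)
		}
	}
}

// eventLog records the order of calls for the demonstration.
type eventLog struct{ events []string }

func (l *eventLog) add(e string) { l.events = append(l.events, e) }

// fakeComponent logs its lifecycle calls and never fails.
type fakeComponent struct {
	name string
	rec  *eventLog
}

func (f *fakeComponent) Start() error    { f.rec.add("start " + f.name); return nil }
func (f *fakeComponent) Shutdown() error { f.rec.add("shutdown " + f.name); return nil }

func main() {
	rec := &eventLog{}
	s := &scheduler{
		onPause:  func() { rec.add("pause") },
		onResume: func() { rec.add("resume") },
	}

	stop := make(chan struct{})
	close(stop) // stop immediately so the demo runs to completion

	s.runOnce(stop, &fakeComponent{name: "a", rec: rec}, &fakeComponent{name: "b", rec: rec})
	for _, e := range rec.events {
		fmt.Println(e)
	}
}
```

With this ordering, a consumer created in the paused state never needs an explicit Pause on the first run, and a consumer created in the running state only receives a redundant Resume, which is harmless as long as Resume can be repeated.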
Just some edits to align with the same changes in #2012. These might also be "fixed" when the PR conflicts are resolved?
docs/sources/reference/components/otelcol/otelcol.exporter.datadog.md
docs/sources/reference/components/otelcol/otelcol.connector.spanmetrics.md
docs/sources/reference/components/otelcol/otelcol.processor.interval.md
PR Description
We have a general issue with OTel components where consumers may be used before the Start functions in OTel have finished running. This is because in OTel, Start functions are non-blocking and sometimes do setup work, as was the case for batch_processor. In Alloy, however, we have a Run function that blocks for the lifetime of the component. As soon as it is called, we consider the component running. In OTel, by contrast, the Start function must be called and return before a component is considered running.
The solution here is to pause the Consumer until we are sure that the OTel component scheduler has called Start on all OTel components. The consumer will block any attempts to feed data to it.
Which issue(s) this PR fixes
Notes to the Reviewer
PR Checklist