-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[cmd/opampsupervisor] Handle OpAMP connection settings #30237
Conversation
While debugging the test failure I found that wsclient |
I think open-telemetry/opamp-go#213 will fix this. 🤔 |
Both fixes should be there because |
It seems that If we consider |
I see these two fixes as complementary. Even as an internal implementation detail, I believe the context cancellation needs to be respected and closing the connection shouldn't be the only way to break out the receiver loop. |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thanks for adding this.
s.config.Server = newServerConfig | ||
|
||
if err := s.startOpAMP(); err != nil { | ||
s.logger.Error("Cannot connect to the OpAMP server using the new settings", zap.Error(err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
startOpAMP() initiates but does not wait for successful connection. I think we need to wait for it before we consider the new opamp settings successful. It can be done in a future PR, but would be good to add a TODO here.
return headers | ||
} | ||
|
||
func (s *Supervisor) onOpampConnectionSettings(_ context.Context, settings *protobufs.OpAMPConnectionSettings) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we try to perform this operation in a separate goroutine so that we don't block here in this callback for the entire duration of switching the connections? Let the callback return quickly and initiate the re-establishing of the connection in a separate goroutine.
This would ensure that if the currently received message contains more data and not just connection settings all that data will be processed. With the current approach I am not sure what exactly will happen.
I think callbacks generally should avoid doing long-lasting blocking operations since they than block other callbacks and the entire opamp operation.
I also think the comment here is completely misleading. We should either implement what the comment says (i.e. the caller should do the reconnection) or fix the comment to sat the callback implementation should do the reconnection. @andykellr @evan-bradley Any thoughts on what you would prefer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our agent does not currently support AcceptsOpAMPConnectionSettings
so I have not spent much time with this part of the implementation. Looking at the code, the following things are confusing to me:
-
There are several references to
OnRemoteConfig
(e.g. "See OnRemoteConfig for the behavior."). This was removed a while ago. -
This sentence doesn't make sense:
// If OnOpampConnectionSettings returns nil and then the caller will
// attempt to reconnect to the OpAMP Server using the new settings.
We should clearly state who is responsible for establishing a new connection. I could imagine this being fully implemented in the client library and only have OnOpampConnectionSettings
responsible for validating the fields (e.g. endpoint matching a list of acceptable servers or certificate signed by acceptable authority). If this returns nil, the library would attempt to start a new client with the new settings, send an initial status message, and stop and transition to the new connection on success.
If we expect the implementer of the callback to do this, we should describe the process we expect to take place and implement that in tests and the example agent.
- If we do this in a separate goroutine like the example agent in the opamp-go library, we have no way of knowing if it will be successful and should be accepted. Returning nil from the callback will cause the settings to be accepted. If we create a separate goroutine we will have to wait for it before returning here.
It is not possible to confirm success until the client is started and an initial status message is sent.
- OnOpampConnectionSettingsAccepted references OnOpampConnectionSettingsOffer which was renamed OnOpampConnectionSettings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would ensure that if the currently received message contains more data and not just connection settings all that data will be processed. With the current approach I am not sure what exactly will happen.
onOpampConnectionSettings
is the last part of the message processing in client implementation.
If we do this in a separate goroutine like the example agent in the opamp-go library, we have no way of knowing if it will be successful and should be accepted.
Right, the returned value from this callback indicates the reject/accept status of the connection settings. I don't see how we can make this async without changing how the client is expected to handle OnOpampConnectionSettings
. Also, we don't report the error as described in OnOpampConnectionSettings
open-telemetry/opamp-spec#164.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should clearly state who is responsible for establishing a new connection. I could imagine this being fully implemented in the client library and only have OnOpampConnectionSettings responsible for validating the fields (e.g. endpoint matching a list of acceptable servers or certificate signed by acceptable authority). If this returns nil, the library would attempt to start a new client with the new settings, send an initial status message, and stop and transition to the new connection on success.
I think this was the original intent. However, I am not sure this is the best approach. It is also not how the example implementation works and I don't see good arguments against how the example is implemented today.
To summarize this is what the example is supposed to do:
- OnOpampConnectionSettings callback called when new settings are offered by the Server.
- The callback implementation first pre-verifies the settings (e.g. check certificates, etc).
- If checks in (2) pass the callback implementation logs that it is starting to reconnect, creates a reconnection goroutine and returns from the callback.
- Reconnection goroutine stops the Client, creates a new Client with new connection settings and wait for successful OnConnect() callback.
- If OnConnect() is not called within a predefined period of time reconnection goroutine assumes the new connection settings are bad, reverts to the old connection settings and re-creates the Client again.
- We get rid of OnOpampConnectionSettingsAccepted(), it is not needed.
The example implementation of steps 4 and 5 is not complete today (e.g not waiting for OnConnect), but can be modified to match what I described.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created an issue for this open-telemetry/opamp-go#261
Let's also discuss today in our call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, open-telemetry/opamp-go#266 removes OnOpampConnectionSettingsAccepted callback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the implementation to run in a separate goroutine and use OnConnect to determine the connection status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have much to add here, I don't see any potential issues here and @tigrannajaryan has more context here.
ReportsOwnMetrics *bool `mapstructure:"reports_own_metrics"` | ||
ReportsHealth *bool `mapstructure:"reports_health"` | ||
ReportsRemoteConfig *bool `mapstructure:"reports_remote_config"` | ||
AcceptsRemoteConfig *bool `mapstructure:"accepts_remote_config"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I had missed this in earlier, and isn't strictly relevant for the PR but I am curious, what does a pointer value add here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the zero-value for bool
is false
but the default value may be true
, using a pointer instead allows us to use a nil pointer value as a way to represent when the user didn't pass a value in and a default should be used.
|
||
newServerConfig := &config.OpAMPServer{} | ||
|
||
if settings.DestinationEndpoint != "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there any value in using some of the confmap
work to merge these values together instead of needing to update each one explicitly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not very familiar with confmap
work. I will take a look at it and If it helps here I can send a separate PR for it.
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
**Description:** <Describe what has changed.> - Linting Errors: [https://productionresultssa5.blob.core.windows.net/actions-results/44e4093a-f4c1-4e35-af4f-d630ea9b8d68/workflow-job-run-071cf28a-9760-544e-287f-c0f9ae5a03a4/logs/job/job-logs.txt?rsct=text%2Fplain&se=2024-03-25T12%3A45%3A21Z&sig=fRzhKwEpzDQDQZyUYwvEY%2BhkcFWOD3IY0EeCOE47AeU%3D&sp=r&spr=https&sr=b&st=2024-03-25T12%3A35%3A16Z&sv=2021-12-02](url) - some linting errors will be fixed by #30237 **Link to tracking Issue:** <Issue number if applicable> - #31240 **Testing:** <Describe what testing was performed and which tests were added.> **Documentation:** <Describe the documentation added.>
Could I ask you to clean up the mod file? |
Link to tracking Issue:
Part of #21043; based on top of #29848 to add test
Testing:
Added integration test