Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[cmd/opampsupervisor] Handle OpAMP connection settings #30237

Merged
merged 24 commits into from
Apr 9, 2024

Conversation

srikanthccv
Copy link
Member

@srikanthccv srikanthccv commented Dec 30, 2023

Link to tracking Issue:

Part of #21043; based on top of #29848 to add test

Testing:

Added integration test

@srikanthccv
Copy link
Member Author

srikanthccv commented Jan 17, 2024

While debugging the test failure I found that wsclient Stop gets indefinitely blocked by ReceiverLoop unless some error occurs (or conn close). I attempted to fix that here open-telemetry/opamp-go#240. @evan-bradley @tigrannajaryan take a look when you get some time.

@haoqixu
Copy link
Member

haoqixu commented Jan 18, 2024

While debugging the test failure I found that wsclient Stop gets indefinitely blocked by ReceiverLoop unless some error occurs (or conn close). I attempted to fix that here open-telemetry/opamp-go#240. @evan-bradley @tigrannajaryan take a look when you get some time.

I think open-telemetry/opamp-go#213 will fix this. 🤔

@srikanthccv
Copy link
Member Author

Both fixes should be there because ReceiverLoop should stop on the context cancellation as well.

@haoqixu
Copy link
Member

haoqixu commented Jan 18, 2024

Both fixes should be there because ReceiverLoop should stop on the context cancellation as well.

It seems that gorilla/websocket doesn't provide a method to unblock the bloking reading of connection. And we need to close the connection to unblock the reading.

If we consider ReceiverLoop as an internal implementation detail of the wsclient and we could gracefully shut down the wsclient, I think it's not a problem.

@srikanthccv
Copy link
Member Author

I see these two fixes as complementary. Even as an internal implementation detail, I believe the context cancellation needs to be respected and closing the connection shouldn't be the only way to break out the receiver loop.

@srikanthccv srikanthccv marked this pull request as ready for review January 25, 2024 19:08
@srikanthccv srikanthccv requested a review from a team January 25, 2024 19:08
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

Copy link
Contributor

@evan-bradley evan-bradley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks for adding this.

s.config.Server = newServerConfig

if err := s.startOpAMP(); err != nil {
s.logger.Error("Cannot connect to the OpAMP server using the new settings", zap.Error(err))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

startOpAMP() initiates but does not wait for successful connection. I think we need to wait for it before we consider the new opamp settings successful. It can be done in a future PR, but would be good to add a TODO here.

return headers
}

func (s *Supervisor) onOpampConnectionSettings(_ context.Context, settings *protobufs.OpAMPConnectionSettings) error {
Copy link
Member

@tigrannajaryan tigrannajaryan Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we try to perform this operation in a separate goroutine so that we don't block here in this callback for the entire duration of switching the connections? Let the callback return quickly and initiate the re-establishing of the connection in a separate goroutine.

This would ensure that if the currently received message contains more data and not just connection settings all that data will be processed. With the current approach I am not sure what exactly will happen.

I think callbacks generally should avoid doing long-lasting blocking operations since they than block other callbacks and the entire opamp operation.

I also think the comment here is completely misleading. We should either implement what the comment says (i.e. the caller should do the reconnection) or fix the comment to sat the callback implementation should do the reconnection. @andykellr @evan-bradley Any thoughts on what you would prefer?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our agent does not currently support AcceptsOpAMPConnectionSettings so I have not spent much time with this part of the implementation. Looking at the code, the following things are confusing to me:

  1. There are several references to OnRemoteConfig (e.g. "See OnRemoteConfig for the behavior."). This was removed a while ago.

  2. This sentence doesn't make sense:

// If OnOpampConnectionSettings returns nil and then the caller will
// attempt to reconnect to the OpAMP Server using the new settings.

We should clearly state who is responsible for establishing a new connection. I could imagine this being fully implemented in the client library and only have OnOpampConnectionSettings responsible for validating the fields (e.g. endpoint matching a list of acceptable servers or certificate signed by acceptable authority). If this returns nil, the library would attempt to start a new client with the new settings, send an initial status message, and stop and transition to the new connection on success.

If we expect the implementer of the callback to do this, we should describe the process we expect to take place and implement that in tests and the example agent.

  1. If we do this in a separate goroutine like the example agent in the opamp-go library, we have no way of knowing if it will be successful and should be accepted. Returning nil from the callback will cause the settings to be accepted. If we create a separate goroutine we will have to wait for it before returning here.

https://github.com/open-telemetry/opamp-go/blob/7e92da0f17ef9f2fd0a387dd6b62b451c80f4207/client/internal/receivedprocessor.go#L198-L202

It is not possible to confirm success until the client is started and an initial status message is sent.

  1. OnOpampConnectionSettingsAccepted references OnOpampConnectionSettingsOffer which was renamed OnOpampConnectionSettings

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would ensure that if the currently received message contains more data and not just connection settings all that data will be processed. With the current approach I am not sure what exactly will happen.

onOpampConnectionSettings is the last part of the message processing in client implementation.

If we do this in a separate goroutine like the example agent in the opamp-go library, we have no way of knowing if it will be successful and should be accepted.

Right, the returned value from this callback indicates the reject/accept status of the connection settings. I don't see how we can make this async without changing how the client is expected to handle OnOpampConnectionSettings. Also, we don't report the error as described in OnOpampConnectionSettings open-telemetry/opamp-spec#164.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should clearly state who is responsible for establishing a new connection. I could imagine this being fully implemented in the client library and only have OnOpampConnectionSettings responsible for validating the fields (e.g. endpoint matching a list of acceptable servers or certificate signed by acceptable authority). If this returns nil, the library would attempt to start a new client with the new settings, send an initial status message, and stop and transition to the new connection on success.

I think this was the original intent. However, I am not sure this is the best approach. It is also not how the example implementation works and I don't see good arguments against how the example is implemented today.

To summarize this is what the example is supposed to do:

  1. OnOpampConnectionSettings callback called when new settings are offered by the Server.
  2. The callback implementation first pre-verifies the settings (e.g. check certificates, etc).
  3. If checks in (2) pass the callback implementation logs that it is starting to reconnect, creates a reconnection goroutine and returns from the callback.
  4. Reconnection goroutine stops the Client, creates a new Client with new connection settings and wait for successful OnConnect() callback.
  5. If OnConnect() is not called within a predefined period of time reconnection goroutine assumes the new connection settings are bad, reverts to the old connection settings and re-creates the Client again.
  6. We get rid of OnOpampConnectionSettingsAccepted(), it is not needed.

The example implementation of steps 4 and 5 is not complete today (e.g not waiting for OnConnect), but can be modified to match what I described.

Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created an issue for this open-telemetry/opamp-go#261

Let's also discuss today in our call.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, open-telemetry/opamp-go#266 removes OnOpampConnectionSettingsAccepted callback.

Copy link
Member Author

@srikanthccv srikanthccv Apr 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the implementation to run in a separate goroutine and use OnConnect to determine the connection status.

Copy link
Contributor

@MovieStoreGuy MovieStoreGuy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have much to add here, I don't see any potential issues here and @tigrannajaryan has more context here.

ReportsOwnMetrics *bool `mapstructure:"reports_own_metrics"`
ReportsHealth *bool `mapstructure:"reports_health"`
ReportsRemoteConfig *bool `mapstructure:"reports_remote_config"`
AcceptsRemoteConfig *bool `mapstructure:"accepts_remote_config"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I had missed this in earlier, and isn't strictly relevant for the PR but I am curious, what does a pointer value add here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the zero-value for bool is false but the default value may be true, using a pointer instead allows us to use a nil pointer value as a way to represent when the user didn't pass a value in and a default should be used.


newServerConfig := &config.OpAMPServer{}

if settings.DestinationEndpoint != "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there any value in using some of the confmap work to merge these values together instead of needing to update each one explicitly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not very familiar with confmap work. I will take a look at it and If it helps here I can send a separate PR for it.

Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Mar 20, 2024
mx-psi pushed a commit that referenced this pull request Mar 25, 2024
**Description:** <Describe what has changed.>
- Linting Errors: 

[https://productionresultssa5.blob.core.windows.net/actions-results/44e4093a-f4c1-4e35-af4f-d630ea9b8d68/workflow-job-run-071cf28a-9760-544e-287f-c0f9ae5a03a4/logs/job/job-logs.txt?rsct=text%2Fplain&se=2024-03-25T12%3A45%3A21Z&sig=fRzhKwEpzDQDQZyUYwvEY%2BhkcFWOD3IY0EeCOE47AeU%3D&sp=r&spr=https&sr=b&st=2024-03-25T12%3A35%3A16Z&sv=2021-12-02](url)

- some linting errors will be fixed by 
#30237 


**Link to tracking Issue:** <Issue number if applicable>
- #31240 

**Testing:** <Describe what testing was performed and which tests were
added.>

**Documentation:** <Describe the documentation added.>
@github-actions github-actions bot removed the Stale label Mar 29, 2024
@MovieStoreGuy
Copy link
Contributor

Could I ask you to clean up the mod file?

@MovieStoreGuy MovieStoreGuy merged commit 4edca20 into open-telemetry:main Apr 9, 2024
169 of 170 checks passed
@github-actions github-actions bot added this to the next release milestone Apr 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants