-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch the GRPC communication where Agent is running the server and the beats are connecting back to Agent #18973
Conversation
Pinging @elastic/ingest-management (Team:Ingest Management) |
❕ Build Aborted
Expand to view the summary
Build stats
Log outputExpand to view the last 100 lines of log output
|
754b75f
to
5194868
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change looks good with few questions here and there. i'm going to test it before approval
func (*ApplicationStatusHandler) OnStatusChange(state *server.ApplicationState, status proto.StateObserved_Status, msg string) { | ||
app, ok := state.App().(*Application) | ||
if !ok { | ||
panic(errors.New("only *Application can be registered when using the ApplicationStatusHandler", errors.TypeUnexpected)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to crash agent here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we should, because using this handler means that only *Application
is registered in the server and if not then we did something wrong in the code
I would rather make it breaking all of agent so we can fix, then allowing it to hide the issue and be harder to debug in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one thing i ran into while testing.
the other thing is when i used standalone and enabled some metrics, then made a config change it got stuck at start
2020-06-05T14:41:12+02:00 DEBUG operator.go:217 running operation 'operation-start' for metricbeat.8.0.0
2020-06-05T14:41:12+02:00 INFO reporter.go:51 2020-06-05T14:41:12+02:00: type: 'STATE': sub_type: 'STARTING' message: Application: metricbeat[d7ba5632-1c93-4ee1-b231-baeaecb94172]: State change: STARTING
expected would be that check is made and start is skipped
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Skimming through the code, that part I'm most concerned is that we might get into some race conditions because of the many locks and unlocks we needed.
Not necessarly as part of this PR but it would be great to have some tests where many fake clients are started / stopped randomly with different error states to see if a race condition is found or things break.
@blakerouse i was testing this today and it got stuck again when modifying |
@michalpristas I fixed the issues pointed out by @ruflin in my latest commit. I was under the impression that Can you share your agent config so I can test what issue your hitting? |
i'm using default configuration for standalone and changing it on the fly without restarting agent, like changing set of metrics and datasources, enabling disabling monitoring either as a whole or logs/metrics combinations, mostly just commenting uncommenting pieces of default configuration |
@michalpristas I have fixed the issue with deadlocks. I have noticed that disabling monitoring takes the I am going to look into that in a follow up branch, being I think its an issue in metricbeat/filebeat and not in the Agent side. |
started with this config (using default output) datasources:
- namespace: default
use_output: default
inputs:
- type: system/metrics
streams:
- metricset: cpu
dataset: system.cpu started ok beats were running datasources:
- namespace: default
use_output: default
inputs:
- type: system/metrics
streams:
- metricset: cpu
dataset: system.cpu
- metricset: memory
dataset: system.memory
- metricset: network
dataset: system.network
- metricset: filesystem
dataset: system.filesystem failed to apply and beat started crashing
it tried to start another metricbeat to update configuration and then metricbeat failed on
actions resolution does not work properly i guess |
…orting or crashes.
0ec3417
to
5b0c598
Compare
@michalpristas Found the issue and it is now fixed, thanks for the detailed testing! |
one more thing to fix and we're good to go i guess. when going from default config which spins up 1 filebeat and 2 metricbeats (monitoring enabled) settings.monitoring:
enabled: true
logs: true
metrics: false they are correctly stopped.
this is because application is still marked as started (due to latest fix) so Start is skipped |
@michalpristas Believe it's fixed now, but at this rate, I won't be surprised if not. Seems that we need to add even more unit tests, especially to cover the side-car flow as that doesn't seem to have any coverage. |
yes i'm looking forward to integration tests as unit testing this is just for a peace of mind. these parts are even platfrom specific and OS dependent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tested ok
…he beats are connecting back to Agent (elastic#18973) * Update libbeat fleet management to use elastic-agent-client. * Work on switching to the new GRPC. * More on refactor * Add back the state. * Format and check * Add changelog. * Cleanup. * mage fmt in x-pack/libbeat. * Update go.mod. * Fix from review. * Fix NewFromConfig comment * Update go.mod * Fix imports. * Fix some locking issues from review. * Fix lots of issues, add unit testing to cover restarts on failure reporting or crashes. * Update go.sum * Fix TestConfigurableRun. * Fix range over registered apps in GRPC server using RWLock, switch to sync.Map. * Fix config. * Fix tests to work on Windows. * Fix operation start to not start the same application twice. * Fix enabling and disabling of monitoring. (cherry picked from commit 0c15394)
…he beats are connecting back to Agent (#18973) (#19108) * Update libbeat fleet management to use elastic-agent-client. * Work on switching to the new GRPC. * More on refactor * Add back the state. * Format and check * Add changelog. * Cleanup. * mage fmt in x-pack/libbeat. * Update go.mod. * Fix from review. * Fix NewFromConfig comment * Update go.mod * Fix imports. * Fix some locking issues from review. * Fix lots of issues, add unit testing to cover restarts on failure reporting or crashes. * Update go.sum * Fix TestConfigurableRun. * Fix range over registered apps in GRPC server using RWLock, switch to sync.Map. * Fix config. * Fix tests to work on Windows. * Fix operation start to not start the same application twice. * Fix enabling and disabling of monitoring. (cherry picked from commit 0c15394)
…ngs-archive * upstream/master: (119 commits) Update filebeat input docs (elastic#19110) Add ECS fields from log pipeline of PostgreSQL (elastic#19127) Init package libbeat/statestore (elastic#19117) [Ingest Manager] Retryable downloads of beats (elastic#19102) [DOCS] Add output.console to Functionbeat doc and Functionbeat reference file (elastic#18965) Add compatibility info (elastic#18929) Set ecszap version to v0.2.0 (elastic#19106) [filebeat][httpjson] Fix unit test function call (elastic#19124) [Filebeat][httpjson] Adds oauth2 support for httpjson input (elastic#18892) Allow host.* fields to be disabled in Suricata module (elastic#19107) Make selector string casing configurable (elastic#18854) Switch the GRPC communication where Agent is running the server and the beats are connecting back to Agent (elastic#18973) Disable host.* fields by default for netflow module (elastic#19087) Automatically fill zube teams on backports if available (elastic#18924) Fix crash on vsphere module (elastic#19078) [Ingest Manager] Download snapshot artifacts from snapshots repo (elastic#18685) [Ingest Manager] Basic Elastic Agent documentation (elastic#19030) Make user.id a string in system/users, in line with ECS (elastic#19019) [docs] Add 7.8 release highlights placeholder file (elastic#18493) Fix translate_sid's empty target field handling (elastic#18991) ...
…he beats are connecting back to Agent (elastic#18973) * Update libbeat fleet management to use elastic-agent-client. * Work on switching to the new GRPC. * More on refactor * Add back the state. * Format and check * Add changelog. * Cleanup. * mage fmt in x-pack/libbeat. * Update go.mod. * Fix from review. * Fix NewFromConfig comment * Update go.mod * Fix imports. * Fix some locking issues from review. * Fix lots of issues, add unit testing to cover restarts on failure reporting or crashes. * Update go.sum * Fix TestConfigurableRun. * Fix range over registered apps in GRPC server using RWLock, switch to sync.Map. * Fix config. * Fix tests to work on Windows. * Fix operation start to not start the same application twice. * Fix enabling and disabling of monitoring.
What does this PR do?
This completes the switch of the GRPC protocol so that the beat application will connect to Agent using the GRPC server running inside the Agent process. This uses the new protocol that allows bi-directional communication for both configuration, status, and actions.
This also completely removes the old GRPC code and protocol files, which are no longer used or needed.
Why is it important?
To only require 1 listening port on the host machine running the GRPC. To use the new
elastic-agent-client
code to make it easier for new applications to be controlled by Agent.Checklist
I have made corresponding changes to the documentationCHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.Author's Checklist
How to test this PR locally
Related issues
Use cases
Screenshots
Logs