feat(apm): aggregate and flush data accordingly for DSM, Profiling, LLMObs, and Live Debugger #737
Conversation
I don't like how simple this aggregator is; it's just a wrapper around a queue. But I think for now we should respect the pattern we're following. Maybe we can create a trait and consolidate how we create aggregators.
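For what it's worth, a minimal sketch of what that consolidation could look like; the `RequestAggregator` trait and its methods are hypothetical, not something this PR adds:

```rust
use std::collections::VecDeque;

// Hypothetical trait to consolidate the aggregator pattern; these names
// do not exist in the PR, this only illustrates the suggestion.
pub trait RequestAggregator {
    type Item;

    /// Enqueue a single item for a later flush.
    fn add(&mut self, item: Self::Item);

    /// Drain everything that has been buffered so far.
    fn get_batch(&mut self) -> Vec<Self::Item>;
}

/// One shared queue-backed implementation that each product-specific
/// aggregator could reuse instead of re-wrapping its own queue.
pub struct QueueAggregator<T> {
    queue: VecDeque<T>,
}

impl<T> QueueAggregator<T> {
    pub fn new() -> Self {
        Self { queue: VecDeque::new() }
    }
}

impl<T> RequestAggregator for QueueAggregator<T> {
    type Item = T;

    fn add(&mut self, item: T) {
        self.queue.push_back(item);
    }

    fn get_batch(&mut self) -> Vec<T> {
        self.queue.drain(..).collect()
    }
}
```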
Pull Request Overview
This PR adds a buffering layer for APM proxy requests in the Trace Agent, introducing an in-memory aggregator and a flusher that batches and retries HTTP calls for DSM, Profiling, LLMObs, and Live Debugger.
- Introduce `proxy_aggregator` and `proxy_flusher` modules to buffer and periodically flush requests (sketched below).
- Update `TraceAgent` endpoints to enqueue proxy requests instead of forwarding them immediately.
- Adjust the `get_client` signature to take a reference and propagate that change to the related flushers.
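A rough sketch of the buffering shape this implies, assuming a simple queue-backed aggregator; the field and method names here are illustrative, not the PR's actual definitions:

```rust
use std::collections::VecDeque;

use bytes::Bytes;
use reqwest::header::HeaderMap;

// Assumed shape of a buffered proxy request; the real definitions in
// bottlecap/src/traces/proxy_aggregator.rs may differ.
pub struct ProxyRequest {
    pub headers: HeaderMap,
    pub body: Bytes,
    pub target_url: String,
}

#[derive(Default)]
pub struct Aggregator {
    queue: VecDeque<ProxyRequest>,
}

impl Aggregator {
    /// Buffer a proxied request instead of forwarding it immediately.
    pub fn add(&mut self, request: ProxyRequest) {
        self.queue.push_back(request);
    }

    /// Drain everything buffered so far for the flusher to send.
    pub fn get_batch(&mut self) -> Vec<ProxyRequest> {
        self.queue.drain(..).collect()
    }
}
```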
Reviewed Changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| bottlecap/src/traces/trace_agent.rs | Switched proxy endpoints to enqueue into `proxy_aggregator` |
| bottlecap/src/traces/proxy_aggregator.rs | Added `ProxyRequest` struct and `Aggregator` for request buffering |
| bottlecap/src/traces/proxy_flusher.rs | Added `Flusher` to batch, send, and retry buffered proxy requests |
| bottlecap/src/traces/mod.rs | Registered new modules and added `DD_ADDITIONAL_TAGS_HEADER` const |
| bottlecap/src/logs/flusher.rs | Updated `get_client` call to pass a reference |
| bottlecap/src/http.rs | Changed `get_client` and `build_client` to accept `&Arc<Config>` |
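The `&Arc<Config>` change presumably reads along these lines; this is a sketch, and the `Config` field and timeout handling are assumptions:

```rust
use std::{sync::Arc, time::Duration};

// Hypothetical stand-in for bottlecap's Config; only the field used
// below is sketched here.
pub struct Config {
    pub flush_timeout: u64,
}

// Taking `&Arc<Config>` lets callers lend their handle instead of
// cloning the Arc at every call site.
pub fn get_client(config: &Arc<Config>) -> reqwest::Client {
    build_client(config).unwrap_or_else(|_| reqwest::Client::new())
}

fn build_client(config: &Arc<Config>) -> Result<reqwest::Client, reqwest::Error> {
    reqwest::Client::builder()
        .timeout(Duration::from_secs(config.flush_timeout))
        .build()
}
```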
Comments suppressed due to low confidence (1)
bottlecap/src/traces/proxy_flusher.rs:26
- [nitpick] There are no unit tests for the new `Flusher` behavior. Please add tests covering header merging, batch flushing, retry logic, and failure scenarios.
pub struct Flusher {
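A sketch of one such test, assuming the queue-backed `Aggregator` and `ProxyRequest` shapes sketched earlier; the assertions are illustrative, not the PR's actual tests:

```rust
// Sketch of a unit test for the buffering side; not taken from the PR.
#[cfg(test)]
mod tests {
    use super::*;
    use reqwest::header::HeaderMap;

    #[test]
    fn get_batch_drains_buffered_requests() {
        let mut aggregator = Aggregator::default();
        aggregator.add(ProxyRequest {
            headers: HeaderMap::new(),
            body: bytes::Bytes::from_static(b"payload"),
            target_url: "https://intake.example.com".to_string(),
        });

        let batch = aggregator.get_batch();
        assert_eq!(batch.len(), 1);
        // A second drain finds nothing left in the queue.
        assert!(aggregator.get_batch().is_empty());
    }
}
```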
LGTM. Feel free to request additional review from others if needed.
pub async fn flush(
    &self,
    retry_requests: Option<Vec<reqwest::RequestBuilder>>,
For my learning: is there a limit on the number of retries? I assumed the retry count in `send()` is a different thing.
The retry count in the `send` method is mainly the number of times we try to resend a payload during a single flush.
I think there's no limit on the number of failed payloads we try to retry; this is mainly because we could hit a network error, which we need to handle.
For 400/500 errors we don't retry. Maybe we should at some point, but we want to ensure we don't time out the Lambda or fill the buffer with data that will never be accepted.
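Put as code, the behavior described above might look roughly like this; `MAX_RETRIES` and the shape of `send` here are assumptions, not the PR's actual implementation:

```rust
// Hypothetical sketch of the retry split: network errors are retried
// within a flush and then handed back for the next flush cycle, while
// HTTP 4xx/5xx responses are dropped rather than retried.
const MAX_RETRIES: u32 = 3;

async fn send(request: reqwest::RequestBuilder) -> Option<reqwest::RequestBuilder> {
    for attempt in 0..MAX_RETRIES {
        let Some(cloned) = request.try_clone() else {
            return None; // streaming bodies cannot be retried
        };
        match cloned.send().await {
            // 4xx/5xx: the payload will never be accepted, so drop it
            // instead of filling the buffer or timing out the Lambda.
            Ok(response)
                if response.status().is_client_error()
                    || response.status().is_server_error() =>
            {
                return None;
            }
            // Success: nothing to re-queue.
            Ok(_) => return None,
            // Network error: retry within this flush...
            Err(_) if attempt + 1 < MAX_RETRIES => continue,
            // ...and once attempts are exhausted, hand the request back
            // so the next flush picks it up via `retry_requests`.
            Err(_) => return Some(request),
        }
    }
    None
}
```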
No need to pass a whole clone here.
only acknowledges data and lets the flusher take care of the rest
Force-pushed from 4c1162d to 07aa341.
What?
Creates an aggregator and flusher for APM proxy data.
Motivation
With the introduction of #684, the Extension is much faster to continue invocations. I think this could have affected how data is aggregated for these various APM products, where we were sending each request to the intake as soon as it arrived.
Now, we aggregate the data and respect the existing flushing pattern and algorithm.
Testing