Skip to content

feat(apm): aggregate and flush data accordingly for DSM, Profiling, LLMObs, and Live Debugger #737

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

duncanista
Copy link
Contributor

@duncanista duncanista commented Jul 10, 2025

What?

Creates an aggregator and flusher for APM proxy data.

Motivation

With the introduction of #684, the Extension is way faster to continue invocations, I think this could have affected how data is aggregated for this various APM products in which we were just sending the request to intake as soon as it arrives.

Now, we aggregate it and respect the flushing pattern and algorithm.

Testing

  1. Profiling:
Screenshot 2025-07-11 at 11 13 55 AM
  1. LLMObs:
  2. DSM:
  3. Live Debugger:
image

@duncanista duncanista requested a review from a team as a code owner July 10, 2025 15:13
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't like how simple this aggregator is, its just a wrapper for a queue, but I think so far we should respect the way we are following this pattern, maybe we can create a trait and consolidate how we create aggregators

@duncanista duncanista requested a review from Copilot July 11, 2025 18:17
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds a buffering layer for APM proxy requests in the Trace Agent, introducing an in-memory aggregator and a flusher that batches and retries HTTP calls for DSM, Profiling, LLMObs, and Live Debugger.

  • Introduce proxy_aggregator and proxy_flusher modules to buffer and periodically flush requests.
  • Update TraceAgent endpoints to enqueue proxy requests instead of forwarding immediately.
  • Adjust get_client signature to take a reference and propagate that change to related flushers.

Reviewed Changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
bottlecap/src/traces/trace_agent.rs Switched proxy endpoints to enqueue into proxy_aggregator
bottlecap/src/traces/proxy_aggregator.rs Added ProxyRequest struct and Aggregator for request buffering
bottlecap/src/traces/proxy_flusher.rs Added Flusher to batch, send, and retry buffered proxy requests
bottlecap/src/traces/mod.rs Registered new modules and added DD_ADDITIONAL_TAGS_HEADER const
bottlecap/src/logs/flusher.rs Updated get_client call to pass a reference
bottlecap/src/http.rs Changed get_client and build_client to accept &Arc<Config>
Comments suppressed due to low confidence (1)

bottlecap/src/traces/proxy_flusher.rs:26

  • [nitpick] There are no unit tests for the new Flusher behavior. Please add tests covering header merging, batch flushing, retry logic, and failure scenarios.
pub struct Flusher {

Copy link
Contributor

@lym953 lym953 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Feel free to request additional review from others if needed.


pub async fn flush(
&self,
retry_requests: Option<Vec<reqwest::RequestBuilder>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my learning: is there a limit on the number of retries? I assumed the retry count in send() is a different things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retry count in the send method is mainly for the amount of times we try to retry a payload during a flush.

I think there's no limit on the number of failed payloads we try to retry – this is mainly bc we could have a network error which we need to handle.

For 400/500 errors we don't retry, maybe we should at some point, but we want to ensure we don't timeout the lambda or fill the buffer with data that will never arrive

@duncanista duncanista force-pushed the jordan.gonzalez/trace-agent/aggregation-for-proxy branch from 4c1162d to 07aa341 Compare July 14, 2025 19:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants