Description
What
Move the concurrency model for flushing metrics from per-flush to per-batch.
The expected architecture is one goroutine doing the following operations:
- Fetch the buckets from the buckets queue
- Split time series into batches
- Encode each batch as protobuf
- Enqueue the batch as a job to be pushed to the remote service
And a series of concurrent goroutines doing the following operations:
- Fetch a job
- Invoke the `metricsClient.push` operation
Why
We have seen suboptimal handling in tests with a lot of active time series (>100k). The flush operation splits them into batches and then pushes them sequentially; the math below shows why a single flush operation can take more than 10 seconds.
Example
- 100k active time series
- 1k time series per batch
- that gives 100 batches

If networking is not perfect (e.g. 100 ms per request), we end up with a total of 10 seconds for flushing a single iteration of 100k active series (100 batches * 100 ms), and it can grow even further in worse cases.