This repository has been archived by the owner on Apr 2, 2024. It is now read-only.

Improvements to batching #1739

Draft · cevian wants to merge 6 commits into master from batch_reso
Conversation

@cevian (Contributor) commented Nov 3, 2022

Description

Merge requirements

Please take into account the following non-code changes that you may need to make with your PR:

  • CHANGELOG entry for user-facing changes
  • Updated the relevant documentation

This change adds a new reservation system for metric batching.
The reservation system forms a queue of metric batchers for
copiers to pick up. The queue is ordered by the time a request
hits Promscale; that ordering lets copiers combine requests
together more effectively.

This solves a few problems:
1. Previously, reservations were ordered by a channel, i.e. by when each
batcher first made its reservation. While that may correlate with the time
the request hit the connector, it is a less exact proxy for the ordering we
actually want. This change solves that by ordering explicitly on request
start time.

2. It allows peeking at the info associated with the first reservation,
which is not possible with a channel-based approach. Peeking is used here
to ensure that we don't stop batching until a minimum duration has passed
since the oldest request in the batch hit the system. That acts as an
alternative backpressure mechanism to the regular one of copiers becoming
available. In most cases it shouldn't be necessary, because of the sharding
adjustment Prometheus does, but it can be useful in some cases. Note that
the minimum latency is measured from the time the request hits the connector
and NOT from the time spent in the batcher; this better controls the latency
penalty of this alternate backpressure mechanism.

Peeking can also be used for more advanced batch-sizing logic. For example,
we may want to batch based on the number of samples; without peeking, you'd
have to pop the queue before learning the number of samples, at which point
it is too late. Peeking should also give us better observability.
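
Below is a minimal Go sketch of the reservation-queue idea described above. It is not the actual Promscale implementation; the names `reservation`, `reservationQueue`, and `MinBatchAge` are made up for illustration. It shows a min-heap keyed on request start time and how Peek enables the minimum-duration check.

```go
// Sketch only: a reservation queue ordered by the time the originating
// request hit the connector. All names here are illustrative.
package main

import (
	"container/heap"
	"fmt"
	"time"
)

// reservation represents one metric batcher waiting for a copier.
type reservation struct {
	metric       string
	requestStart time.Time // when the oldest request feeding this batcher arrived
}

// reservationQueue is a min-heap keyed on requestStart, so copiers always
// see the reservation whose request has been waiting the longest.
type reservationQueue []*reservation

func (q reservationQueue) Len() int            { return len(q) }
func (q reservationQueue) Less(i, j int) bool  { return q[i].requestStart.Before(q[j].requestStart) }
func (q reservationQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *reservationQueue) Push(x interface{}) { *q = append(*q, x.(*reservation)) }
func (q *reservationQueue) Pop() interface{} {
	old := *q
	n := len(old)
	r := old[n-1]
	*q = old[:n-1]
	return r
}

// Peek returns the oldest reservation without removing it (not possible with
// a plain channel), which is what enables the min-duration check below.
func (q reservationQueue) Peek() *reservation { return q[0] }

// MinBatchAge is a hypothetical minimum batching duration.
const MinBatchAge = 50 * time.Millisecond

func main() {
	q := &reservationQueue{}
	heap.Init(q)
	heap.Push(q, &reservation{metric: "http_requests_total", requestStart: time.Now().Add(-20 * time.Millisecond)})
	heap.Push(q, &reservation{metric: "go_goroutines", requestStart: time.Now().Add(-80 * time.Millisecond)})

	oldest := q.Peek()
	// Keep batching until the oldest request has been in the system for MinBatchAge.
	if time.Since(oldest.requestStart) >= MinBatchAge {
		r := heap.Pop(q).(*reservation)
		fmt.Println("flushing", r.metric)
	} else {
		fmt.Println("keep batching; oldest request is only", time.Since(oldest.requestStart), "old")
	}
}
```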
Additional changes in this PR:
- Significantly increase the number of samples batched per metric. Heavy
  metrics are very likely to come from many requests, so flushing them
  sooner is better; having to wait multiple times to flush is no bueno.
- To prevent copy batches from getting too large in the face of potentially
  more samples per metric, change the batch criteria to limit not just the
  number of metrics per batch but also the number of samples (see the sketch
  after this list).
- Change the preference from "try to flush before batching" to "try to batch
  before flushing". This should increase batch sizes.
- For two metrics with the same initial request time, flush the bigger one
  first, hoping the smaller one can grow before it is flushed.
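
As a rough illustration of the dual limit and the tie-breaking order described above (the names `pending`, `maxMetricsPerBatch`, and `maxSamplesPerBatch` are assumptions, not the real constants), a copier could assemble a batch roughly like this:

```go
// Sketch only: build a copy batch limited by both metric count and sample
// count, flushing the bigger of two equally old reservations first.
package main

import (
	"fmt"
	"sort"
	"time"
)

type pending struct {
	metric       string
	requestStart time.Time
	samples      int
}

const (
	maxMetricsPerBatch = 4    // hypothetical limit on metrics per copy batch
	maxSamplesPerBatch = 1000 // hypothetical limit on samples per copy batch
)

func buildBatch(ready []pending) []pending {
	// Order by request start time; for equal times, bigger reservations first,
	// so the smaller ones get a chance to grow before they are flushed.
	sort.Slice(ready, func(i, j int) bool {
		if ready[i].requestStart.Equal(ready[j].requestStart) {
			return ready[i].samples > ready[j].samples
		}
		return ready[i].requestStart.Before(ready[j].requestStart)
	})

	var batch []pending
	samples := 0
	for _, r := range ready {
		if len(batch) >= maxMetricsPerBatch || samples+r.samples > maxSamplesPerBatch {
			break // the batch is full by one of the two criteria
		}
		batch = append(batch, r)
		samples += r.samples
	}
	return batch
}

func main() {
	t := time.Now()
	ready := []pending{
		{"metric_a", t.Add(-time.Millisecond), 300},
		{"metric_b", t, 500},
		{"metric_c", t, 400},
	}
	// metric_a is oldest; metric_b and metric_c tie, so the bigger one goes
	// first. The sample limit then keeps metric_c out of this batch.
	for _, r := range buildBatch(ready) {
		fmt.Println("batched", r.metric, r.samples, "samples")
	}
}
```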
Previously, when a new set of requests came in, we often
flushed a very small first batch because we flushed as soon
as any data became available via Peek. Since small batches
are inefficient, we now fix this by waiting until the entire
first request is batched before flushing (with a timeout).
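
A minimal sketch of that waiting behavior, assuming a hypothetical `firstRequestDone` signal and `flushTimeout` constant (not the actual code):

```go
// Sketch only: delay the first flush until the whole first request has been
// batched, or until a timeout expires, whichever happens first.
package main

import (
	"fmt"
	"time"
)

// flushTimeout is a hypothetical upper bound on how long we wait.
const flushTimeout = 250 * time.Millisecond

// waitForFirstRequest blocks until the first request has been fully batched
// (signalled by closing firstRequestDone) or until the timeout elapses.
func waitForFirstRequest(firstRequestDone <-chan struct{}) {
	select {
	case <-firstRequestDone:
		fmt.Println("first request fully batched; flushing a full batch")
	case <-time.After(flushTimeout):
		fmt.Println("timeout reached; flushing whatever has been batched so far")
	}
}

func main() {
	done := make(chan struct{})
	go func() {
		time.Sleep(50 * time.Millisecond) // simulate batching the first request
		close(done)
	}()
	waitForFirstRequest(done)
}
```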
cevian force-pushed the batch_reso branch 2 times, most recently from 05df804 to 2c36b75 on December 13, 2022