receive: use async remote writing

Instead of spawning a new goroutine for each peer that we want to remote write to, spawn a fixed number of worker goroutines and schedule work on them. In our case this reduced the number of goroutines by about 10x-20x, and the p99 forwarding latency dropped from ~30s to just a few hundred milliseconds.

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
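
The pattern described in the commit message can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this commit: the names `workerPool`, `schedule`, `stop`, and `forwardAll` are hypothetical, the "forwarding" is simulated by recording the peer name, and the sketch assumes at least one worker.

```go
package main

import (
	"fmt"
	"sync"
)

// workerPool is a hypothetical, simplified stand-in for the forwarding
// workers described above; it is not the Thanos implementation.
type workerPool struct {
	tasks chan func()
	wg    sync.WaitGroup
}

// newWorkerPool starts exactly `workers` goroutines that drain the task queue.
// Assumption: workers >= 1, otherwise schedule would block forever.
func newWorkerPool(workers int) *workerPool {
	p := &workerPool{tasks: make(chan func())}
	p.wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer p.wg.Done()
			for task := range p.tasks {
				task()
			}
		}()
	}
	return p
}

// schedule hands a task to the pool, blocking until a worker is free.
func (p *workerPool) schedule(task func()) { p.tasks <- task }

// stop closes the queue and waits for all scheduled tasks to finish.
func (p *workerPool) stop() {
	close(p.tasks)
	p.wg.Wait()
}

// forwardAll simulates forwarding one request to each peer using a fixed
// number of workers instead of one goroutine per peer.
func forwardAll(peers []string, workers int) map[string]bool {
	var mu sync.Mutex
	forwarded := make(map[string]bool, len(peers))

	pool := newWorkerPool(workers)
	for _, peer := range peers {
		peer := peer // capture the loop variable for the closure
		pool.schedule(func() {
			mu.Lock()
			defer mu.Unlock()
			forwarded[peer] = true // placeholder for the real remote write
		})
	}
	pool.stop()
	return forwarded
}

func main() {
	peers := []string{"receiver-0", "receiver-1", "receiver-2"}
	// 5 mirrors the default of --receive.forward.async-workers in this diff.
	fmt.Println(len(forwardAll(peers, 5)))
}
```

With an unbuffered queue, as assumed here, the number of goroutines stays at the configured worker count however many peers are scheduled. The cost is that `schedule` blocks while every worker is busy.
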
thanos-io · Jan 24, 2024 · e4b6e99
1 parent 4a73fc3
commit e4b6e99

Showing 4 changed files with 228 additions and 76 deletions.
diff --git a/cmd/thanos/receive.go b/cmd/thanos/receive.go
@@ -831,6 +831,8 @@ type receiveConfig struct {
 	writeLimitsConfig       *extflag.PathOrContent
 	storeRateLimits         store.SeriesSelectLimits
 	limitsConfigReloadTimer time.Duration
+
+	asyncForwardWorkerCount uint
 }
 
 func (rc *receiveConfig) registerFlag(cmd extkingpin.FlagClause) {
@@ -888,6 +890,7 @@ func (rc *receiveConfig) registerFlag(cmd extkingpin.FlagClause) {
 
 	cmd.Flag("receive.replica-header", "HTTP header specifying the replica number of a write request.").Default(receive.DefaultReplicaHeader).StringVar(&rc.replicaHeader)
 
+	cmd.Flag("receive.forward.async-workers", "Number of concurrent workers processing forwarding of remote-write requests.").Default("5").UintVar(&rc.asyncForwardWorkerCount)
 	compressionOptions := strings.Join([]string{snappy.Name, compressionNone}, ", ")
 	cmd.Flag("receive.grpc-compression", "Compression algorithm to use for gRPC requests to other receivers. Must be one of: "+compressionOptions).Default(snappy.Name).EnumVar(&rc.compression, snappy.Name, compressionNone)
 

diff --git a/docs/components/receive.md b/docs/components/receive.md
@@ -248,6 +248,14 @@ NOTE:
 - Thanos Receive performs best-effort limiting. In case meta-monitoring is down/unreachable, Thanos Receive will not impose limits and only log errors for meta-monitoring being unreachable. Similarly to when one receiver cannot be scraped.
 - Support for different limit configuration for different tenants is planned for the future.
 
+## Asynchronous workers
+
+Instead of spawning a new goroutine each time the Receiver forwards a request to another node, it spawns a fixed number of goroutines (workers) that perform the work. This avoids spawning potentially tens or even hundreds of thousands of goroutines if someone starts sending a lot of small requests.
+
+The number of workers is controlled by `--receive.forward.async-workers=`.
+
+Check the metric `thanos_receive_forward_delay_seconds` to determine whether you need to increase the number of forwarding workers.
+
 ## Flags
 
 ```$ mdox-exec="thanos receive --help"
@@ -308,6 +316,9 @@ Flags:
       --receive.default-tenant-id="default-tenant"
                                  Default tenant ID to use when none is provided
                                  via a header.
+      --receive.forward.async-workers=5
+                                 Number of concurrent workers processing
+                                 forwarding of remote-write requests.
       --receive.grpc-compression=snappy
                                  Compression algorithm to use for gRPC requests
                                  to other receivers. Must be one of: snappy,