Remove mmap usage in binary index header reader #6070
Comments
I would like to work on this issue.
My 2c: this is only theory; I personally have never seen us being impacted by this, but I also never benchmarked concurrency latency like this. What I'm saying is: if the new solution is both slower and uses more RAM, then it might not be a good change. Still worth experimenting on it with learnings from @pracucci (: BTW, RAM will appear higher (a lot!), as mmap is not tracked by go_mem metrics nor by the typical cAdvisor RSS metric (it should be tracked by container working set bytes, though).
Running Mimir at Grafana Labs, this was also our experience. We suffered from this issue for more than a year, and the bigger the scale, the worse the situation got. Removing mmap fixed it. At some point we were able to prove it by slowing down the filesystem using FUSE. However, it took a huge effort just to prove it.
Surprise, surprise... it's not. We're now running store-gateway without mmap in all production clusters and, no, store-gateway memory is not higher and store-gateway latency is not higher. Just to be clear, removing mmap doesn't mean you load the whole index-header into memory. We rewrote the Symbols reader, DecBuf & co. in a streaming way. We never read everything into memory; we just read batch-by-batch, reusing buffers and pooling file descriptors, so if you do it properly, memory is not higher.
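A minimal sketch in Go of the streaming idea described above, i.e. reading a file region batch-by-batch through a single reused buffer instead of mapping the whole file; the names and types here are illustrative, not Mimir's actual readers:

```go
package indexheader

import (
	"io"
	"os"
)

// fileReader streams a region of the index-header file through a small,
// reused buffer instead of mmap'ing the whole file. Illustrative sketch only.
type fileReader struct {
	f   *os.File
	buf []byte // reused read buffer (one "batch")
	pos int64
	end int64
}

func newFileReader(f *os.File, start, length int64, batchSize int) *fileReader {
	return &fileReader{
		f:   f,
		buf: make([]byte, batchSize),
		pos: start,
		end: start + length,
	}
}

// next returns the next batch of at most len(r.buf) bytes. ReadAt issues a
// pread(2), so only this goroutine blocks while the data is fetched from disk.
func (r *fileReader) next() ([]byte, error) {
	if r.pos >= r.end {
		return nil, io.EOF
	}
	n := int64(len(r.buf))
	if remaining := r.end - r.pos; remaining < n {
		n = remaining
	}
	read, err := r.f.ReadAt(r.buf[:n], r.pos)
	if read == 0 && err != nil {
		return nil, err
	}
	r.pos += int64(read)
	return r.buf[:read], nil
}
```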
We saw huge improvements in compactor time just by giving it way more memory (even though the compactor process was using only 25% of the available memory), allowing the OS to cache more files and so reducing those pause times (when fetching mapped data from disk). We could also see big improvements in compaction time if we mlock the symbols table. We never saw the blocked time on the store gateway (maybe we just did not look deep enough), but we definitely saw these problems in other places in the TSDB code. I think using buffer pools can improve the blocked times and also give more control over the cache eviction policy.
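As a rough illustration of the buffer-pool idea mentioned above (a sketch only, not code from the compactor or store-gateway; the pool size is an arbitrary example):

```go
package indexheader

import "sync"

// readBufPool reuses fixed-size read buffers across requests so that
// hot-path reads don't allocate on every call; 32 KiB is an example size.
var readBufPool = sync.Pool{
	New: func() interface{} { return make([]byte, 32*1024) },
}

// withReadBuffer borrows a buffer from the pool, runs fn with it, and
// returns it to the pool afterwards.
func withReadBuffer(fn func(buf []byte) error) error {
	buf := readBufPool.Get().([]byte)
	defer readBufPool.Put(buf)
	return fn(buf)
}
```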
If the Thanos community is interested in it, we're open to upstreaming our no-mmap readers back to Prometheus.
I think that upstreaming is a good idea.
The problem is that Thanos Store reads binary index-headers using mmap. When a mapped page has to be faulted in from disk, the OS thread blocks outside of the Go runtime's control, so the goroutine scheduler cannot run anything else on that thread. Most of our friends, like Mimir and VictoriaMetrics, have already switched to good old seek/read, so let's do that as well: with regular read/seek syscalls, the goroutine scheduler can schedule other goroutines while one is blocked on I/O.
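A minimal sketch of what the seek/read approach looks like, assuming a hypothetical helper that reads a 4-byte length field out of the index-header (not the actual Thanos reader code):

```go
package indexheader

import (
	"encoding/binary"
	"os"
)

// readTOCLen is a hypothetical example of the seek/read approach proposed
// here: instead of indexing into an mmap'd byte slice (where a page fault
// blocks the OS thread invisibly to the Go scheduler), it uses ReadAt,
// a regular blocking syscall that the runtime can park the goroutine on
// while other goroutines keep running.
func readTOCLen(path string, offset int64) (uint32, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	var buf [4]byte
	if _, err := f.ReadAt(buf[:], offset); err != nil {
		return 0, err
	}
	return binary.BigEndian.Uint32(buf[:]), nil
}
```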