Description
Opened on Mar 11, 2022
Is your feature request related to a problem? Please describe.
Too many small chunks in S3. This cannot be solved by further increasing the idle timeout, because that setting results in a huge memory increase.
With some queries needing to fetch 90,000 chunks (50-100 big chunks, 89,900+ smaller chunks), these smaller chunks can be the bottleneck for many queries. Quite often these smaller chunks exist because their source has infrequent bursts of activity. It would be far more ideal if <1,000 well-sized chunks (still enough to parallelize over multiple cores, and closer to the number of streams) were queried instead.
Describe the solution you'd like
A utility similar to the compactor (or built into it?) that is able to create new chunks by merging small chunks (i.e. <10 KB, which is 95%+ of our dataset) that had been flushed due to an idle period but later received matching data.
Fetching these chunks is particularly expensive, and most of the query time is spent downloading chunks. Merging might also improve compression ratios (if blocks are rebuilt).
Placing this in the compactor might be a good idea, since the index is already being updated at that time.
This compactor should get a setting like `sync_period` to bound the merge search. For most people this should be the same value as the indexer's `sync_period`. Chunk max size would still need to be honoured, of course: merging may produce several larger chunks, not just one chunk.
Something like:

```
if chunk size < min_threshold:
    for each chunk in index that also matches labels:
        merge into new chunk
        if new chunk size > max_merge_amount or in new sync_period:
            replace with new chunk
```
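The merge loop above could be sketched in Go roughly as follows. Note this is a toy illustration: `Chunk`, `mergeSmallChunks`, and the threshold parameters are hypothetical names, not Loki's actual types or config options.

```go
package main

import "fmt"

// Chunk is a minimal stand-in for a stored chunk (hypothetical, not Loki's type).
type Chunk struct {
	Labels string // canonical label set, used as the merge key
	Size   int    // bytes
}

// mergeSmallChunks merges chunks below minThreshold that share a label set,
// flushing a merged chunk whenever adding more would exceed maxMergeSize.
func mergeSmallChunks(chunks []Chunk, minThreshold, maxMergeSize int) []Chunk {
	var out []Chunk
	pending := map[string]Chunk{} // in-progress merge per label set

	for _, c := range chunks {
		if c.Size >= minThreshold {
			out = append(out, c) // big enough already; keep as-is
			continue
		}
		p := pending[c.Labels]
		if p.Size > 0 && p.Size+c.Size > maxMergeSize {
			out = append(out, p) // flush before exceeding the cap
			p = Chunk{}
		}
		p.Labels = c.Labels
		p.Size += c.Size
		pending[c.Labels] = p
	}
	for _, p := range pending { // flush whatever is still accumulating
		if p.Size > 0 {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	chunks := []Chunk{{"a", 5}, {"a", 5}, {"a", 5}, {"b", 200}}
	for _, c := range mergeSmallChunks(chunks, 10, 12) {
		fmt.Println(c.Labels, c.Size)
	}
}
```

A real implementation would of course merge chunk contents and rebuild blocks, not just sum sizes, and would restrict the search to the bounding `sync_period`.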
New chunks should be entirely new (new ID), and old chunks removed `index_cache_validity` after the index containing only the new chunks is updated (to prevent cached indexes from accessing the now non-existent chunks).
If the chunk compactor exits uncleanly (or hits a similar issue), unreferenced chunks may end up in the chunk store. AFAIK this is already possible regardless and is probably a separate matter.
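The replacement ordering described above (upload new, repoint index, wait out cached indexes, then delete old) can be sketched with a toy in-memory store; the `Store` shape and function names are assumptions for illustration, not Loki's API:

```go
package main

import (
	"fmt"
	"time"
)

// Store is a toy stand-in for the object store plus index.
type Store struct {
	Chunks map[string]bool // objects in S3
	Index  map[string]bool // chunk IDs the index references
}

// replaceChunks uploads the merged chunks, rewrites the index to reference
// only them, then waits out the index-cache validity window before deleting
// the old objects, so a cached index never dereferences a missing chunk.
func replaceChunks(s *Store, oldIDs, newIDs []string, cacheValidity time.Duration) {
	for _, id := range newIDs { // 1. upload merged chunks first
		s.Chunks[id] = true
	}
	for _, id := range oldIDs { // 2. drop old IDs from the index...
		delete(s.Index, id)
	}
	for _, id := range newIDs { //    ...and reference only the new ones
		s.Index[id] = true
	}
	time.Sleep(cacheValidity) // 3. let cached indexes expire
	for _, id := range oldIDs { // 4. only now remove the old objects
		delete(s.Chunks, id)
	}
}

func main() {
	s := &Store{
		Chunks: map[string]bool{"old1": true, "old2": true},
		Index:  map[string]bool{"old1": true, "old2": true},
	}
	replaceChunks(s, []string{"old1", "old2"}, []string{"new1"}, time.Millisecond)
	fmt.Println(len(s.Chunks), len(s.Index)) // only new1 remains in both
}
```

If the process dies between steps 1 and 4, the worst case is an unreferenced chunk in the store (the failure mode noted above), never an index entry pointing at a deleted chunk.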
Describe alternatives you've considered
Increasing `chunk_idle_period` (currently 6m) further. 10m was tested, but it resulted in too much memory being consumed.