
Allow chunks that are larger than the maximum chunk size in index construction #303

Open
jorisdral opened this issue Jul 21, 2024 · 1 comment

Comments

@jorisdral
Collaborator

When incrementally constructing a compact/ordinary index and the max chunk size is exceeded, we currently might split the output into multiple chunks if the serialised form of the index has a size that is a multiple of the max chunk size. It might be worth just returning a larger chunk in that case (maybe still a multiple of the max chunk size), since we don't rely on chunks being a specific size anywhere in the RunBuilder/RunAcc code.

@jorisdral jorisdral changed the title Do not split up incremental serialisation output into multiple Chunks Allow Chunks that are large than the max chunk size in index construction Jul 21, 2024
@jeltsch jeltsch changed the title Allow Chunks that are large than the max chunk size in index construction Allow chunks that are larger than the maximum chunk size in index construction Jul 30, 2024
@jeltsch
Collaborator

jeltsch commented Jul 30, 2024

As we agreed in our project meeting, this indeed seems to be the way to go. Concretely, we concluded in the meeting that, whenever the size of the buffered serialized data exceeds a certain threshold, all available data should be output in the form of a single chunk. In particular, this means the following:

  • Chunk sizes don’t have to be multiples of the threshold.
  • The threshold is not a maximum chunk size anymore but rather a minimum chunk size.
  • There is no maximum chunk size.

The last point is justified because, in practice, chunks will still not become so large that the work of writing the index is no longer appropriately spread over time. This holds in particular because serialized keys aren’t expected to be large.

With this new approach, the output of appending to an index shouldn’t have type [Chunk] anymore but rather type Maybe Chunk, as there can be at most one chunk per append.
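To make the shape of this concrete, here is a minimal sketch of threshold-based chunk generation. All names in it (`Buffer`, `emptyBuffer`, `append`, `finish`, and this stand-in `Chunk` type) are hypothetical and chosen for illustration only; they are not the existing lsm-tree API. The sketch assumes that a chunk is emitted as soon as the buffered size reaches the threshold (`>=`), and that any remainder is flushed at the end of construction, possibly as a chunk smaller than the threshold.

```haskell
import qualified Data.ByteString as BS

-- Hypothetical stand-in for the real chunk type.
newtype Chunk = Chunk BS.ByteString
  deriving (Eq, Show)

-- | Serialized data that has been produced but not output yet.
data Buffer = Buffer
  { minChunkSize :: !Int            -- ^ the threshold, now a minimum chunk size
  , buffered     :: !BS.ByteString  -- ^ data waiting to be output
  }

emptyBuffer :: Int -> Buffer
emptyBuffer n = Buffer n BS.empty

-- | Append serialized data. Once the buffered size reaches the threshold,
-- everything buffered is output as one chunk, however large it is.
-- Hence at most one chunk per append: 'Maybe Chunk', not '[Chunk]'.
append :: BS.ByteString -> Buffer -> (Maybe Chunk, Buffer)
append bytes (Buffer n buf)
  | BS.length buf' >= n = (Just (Chunk buf'), Buffer n BS.empty)
  | otherwise           = (Nothing, Buffer n buf')
  where
    buf' = buf <> bytes

-- | Flush the remainder at the end of construction (assumption: this
-- final chunk is allowed to be smaller than the threshold).
finish :: Buffer -> Maybe Chunk
finish (Buffer _ buf)
  | BS.null buf = Nothing
  | otherwise   = Just (Chunk buf)
```

Note that `append` never splits its output: a single large append yields one chunk whose size is not a multiple of the threshold and has no upper bound, which matches the three points above.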

I will implement this new approach of chunk generation already as part of #296 and #299. For the compact index, it remains to be implemented (potentially by using the general-purpose chunk handling to be added by #296).
