rfc(decision): Batch multiple files together into single large file to improve network throughput #98
While the cost is mostly due to writes, is this cost problematic at scale? Is it a blocker for reaching higher scale, or are we simply looking for a more efficient option?
I do not believe cost will prevent us from reaching greater scale. Write costs scale linearly. Pricing does not scale quite linearly, but if you're happy with how much GCS costs at low levels of demand, you will be happy at peak demand.
What is this buffering process going to look like?
It will have to rely on persistent state to prevent you from losing chunks if a failover happens before a file is fully ingested but after its parts have already been committed in Kafka.
There is also a higher-level concern with this.
If the idea is to build a generic system reusable by other features, relying on a specific consumer to do the buffering has the important implication that Kafka is always going to be a moving part of your system. Have you considered doing the buffering separately?
Like buffering works in our Snuba consumers: you keep an array of messages/rows/bytes in memory, and when it comes time to flush you zip them together in some protocol-specific format. In this case the protocol is byte concatenation.
If we assume that the replays consumer pushes the bytes to another, generic Kafka consumer which buffers the files before upload, then the persistent state is the log itself. Upload failures can be re-run from the last committed offset. Persistent failures would have to be handled through a DLQ and would require re-running, potentially introducing a significant amount of latency between (in the replays case) a billing outcome and the replay recording being made available.
If this buffering/upload step instead lives inside our existing consumer (i.e. not a generic service), then offsets will not be committed until after the batch has been uploaded.
I have considered that: process A buffers a bunch of file parts/offsets before sending the completed object to permanent storage, either through direct interaction, filestore, or a Kafka intermediary (or any number of intermediaries). The problem is then that each call site is responsible for the buffer, which is the hardest part of the problem.
Kafka is an important part of this system in my mind; when I wrote this document I relied on the guarantees it provides. If this is a generic service made available to the company, then publishing to the topic should be a simple process for integrators. The fact that it's Kafka internally is an irrelevant implementation detail (to the caller).
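To make the shape of that buffering concrete, here is a minimal sketch of a byte-concatenating buffer that commits offsets only after a successful upload. The names (`ConcatenatingBuffer`, `upload_blob`, `commit_offsets`) are illustrative placeholders, not the actual consumer code.

```python
import uuid


class ConcatenatingBuffer:
    """Accumulate raw message payloads and flush them as one concatenated blob."""

    def __init__(self, upload_blob, commit_offsets, max_bytes=10 * 1024 * 1024):
        self.upload_blob = upload_blob        # callable(key: str, blob: bytes) -> None
        self.commit_offsets = commit_offsets  # callable(offsets: list[int]) -> None
        self.max_bytes = max_bytes
        self.parts = []                       # list of (offset, payload) tuples
        self.size = 0

    def submit(self, offset: int, payload: bytes) -> None:
        self.parts.append((offset, payload))
        self.size += len(payload)
        if self.size >= self.max_bytes:
            self.flush()

    def flush(self) -> None:
        if not self.parts:
            return
        # The "protocol specific format" here is plain byte concatenation.
        blob = b"".join(payload for _, payload in self.parts)
        self.upload_blob(f"batched/{uuid.uuid4().hex}", blob)
        # Offsets are committed only after the upload succeeds; a crash before
        # this point means the uncommitted parts are replayed (at-least-once).
        self.commit_offsets([offset for offset, _ in self.parts])
        self.parts, self.size = [], 0
```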
It seems this will constrain you to a fairly small buffer size.
It also ties the number of replicas of your consumer to the cost efficiency of the storage, which is quite undesirable.
Assuming you are never going to commit on Kafka until the buffer is flushed (if you did, you would not be able to guarantee at-least-once delivery):
replicas and file size should not be coupled to each other; otherwise we will have to keep a lot of affected moving parts in mind when scaling the consumer.
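As a rough, purely illustrative back-of-envelope of that coupling (the ingest rate and flush interval below are made-up numbers, not figures from the RFC):

```python
# Illustrative numbers only; they are not taken from the RFC.
ingest_rate_bytes_per_s = 5_000_000  # total bytes/s arriving on the topic
flush_interval_s = 2                 # how long a replica buffers before flushing

for replicas in (4, 8):
    file_size_mb = ingest_rate_bytes_per_s * flush_interval_s / replicas / 1_000_000
    print(f"{replicas} replicas -> ~{file_size_mb} MB per flushed file")
# 4 replicas -> ~2.5 MB, 8 replicas -> ~1.25 MB: adding replicas shrinks each
# file, which is the coupling being flagged here.
```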
If the time to accumulate a batch is less than the time to upload a batch, then you need to add a replica. That's the only constraint. You get more efficiency at peak load, so it's best to run our replicas hot. The deadline will prevent the upload from sitting idle for too long. The total scale factor will be determined by the number of machines we can throw at the problem.
Multi-processing/threading, I think, would be deadly to this project, so we will need a lot of single-threaded machines running.
I re-wrote this response several times; it's as disorganized as my thoughts are on this. Happy to hear critiques.
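The sizing rule above can be stated as a simple inequality; the function name and numbers below are only an example.

```python
def needs_more_replicas(batch_fill_seconds: float, batch_upload_seconds: float) -> bool:
    # If a batch fills faster than the previous one can be uploaded, a
    # single-threaded replica falls behind and the partition backlog grows.
    return batch_fill_seconds < batch_upload_seconds


print(needs_more_replicas(batch_fill_seconds=1.5, batch_upload_seconds=2.0))  # True
```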
I think this is an okay outcome. If you double the number of replicas, you halve the number of parts per file and double the number of files. That reduces cost efficiency, but the throughput efficiency remains the same for each replica. Ignoring replica count, cost efficiency will ebb and flow with the variations in load we receive throughout the day.
We still come out ahead because the total number of files written per second is lower than in the current implementation, which writes one file per message.
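A quick worked example of that claim, with made-up numbers:

```python
# Made-up numbers, just to show the direction of the saving.
messages_per_s = 2_000
parts_per_file = 500   # halves if we double the replicas, per the point above

files_per_s_current = messages_per_s                   # today: one file per message
files_per_s_batched = messages_per_s / parts_per_file  # batched: one file per flush
print(files_per_s_current, files_per_s_batched)        # 2000 vs 4.0
```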
We should scale our replicas agnostic of the implementation of the buffer's flush mechanics. I mentioned above that cost efficiency is a hard target, so I don't think we should target it.
A deadline should be present to guarantee regular buffer commits, and a max buffer size should exist to prevent us from using too many resources. I think those two commit semantics save us from having to think about the implications of adding replicas: the throughput of a single machine may drop, but the risk of backlog across the cluster decreases.
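A sketch of those two commit triggers, with placeholder class and parameter names:

```python
import time


class FlushPolicy:
    """Flush when the buffer is big enough OR when it has waited long enough."""

    def __init__(self, max_bytes: int, max_parts: int, deadline_seconds: float):
        self.max_bytes = max_bytes
        self.max_parts = max_parts
        self.deadline_seconds = deadline_seconds
        self.first_part_at = None  # set when the first part of a batch arrives

    def note_part(self) -> None:
        if self.first_part_at is None:
            self.first_part_at = time.monotonic()

    def should_flush(self, buffered_bytes: int, buffered_parts: int) -> bool:
        if buffered_bytes >= self.max_bytes or buffered_parts >= self.max_parts:
            return True  # size cap bounds memory use
        if self.first_part_at is None:
            return False  # nothing buffered yet
        # The deadline bounds latency when traffic is too light to fill the buffer.
        return time.monotonic() - self.first_part_at >= self.deadline_seconds

    def reset(self) -> None:
        self.first_part_at = None
```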
What is the target size of each file you want to write, and the target buffer size?
The approach of throwing away in-flight work is generally OK if the buffers are very small and the time it takes to reprocess is negligible. If this is not the case (you want large files), you may create a lot of noise and a lot of backlog when Kafka rebalances consumer groups.
These are the commit criteria I have (naively) envisioned; all of these values are adjustable.
10MB and 1000 parts can be scaled down to 1MB and 100 parts if they seem unrealistic. Any lower and the concept, I think, has run its course and is not worth pursuing.
I believe this will be the case: a generic consumer implementation should only be doing byte concatenation. But depending on size, it may take a while to fetch those files over the network to even begin buffering.
I would prefer this generic consumer be deployed independently of getsentry so rebalances are less common.
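For reference, the 10MB and 1000-part figures from this comment expressed as configuration; the constant names and the deadline value are placeholders, since no deadline number is given here.

```python
# 10MB / 1000 parts come from the comment above; the deadline is a placeholder.
COMMIT_MAX_BYTES = 10 * 1024 * 1024   # flush once the buffer holds ~10MB
COMMIT_MAX_PARTS = 1_000              # or once 1000 parts have been concatenated
COMMIT_DEADLINE_SECONDS = 1.0         # hypothetical upper bound on buffering time
```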