Description
Based on feedback from users, especially larger ones, I want to create this issue to discuss a possible way of better handling how we scale SQS ingestion.
This does not assume we have the capability to do this today; the intent is to research and discuss it further.
Common SQS workflow behavior
- A single thread handles the communication with SQS.
- First it runs a ReceiveMessage (with the count configured by `max_number_of_messages`) to retrieve the next messages in the queue.
- For each message retrieved, we run HideVisibilityTimeout, which hides the message from other beat instances that also retrieve messages from the queue.
- This timeout is 300 seconds; if the S3 object is still being processed once 150s have passed, the 300s timeout is refreshed.
- Once the object has been processed, we send a DeleteMessage to SQS to remove it (see the sketch after this list).
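For illustration, here is a minimal sketch of that loop using the AWS SDK for Go v2. The helper `processS3Object`, the sequential processing, and the per-message keep-alive structure are assumptions made for the sketch, not the actual filebeat implementation:

```go
package main

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/sqs"
	"github.com/aws/aws-sdk-go-v2/service/sqs/types"
)

// processS3Object stands in for the download-and-parse step (hypothetical).
func processS3Object(msg types.Message) {}

// pollOnce mirrors the workflow above: one receive, a visibility
// keep-alive per message, sequential processing, then deletion.
func pollOnce(ctx context.Context, queueURL string, maxMessages int32) error {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return err
	}
	client := sqs.NewFromConfig(cfg)

	// One ReceiveMessage call fetches up to max_number_of_messages
	// (note: SQS caps a single call at 10 messages, so larger values
	// would need multiple calls).
	out, err := client.ReceiveMessage(ctx, &sqs.ReceiveMessageInput{
		QueueUrl:            aws.String(queueURL),
		MaxNumberOfMessages: maxMessages,
	})
	if err != nil {
		return err
	}

	// Each fetched message is kept hidden from other beat instances by
	// refreshing its 300s visibility timeout every 150s, even while it
	// waits its turn to be processed.
	stops := make([]chan struct{}, len(out.Messages))
	for i, msg := range out.Messages {
		stop := make(chan struct{})
		stops[i] = stop
		go func(handle *string, stop <-chan struct{}) {
			ticker := time.NewTicker(150 * time.Second)
			defer ticker.Stop()
			for {
				select {
				case <-stop:
					return
				case <-ticker.C:
					_, _ = client.ChangeMessageVisibility(ctx, &sqs.ChangeMessageVisibilityInput{
						QueueUrl:          aws.String(queueURL),
						ReceiptHandle:     handle,
						VisibilityTimeout: 300,
					})
				}
			}
		}(msg.ReceiptHandle, stop)
	}

	// Process one object at a time; with large objects, the later
	// messages sit hidden for the entire wait (the problem in use
	// case 2 below).
	for i, msg := range out.Messages {
		processS3Object(msg)
		_, _ = client.DeleteMessage(ctx, &sqs.DeleteMessageInput{
			QueueUrl:      aws.String(queueURL),
			ReceiptHandle: msg.ReceiptHandle,
		})
		close(stops[i])
	}
	return nil
}
```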
Now let's apply this logic to a use case:
Large-scale example
Use case 1:
A user is storing data from a data source on S3. This specific data source usually stores a very small number of events, but spread over a large number of S3 objects. This means that even if a single filebeat has no problem handling the volume of events coming in, it's not fast enough to retrieve the objects from the queue.
Increasing `max_number_of_messages` helps in this case, but the default value would have been too low. However, if we set the default value too high, it will break the use case below.
Use case 2:
A user is storing data in S3. This specific data source creates 50MB objects (each event being around 200-400 bytes, so roughly 125,000-250,000 events per object). When retrieving more than a few SQS messages at a time, the underlying filebeat will take a long time processing these objects.
While these are being processed, if you have a higher `max_number_of_messages` (even the default 5 is sometimes too high), the HideVisibilityTimeout on the other objects waiting to be processed by the same filebeat instance keeps being refreshed for minutes (or even hours), which keeps them hidden from the other concurrent filebeat instances that are used to scale SQS ingestion horizontally.
This can add minutes or hours of latency and makes the load balancing across SQS consumers really bad.
Solution
I want to provide a better user experience and cover as many use cases as possible (there will always be niche use cases we can't cover).
The solution should be:
- A togglable option, defaulting to false, that overrides `max_number_of_messages` dynamically.
- It should start at a `max_number_of_messages` of 5, with a minimum of 1 and a maximum of 100.
- We need to determine either how long a worker is spending on an object, or how large an object is, and find some good measurements to decide whether we should change `max_number_of_messages` (see the sketch below).
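To make that last point concrete, here is a hedged sketch of what the dynamic adjustment could look like. The thresholds, the halving/doubling policy, and the function name are illustrative assumptions, not a settled design; the input could come from the metrics listed below:

```go
package main

import "time"

// adjustMaxMessages is a hypothetical controller for a dynamic
// max_number_of_messages: start at 5, clamp to [1, 100], shrink when
// objects are slow to process (use case 2), grow when they are fast
// and plentiful (use case 1). All thresholds are illustrative.
func adjustMaxMessages(current int, avgObjectProcessing time.Duration) int {
	const (
		minMessages = 1
		maxMessages = 100
	)
	switch {
	case avgObjectProcessing > 60*time.Second:
		// Large, slow objects: fetch fewer messages so the rest of
		// the queue stays visible to other filebeat instances.
		current /= 2
	case avgObjectProcessing < time.Second:
		// Small, fast objects: fetch more per receive to keep up
		// with a queue of many tiny S3 objects.
		current *= 2
	}
	if current < minMessages {
		current = minMessages
	}
	if current > maxMessages {
		current = maxMessages
	}
	return current
}
```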
Some metrics we already have access to that could help determine this:
https://www.elastic.co/guide/en/beats/filebeat/master/filebeat-input-aws-s3.html#_metrics
- `sqs_visibility_timeout_extensions_total` (how many HideVisibilityTimeout extensions were used)
- `sqs_lag_time` (how far behind we are)
- `s3_object_processing_time` (histogram of the elapsed S3 object processing times in nanoseconds, from start of download to completion of parsing)
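As a rough illustration of how these metrics could feed the controller sketched above, something like the following; the snapshot struct and thresholds are hypothetical and not wired to filebeat's actual monitoring registry:

```go
package main

import "time"

// metricsSnapshot mirrors the metric names above as plain values;
// this struct is hypothetical, not filebeat's metrics API.
type metricsSnapshot struct {
	VisibilityTimeoutExtensionsTotal int64         // sqs_visibility_timeout_extensions_total
	SQSLagTime                       time.Duration // sqs_lag_time
	ObjectProcessingP50              time.Duration // median of the s3_object_processing_time histogram
}

// classify suggests a direction for max_number_of_messages. Many
// visibility extensions with slow processing looks like use case 2
// (few, large objects: shrink); growing lag with fast processing
// looks like use case 1 (many small objects: grow). Thresholds are
// illustrative assumptions.
func classify(m metricsSnapshot) int {
	switch {
	case m.ObjectProcessingP50 > 60*time.Second && m.VisibilityTimeoutExtensionsTotal > 0:
		return -1 // shrink
	case m.ObjectProcessingP50 < time.Second && m.SQSLagTime > 5*time.Minute:
		return +1 // grow
	default:
		return 0 // keep the current value
	}
}
```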