
feat(control-plane): add support for handling multiple events in a single invocation #4603


Draft · wants to merge 1 commit into main

Conversation

@iainlane (Contributor) commented May 29, 2025

⚠️ This hasn't been run in reality yet, but will be soon. ⚠️

Currently we restrict the `scale-up` Lambda to handling a single event at a time. In very busy environments this can prove to be a bottleneck: there are calls to GitHub and AWS APIs on every invocation, and they can take long enough that we can't process job-queued events as fast as they arrive.

In our environment we are also using a pool, and we have typically responded to the resulting alerts (SQS queue length growing) by expanding the size of the pool. This helps because we more frequently find that we don't need to scale up, which lets the Lambdas exit a bit earlier, so we get through the queue faster. But it makes the environment much less responsive to changes in usage patterns.

At its core, this Lambda's task is to construct an EC2 `CreateFleet` call to create instances, after working out how many are needed. This is a job that can be batched: we can take any number of events, calculate the difference between our current state and the number of jobs we have, cap it at the configured maximum, and then issue a single call.
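
The batching arithmetic can be sketched as follows (names are illustrative; the real Lambda derives these values from pool state and runner configuration):

```typescript
// Hypothetical sketch of the batched scale-up calculation: given the
// current state and the total number of waiting jobs, work out how many
// instances a single CreateFleet call should request.
function instancesToCreate(currentRunners: number, queuedJobs: number, maxRunners: number): number {
  // Desired capacity is the number of waiting jobs, capped at the maximum.
  const desired = Math.min(queuedJobs, maxRunners);
  // Only create the shortfall between desired capacity and what already exists.
  return Math.max(0, desired - currentRunners);
}
```

Because the calculation happens once per batch rather than once per event, one `CreateFleet` call can cover many queued jobs.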

The thing to be careful about is how to handle partial failures: if EC2 creates some of the instances we wanted but not all of them. Lambda has a configurable function response type which can be set to `ReportBatchItemFailures`. In this mode, we return a list of failed messages from our handler and only those are retried. We can make use of this to give back exactly the events we failed to process.
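
A minimal sketch of that handler shape, using local stand-ins for the SQS event types (the per-message work here is hypothetical):

```typescript
// Minimal local shapes mirroring the SQS event and the partial-batch
// response Lambda expects when ReportBatchItemFailures is enabled.
interface SqsRecord { messageId: string; body: string; }
interface SqsEvent { Records: SqsRecord[]; }
interface SqsBatchResponse { batchItemFailures: { itemIdentifier: string }[]; }

// Hypothetical per-message work; throws to simulate a failure.
async function processRecord(record: SqsRecord): Promise<void> {
  if (record.body === 'bad') throw new Error(`cannot process ${record.messageId}`);
}

// Failed messages are reported back by messageId; SQS redelivers only those.
async function handler(event: SqsEvent): Promise<SqsBatchResponse> {
  const batchItemFailures: { itemIdentifier: string }[] = [];
  for (const record of event.Records) {
    try {
      await processRecord(record);
    } catch {
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }
  return { batchItemFailures };
}
```

Returning an empty `batchItemFailures` array tells SQS the whole batch succeeded; returning every message ID fails the whole batch.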

Now that we're potentially processing multiple events in a single Lambda invocation, one thing we should optimise for is not recreating GitHub API clients. We need one client for the app itself, which we use to find out installation IDs, and then one client for each installation that is relevant to the batch of events we are processing. This is done by creating a new client the first time we see an event for a given installation.
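
The memoisation can be sketched like this (the `Client` type is a stand-in for an installation-scoped Octokit instance; names are hypothetical):

```typescript
// Stand-in for an installation-scoped GitHub API client.
type Client = { installationId: number };

// One client per installation, created lazily and reused for the rest of
// the batch.
const clientCache = new Map<number, Client>();
let clientsCreated = 0; // for illustration only

function getInstallationClient(installationId: number): Client {
  let client = clientCache.get(installationId);
  if (client === undefined) {
    // First event for this installation: create and memoise a client.
    client = { installationId };
    clientsCreated++;
    clientCache.set(installationId, client);
  }
  return client;
}
```

With a batch of events spread across two installations, only two clients are ever constructed, however many events reference them.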

We also remove the same `batch_size = 1` constraint from the `job-retry` Lambda and make the batch size configurable instead, using AWS's default if not configured. This Lambda retries events that previously failed. Here, instead of reporting failures to be retried, we keep the pre-existing fault-tolerant behaviour: errors are logged but explicitly do not cause message retries, avoiding infinite loops from persistent GitHub API issues or malformed events.
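
The contrast with the `scale-up` handler can be sketched as follows (names are illustrative): errors are swallowed per record rather than returned, so nothing is ever redelivered.

```typescript
// Sketch of the job-retry behaviour: per-record errors are recorded and
// logged but never surfaced to Lambda, so SQS does not redeliver and a
// persistently failing message cannot loop forever.
interface RetryRecord { messageId: string; body: string; }

// Hypothetical retry work; throws to simulate a malformed event.
async function retryJob(record: RetryRecord): Promise<void> {
  if (record.body === 'malformed') throw new Error('cannot parse event');
}

async function jobRetryHandler(records: RetryRecord[]): Promise<string[]> {
  const errors: string[] = [];
  for (const record of records) {
    try {
      await retryJob(record);
    } catch (err) {
      // Log and continue: a failure here must not trigger another retry.
      errors.push(`${record.messageId}: ${(err as Error).message}`);
    }
  }
  return errors; // returned for illustration; the real handler just logs
}
```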

Tests are added for all of this.

@iainlane iainlane force-pushed the iainlane/many-events branch 3 times, most recently from a7720aa to 9056deb Compare May 29, 2025 12:45
@npalm npalm self-requested a review May 30, 2025 07:48
@iainlane iainlane force-pushed the iainlane/many-events branch from 9056deb to 0a19f5f Compare June 6, 2025 12:51
@iainlane iainlane force-pushed the iainlane/many-events branch from cc702e9 to 3e9760c Compare June 6, 2025 12:54