@iainlane commented May 29, 2025

Currently we restrict the scale-up Lambda to only handle a single event at a time. In very busy environments this can prove to be a bottleneck: there are calls to GitHub and AWS APIs that happen each time, and they can end up taking long enough that we can't process job queued events faster than they arrive.

In our environment we are also using a pool, and typically we have responded to the alerts generated by this (SQS queue length growing) by expanding the size of the pool. This helps because we will more frequently find that we don't need to scale up, which allows the lambdas to exit a bit earlier, so we can get through the queue faster. But it makes the environment much less responsive to changes in usage patterns.

At its core, this Lambda's task is to construct an EC2 CreateFleet call to create instances, after working out how many are needed. This is a job that can be batched. We can take any number of events, calculate the diff between our current state and the number of jobs we have, capping at the maximum, and then issue a single call.
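
As a rough sketch of that calculation (the helper functions, the event type and the `maxRunners` constant below are hypothetical stand-ins, not the module's real API), the batched flow collapses the per-event round-trips into one state read and one `CreateFleet` call:

```typescript
// Minimal sketch of the batched scale-up decision. All names here are
// illustrative stubs; the real Lambda derives these values from the SQS
// events, its configuration and the EC2 / GitHub APIs.
interface JobQueuedEvent {
  id: string;
}

const maxRunners = 100; // stand-in for the configured maximum runner count

async function listCurrentRunners(): Promise<number> {
  return 0; // stub: the real helper lists the runners that already exist
}

async function createFleet(count: number): Promise<void> {
  console.log(`would issue one CreateFleet call for ${count} instance(s)`); // stub
}

async function scaleUpBatch(events: JobQueuedEvent[]): Promise<void> {
  const queued = events.length; // jobs queued across the whole batch
  const current = await listCurrentRunners(); // runners we already have
  const headroom = Math.max(0, maxRunners - current);
  const toCreate = Math.min(queued, headroom); // cap at the configured maximum

  if (toCreate > 0) {
    await createFleet(toCreate); // a single EC2 call for the whole batch
  }
}
```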

The thing to be careful about is how to handle partial failures, where EC2 creates some of the instances we wanted but not all of them. Lambda has a configurable function response type which can be set to `ReportBatchItemFailures`. In this mode, we return a list of failed messages from our handler and those are retried. We can make use of this to give back as many events as we failed to process.
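
Roughly, the handler contract with `ReportBatchItemFailures` looks like this; `processRecord` is a stand-in, and the real Lambda works on the whole batch at once rather than looping per message, so treat this only as an illustration of the response shape:

```typescript
import { SQSBatchResponse, SQSEvent } from 'aws-lambda';

// Illustration of the partial-batch response contract. `processRecord` is a
// stub; the real scale-up code returns the message IDs of the events it
// could not satisfy after the batched CreateFleet call.
async function processRecord(body: string): Promise<void> {
  // parse the job-queued event and scale up (stub)
}

export async function handler(event: SQSEvent): Promise<SQSBatchResponse> {
  const batchItemFailures: SQSBatchResponse['batchItemFailures'] = [];

  for (const record of event.Records) {
    try {
      await processRecord(record.body);
    } catch {
      // Only the failed messages are reported; SQS re-delivers just these.
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }

  return { batchItemFailures };
}
```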

Now that we're potentially processing multiple events in a single Lambda invocation, one thing we should optimise for is not recreating GitHub API clients. We need one client for the app itself, which we use to find out installation IDs, and then one client for each installation that is relevant to the batch of events we are processing. This is done by creating a new client the first time we see an event for a given installation.
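
A minimal sketch of that per-installation caching; `createInstallationClient` is a stand-in for the real GitHub App auth helper:

```typescript
import { Octokit } from '@octokit/rest';

// Sketch of reusing one client per installation across a batch.
// `createInstallationClient` is a stub for the real app-auth helper.
const installationClients = new Map<number, Octokit>();

async function createInstallationClient(installationId: number): Promise<Octokit> {
  return new Octokit(); // stub: the real code authenticates as the app installation
}

export async function getInstallationClient(installationId: number): Promise<Octokit> {
  const cached = installationClients.get(installationId);
  if (cached !== undefined) {
    return cached;
  }
  // First event in the batch for this installation: create and remember a client.
  const client = await createInstallationClient(installationId);
  installationClients.set(installationId, client);
  return client;
}
```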

We also remove the same batch_size = 1 constraint from the job-retry Lambda. This Lambda is used to retry events that previously failed. However, instead of reporting failures to be retried, here we maintain the pre-existing fault-tolerant behaviour where errors are logged but explicitly do not cause message retries, avoiding infinite loops from persistent GitHub API issues or malformed events.
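
A sketch of that fault-tolerant style (names are illustrative; the real job-retry logic differs): errors are caught and logged, and the handler never reports a batch item failure, so SQS does not re-deliver the message.

```typescript
import { SQSEvent } from 'aws-lambda';

// Sketch of the fault-tolerant style described above: failures are logged but
// never reported back to SQS, so a broken message cannot loop forever.
// `retryJob` is a stub for the real re-publishing logic.
async function retryJob(payload: unknown): Promise<void> {
  // stub: re-publish the job-queued event for another scale-up attempt
}

export async function jobRetryHandler(event: SQSEvent): Promise<void> {
  for (const record of event.Records) {
    try {
      await retryJob(JSON.parse(record.body));
    } catch (error) {
      // Deliberately swallow the error: persistent GitHub API problems or
      // malformed events must not cause endless re-delivery.
      console.warn(`job-retry failed for message ${record.messageId}`, error);
    }
  }
}
```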

Tests are added for all of this.

Tests in a private repo (sorry) look good. This was running ephemeral runners with no pool, SSM high-throughput enabled, the job queued check disabled, a batch size of 200, and a wait time of 10 seconds. The workflow runs are each a matrix with 250 jobs.

[screenshot of the test results]

@iainlane force-pushed the iainlane/many-events branch 3 times, most recently from a7720aa to 9056deb on May 29, 2025 12:45
@npalm self-requested a review May 30, 2025 07:48
@iainlane force-pushed the iainlane/many-events branch 8 times, most recently from d7d1b7c to 54e4a8a on June 12, 2025 14:55
@iainlane force-pushed the iainlane/many-events branch from 54e4a8a to 79eb6a5 on June 13, 2025 18:50

@iainlane (author) commented on this hunk:

    // Add the context to all child loggers
    childLoggers.forEach((childLogger) => {
      childLogger.addPersistentLogAttributes({
I was getting warnings about mixing this with `persistentKeys`. The function says

    /**
     * @deprecated This method is deprecated and will be removed in the future major versions, please use {@link appendPersistentKeys() `appendPersistentKeys()`} instead.
     */
    addPersistentLogAttributes(attributes: LogKeys): void;

so I've switched it!
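
For reference, a minimal example of the replacement call (the service name and key below are illustrative, not the ones used in this module):

```typescript
import { Logger } from '@aws-lambda-powertools/logger';

// appendPersistentKeys() attaches the same persistent context without the
// deprecation warning. The service name and key here are illustrative.
const childLogger = new Logger({ serviceName: 'scale-up' });
childLogger.appendPersistentKeys({ 'github-installation-id': 12345678 });
```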

@iainlane marked this pull request as ready for review June 13, 2025 18:58
@iainlane requested review from a team as code owners June 13, 2025 18:58
@npalm requested a review from Copilot June 13, 2025 21:40

@iainlane force-pushed the iainlane/many-events branch from 69ff9b8 to 9443df8 on June 16, 2025 10:17
iainlane and others added 7 commits September 1, 2025 15:31
…ngle invocation

Currently we restrict the `scale-up` Lambda to only handle a single
event at a time. In very busy environments this can prove to be a
bottleneck: there are calls to GitHub and AWS APIs that happen each
time, and they can end up taking long enough that we can't process
job queued events faster than they arrive.

In our environment we are also using a pool, and typically we have
responded to the alerts generated by this (SQS queue length growing) by
expanding the size of the pool. This helps because we will more
frequently find that we don't need to scale up, which allows the lambdas
to exit a bit earlier, so we can get through the queue faster. But it
makes the environment much less responsive to changes in usage patterns.

At its core, this Lambda's task is to construct an EC2 `CreateFleet`
call to create instances, after working out how many are needed. This is
a job that can be batched. We can take any number of events, calculate
the diff between our current state and the number of jobs we have,
capping at the maximum, and then issue a single call.

The thing to be careful about is how to handle partial failures, if EC2
creates some of the instances we wanted but not all of them. Lambda has
a configurable function response type which can be set to
`ReportBatchItemFailures`. In this mode, we return a list of failed
messages from our handler and those are retried. We can make use of this
to give back as many events as we failed to process.

Now we're potentially processing multiple events in a single Lambda, one
thing we should optimise for is not recreating GitHub API clients. We
need one client for the app itself, which we use to find out
installation IDs, and then one client for each installation which is
relevant to the batch of events we are processing. This is done by
creating a new client the first time we see an event for a given
installation.

We also remove the same `batch_size = 1` constraint from the `job-retry`
Lambda and make it configurable instead, using AWS's default of 10 for
SQS if not configured. This Lambda is used to retry events that
previously failed. However, instead of reporting failures to be retried,
here we maintain the pre-existing fault-tolerant behaviour where errors
are logged but explicitly do not cause message retries, avoiding
infinite loops from persistent GitHub API issues or malformed events.

Tests are added for all of this.

Some small tweaks to the style of `ScaleError`, while we're working on
it.

`message` doesn't need to be a public property. `Error` already has it.
Likewise with `stack` - we can use the parent's implementation.

Instead of `getDetailedMessage` we can have a property getter which is
very slightly nicer to read at callsites (`e.detailedMessage` vs.
`e.getDetailedMessage()`).

This function is getting a little complex to read now. Refactoring to
use the guard clause pattern makes the flow clearer. Now it's more like
a sequence of checks followed by a final `throw`.
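
As a generic illustration of the guard-clause shape (the checks below are made up for the sketch, not the function's real ones):

```typescript
// Generic guard-clause sketch; the real checks in scale-up are different.
function assertCanCreateRunners(requested: number, current: number, maximum: number): void {
  if (maximum === -1) {
    return; // unlimited runners: nothing to check
  }
  if (current + requested <= maximum) {
    return; // still within the configured maximum
  }
  throw new Error(`creating ${requested} more runners would exceed the maximum of ${maximum}`);
}
```
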
… tests

This lets us move some logic from `processFleetResult`, simplifying it.
We're additionally adding some bounds checking here, and testing it all.

We only construct this once in the production code - there's no need to
have it be variable.

@iainlane force-pushed the iainlane/many-events branch from 9443df8 to 83e644b on September 1, 2025 14:31
@iainlane requested a review from Copilot September 2, 2025 09:10

Prior to this change, we would have retried if _any_ runner failed to be
created either due to capacity or an EC2 error. Now we'll consider these
differently.

If we can't create due to the maximum allowed runner count, retry only
if creating ephemeral runners.

Always retry if EC2 creation fails.
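
Sketched as a decision function (the outcome type and field names are hypothetical, chosen only to illustrate the distinction):

```typescript
// Sketch of the retry decision described above; the outcome shape is hypothetical.
interface ScaleUpOutcome {
  hitMaxRunners: boolean; // capped by the maximum allowed runner count
  ec2CreateFailed: boolean; // CreateFleet failed or returned fewer instances
}

function shouldRetry(outcome: ScaleUpOutcome, ephemeral: boolean): boolean {
  if (outcome.ec2CreateFailed) {
    return true; // always retry genuine EC2 failures
  }
  if (outcome.hitMaxRunners) {
    return ephemeral; // at the cap, only ephemeral runners are retried
  }
  return false;
}
```
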
The code:

```typescript
if (x) {
  if (y) {
    // ...
  }
}
```

is equivalent to

```typescript
if (x && y) {
  // ...
}
```

the latter is shorter and easier to read, so let's use that.

@iainlane requested review from Copilot and npalm September 2, 2025 12:03

We were grouping all events we were called for, and only looking at the
non-`aws:sqs` events to warn about them. This can be done as we go, and
then we only need to keep the SQS events in memory.
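
A sketch of doing that check as we iterate (the record shape here is reduced to the one field the check needs):

```typescript
// Sketch: warn about unexpected event sources as we go, keeping only the SQS
// records in memory instead of grouping everything first.
interface IncomingRecord {
  eventSource: string;
  body: string;
}

function collectSqsRecords(records: IncomingRecord[]): IncomingRecord[] {
  const sqsRecords: IncomingRecord[] = [];
  for (const record of records) {
    if (record.eventSource !== 'aws:sqs') {
      console.warn(`ignoring event from unexpected source: ${record.eventSource}`);
      continue;
    }
    sqsRecords.push(record);
  }
  return sqsRecords;
}
```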

@iainlane requested a review from Copilot September 2, 2025 12:19

Copilot AI left a comment

Pull Request Overview

This PR adds support for handling multiple events in a single Lambda invocation to improve throughput in busy environments. Previously, the scale-up Lambda was restricted to processing only one event at a time, which could create bottlenecks when GitHub and AWS API calls took too long.

  • Event batching: Enables processing multiple SQS messages in a single Lambda invocation with configurable batch sizes
  • Partial failure handling: Implements ReportBatchItemFailures to retry only failed messages rather than entire batches
  • GitHub client optimization: Reuses GitHub API clients across events for the same installation to reduce API overhead

Reviewed Changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| variables.tf | Adds new variables for configuring Lambda event source mapping batch size and batching window |
| modules/runners/scale-up.tf | Updates Lambda event source mapping to support batch processing with ReportBatchItemFailures |
| lambdas/functions/control-plane/src/scale-runners/scale-up.ts | Major refactor to handle batched events, optimize GitHub client usage, and return failed message IDs |
| lambdas/functions/control-plane/src/lambda.ts | Updates Lambda handler to process SQS batches and return batch item failures |
| lambdas/functions/control-plane/src/scale-runners/ScaleError.ts | Enhanced to support batch failure reporting with failed instance counts |


Copilot pointed out correctly that it could be confusing to log
`failedInstanceCount` using the key `missingInstanceCount` when there is
a variable with that name.

@npalm commented Sep 3, 2025

@ScottGuymer @stuartp44 wondering, do you also see a use case for your deployments?

@npalm added the `enhancement` label Sep 3, 2025

@npalm commented Sep 3, 2025

@iainlane for the updates, is the PR ready from your point of view?

@iainlane commented Sep 3, 2025

> @iainlane for the updates, is the PR ready from your point of view?

As far as I’m concerned yeah 👍


A project member commented on this hunk:

    // Don't call the EC2 API if we can create an unlimited number of runners.
    const currentRunners =
      maximumRunners === -1 ? 0 : (await listEC2Runners({ environment, runnerType, runnerOwner: group })).length;
Nice, this will partly address issue #4710

@npalm commented Sep 8, 2025

> @ScottGuymer @stuartp44 wondering, do you also see a use case for your deployments?

@stuartp44 @ScottGuymer kind reminder: would this PR also help you? Any time to run a test on your deployments?

@npalm commented Sep 8, 2025

@iainlane I ran a quick test; in general I saw no issues. But I want to go over some logs before approving the PR. Although it should not impact any of the current usages, the change is significant.
