Skip to content

feat: Add last chance check before orphan termination for JIT instances #4595

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 26 commits into
base: main
Choose a base branch
from

Conversation

stuartp44
Copy link
Contributor

@stuartp44 stuartp44 commented May 22, 2025

This pull request introduces enhancements to the AWS Lambda functions responsible for scaling GitHub Actions runners. Key changes include the addition of functionality to untag runners, support for Just-In-Time (JIT) runner configurations, and improvements to orphan runner handling. The updates also include modifications to tests to validate these new features.

Enhancements to Runner Management:

  • Added a new untag function in lambdas/functions/control-plane/src/aws/runners.ts to remove tags from EC2 instances. This is used to handle orphan runners that are online and busy.
  • Updated terminateOrphan logic in lambdas/functions/control-plane/src/scale-runners/scale-down.ts to differentiate between JIT runners and regular runners. Orphan runners with a valid runnerId are now checked for their state before untagging or terminating.

Support for JIT Runner Configurations:

  • Added runnerId to the RunnerInfo structure to support JIT runner identification. This enables tracking and handling of ephemeral runners created via JIT configurations. [1] [2]
  • Updated test mocks and assertions to include JIT runner scenarios, ensuring proper handling of JIT-configured runners in scaling operations. [1] [2]

Test Suite Updates:

  • Added new test cases in scale-down.test.ts to verify the behavior of orphan runners under JIT and non-JIT configurations. These tests ensure that online and busy runners are untagged, while offline runners are terminated.
  • Introduced mock data and functions in scale-down.test.ts and scale-up.test.ts to simulate JIT runner IDs and validate their integration with scaling logic. [1] [2]

Code Refactoring and Imports:

  • Updated imports across multiple files to include DeleteTagsCommand and untag functionality. This ensures consistent usage of the new untagging feature. [1] [2] [3]
  • Refactored getGitHubRunnerBusyState in scale-down.ts to reuse the new getGitHubSelfHostedRunnerState function, improving code clarity and reducing redundancy.

These changes collectively improve the scalability, reliability, and maintainability of the control plane for GitHub Actions runners.

@stuartp44 stuartp44 requested a review from a team as a code owner May 22, 2025 08:25
@stuartp44 stuartp44 requested a review from Copilot May 22, 2025 08:29
Copilot

This comment was marked as outdated.

@stuartp44 stuartp44 changed the title Add last chance check before termination for JIT instances feat: Add last chance check before termination for JIT instances May 22, 2025
@stuartp44 stuartp44 changed the title feat: Add last chance check before termination for JIT instances feat: Add last chance check before orphan termination for JIT instances May 22, 2025
@npalm
Copy link
Member

npalm commented May 22, 2025

@stuartp44 in case the PR is in WIP, can you mark the PR as draft?

@stuartp44
Copy link
Contributor Author

@stuartp44 in case the PR is in WIP, can you mark the PR as draft?

The PR is actually ready to be review, the build failures where unrelated to my changes but I have fixed them nevertheless. @npalm

@npalm npalm self-requested a review May 23, 2025 13:56
Copy link
Member

@npalm npalm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx for the PR. I only was able to a partial review so far. I have checked the Lambda code (excluding th tests). Also not tested a deployment.

This solution is solving the problem only for JIT enabled runners, but for non JIT the problem remains. Assuming the chances is less since typically less runners will be created. Do you have thoughts about alternatives? I think we should find a way / place to document this limitation.

I have made some comments, but need more time to go over the PR. Will ping you once ready, but laready sharing the feedback so far.

@stuartp44 stuartp44 marked this pull request as draft June 3, 2025 15:49
@stuartp44 stuartp44 marked this pull request as ready for review June 4, 2025 10:46
@stuartp44 stuartp44 requested a review from Copilot June 4, 2025 10:46
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds JIT runner support by tagging runner IDs on EC2 instances, then performing a final orphan check to untag busy runners or terminate offline ones. It also refactors runner state retrieval and updates tests to cover the new behavior.

  • Introduce tag/untag for JIT runner lifecycle management
  • Add addGhRunnerIdToEC2InstanceTag and lastChanceCheckOrphanRunner in scale-up/scale-down
  • Update and extend tests for JIT and non-JIT orphan scenarios

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
scale-runners/scale-up.ts Imported tag, added addGhRunnerIdToEC2InstanceTag call
scale-runners/scale-up.test.ts Mocked tag, fixed typo, extended JIT mocks
scale-runners/scale-down.ts Imported untag, added lastChanceCheckOrphanRunner, refactored runner state calls
scale-runners/scale-down.test.ts Mocked untag, added JIT orphan handling tests
aws/runners.ts Added DeleteTagsCommand and untag, updated runnerId extraction
aws/runners.test.ts Added JIT listing test, fixed tag/untag runner tests
Comments suppressed due to low confidence (1)

lambdas/functions/control-plane/src/scale-runners/scale-up.ts:419

  • [nitpick] The function name 'addGhRunnerIdToEC2InstanceTag' is verbose and mixes naming styles. Consider renaming it to something like 'tagRunnerId' for brevity and consistency.
async function addGhRunnerIdToEC2InstanceTag(instanceId: string, runnerId: string): Promise<void> {

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pagination Data Slippage Issue causing EC2 instance to be scaled down
2 participants