Skip to content

feat: Add last chance check before orphan termination for JIT instances #4595

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 27 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
3fa09d2
feat(runners): add runnerId to RunnerList and implement untag functio…
stuartp44 May 21, 2025
14df9fe
fix(tests): improve clarity of orphaned runner untagging test descrip…
stuartp44 May 22, 2025
9d7c89a
fmt
stuartp44 May 22, 2025
e09337f
fix(scale-down): remove unnecessary logging of runner variable in ter…
stuartp44 May 22, 2025
9064512
fix(scale-up): remove unused import of run function
stuartp44 May 22, 2025
716e079
fix(scale-down): remove unused import of metricGitHubAppRateLimit
stuartp44 May 23, 2025
e826fbe
Merge branch 'main' into stu/add_tag_plus_check
npalm May 23, 2025
a5fcc88
Update lambdas/functions/control-plane/src/scale-runners/scale-down.ts
stuartp44 May 26, 2025
bb8ba2b
Update lambdas/functions/control-plane/src/scale-runners/scale-down.ts
stuartp44 May 26, 2025
97de234
Update lambdas/functions/control-plane/src/scale-runners/scale-up.ts
stuartp44 May 26, 2025
8b61d39
Update lambdas/functions/control-plane/src/scale-runners/scale-up.ts
stuartp44 May 26, 2025
9883b0b
Remove warning log for orphan runners without runnerId in scale-down …
stuartp44 May 29, 2025
43468a7
Remove logging of runner ID marking in addGhRunnerIdToEC2InstanceTag …
stuartp44 May 29, 2025
bc995ef
readded metricGitHubAppRateLimit
stuartp44 Jun 3, 2025
1182c8b
Merge branch 'main' into stu/add_tag_plus_check
stuartp44 Jun 3, 2025
6e1c72c
Refactor runner interfaces: remove RunnerState interface and update i…
stuartp44 Jun 3, 2025
097c14d
Add headers to runner state return and update logging for busy state
stuartp44 Jun 4, 2025
971ec2d
Remove redundant comment describing RunnerState type
stuartp44 Jun 4, 2025
5a275e5
Implement last chance check for orphan runners and refactor terminati…
stuartp44 Jun 4, 2025
9f59abe
Format return object in getGitHubSelfHostedRunnerState for improved r…
stuartp44 Jun 4, 2025
65c0b0e
Refactor runner state types to use Endpoints for improved type safety…
stuartp44 Jun 4, 2025
8ead598
Fix formatting of type definitions and adjust indentation in getGitHu…
stuartp44 Jun 4, 2025
ab5b6b0
Update lambdas/functions/control-plane/src/aws/runners.ts
stuartp44 Jun 4, 2025
0b462bb
Update lambdas/functions/control-plane/src/scale-runners/scale-down.t…
stuartp44 Jun 4, 2025
102edf0
Fix typo in key for GitHub runner ID in mock running instances
stuartp44 Jun 4, 2025
e6d2d88
Merge branch 'main' into stu/add_tag_plus_check
npalm Jun 10, 2025
83610eb
Merge branch 'main' into stu/add_tag_plus_check
npalm Jun 11, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions lambdas/functions/control-plane/src/aws/runners.d.ts
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@ export interface RunnerList {
repo?: string;
org?: string;
orphan?: boolean;
runnerId?: string;
}

export interface RunnerInfo {
Expand Down
67 changes: 63 additions & 4 deletions lambdas/functions/control-plane/src/aws/runners.test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ import {
CreateFleetInstance,
CreateFleetResult,
CreateTagsCommand,
DeleteTagsCommand,
DefaultTargetCapacityType,
DescribeInstancesCommand,
DescribeInstancesResult,
Expand All @@ -17,7 +18,7 @@ import { mockClient } from 'aws-sdk-client-mock';
import 'aws-sdk-client-mock-jest/vitest';

import ScaleError from './../scale-runners/ScaleError';
import { createRunner, listEC2Runners, tag, terminateRunner } from './runners';
import { createRunner, listEC2Runners, tag, untag, terminateRunner } from './runners';
import { RunnerInfo, RunnerInputParameters, RunnerType } from './runners.d';
import { describe, it, expect, beforeEach, vi } from 'vitest';

Expand Down Expand Up @@ -53,14 +54,34 @@ const mockRunningInstances: DescribeInstancesResult = {
},
],
};
const mockRunningInstancesJit: DescribeInstancesResult = {
Reservations: [
{
Instances: [
{
LaunchTime: new Date('2020-10-10T14:48:00.000+09:00'),
InstanceId: 'i-1234',
Tags: [
{ Key: 'ghr:Application', Value: 'github-action-runner' },
{ Key: 'ghr:runner_name_prefix', Value: RUNNER_NAME_PREFIX },
{ Key: 'ghr:created_by', Value: 'scale-up-lambda' },
{ Key: 'ghr:Type', Value: 'Org' },
{ Key: 'ghr:Owner', Value: 'CoderToCat' },
{ Key: 'ghr:github_runner_id', Value: '9876543210' },
],
},
],
},
],
};

describe('list instances', () => {
beforeEach(() => {
vi.resetModules();
vi.clearAllMocks();
});

it('returns a list of instances', async () => {
it('returns a list of instances (Non JIT)', async () => {
mockEC2Client.on(DescribeInstancesCommand).resolves(mockRunningInstances);
const resp = await listEC2Runners();
expect(resp.length).toBe(1);
Expand All @@ -73,6 +94,20 @@ describe('list instances', () => {
});
});

it('returns a list of instances (JIT)', async () => {
mockEC2Client.on(DescribeInstancesCommand).resolves(mockRunningInstancesJit);
const resp = await listEC2Runners();
expect(resp.length).toBe(1);
expect(resp).toContainEqual({
instanceId: 'i-1234',
launchTime: new Date('2020-10-10T14:48:00.000+09:00'),
type: 'Org',
owner: 'CoderToCat',
orphan: false,
runnerId: '9876543210',
});
});

it('check orphan tag.', async () => {
const instances: DescribeInstancesResult = mockRunningInstances;
instances.Reservations![0].Instances![0].Tags!.push({ Key: 'ghr:orphan', Value: 'true' });
Expand Down Expand Up @@ -229,11 +264,35 @@ describe('tag runner', () => {
owner: 'owner-2',
type: 'Repo',
};
await tag(runner.instanceId, [{ Key: 'ghr:orphan', Value: 'truer' }]);
await tag(runner.instanceId, [{ Key: 'ghr:orphan', Value: 'true' }]);

expect(mockEC2Client).toHaveReceivedCommandWith(CreateTagsCommand, {
Resources: [runner.instanceId],
Tags: [{ Key: 'ghr:orphan', Value: 'true' }],
});
});
});

describe('untag runner', () => {
beforeEach(() => {
vi.clearAllMocks();
});
it('removing extra tag', async () => {
mockEC2Client.on(DeleteTagsCommand).resolves({});
const runner: RunnerInfo = {
instanceId: 'instance-2',
owner: 'owner-2',
type: 'Repo',
};
//await tag(runner.instanceId, [{ Key: 'ghr:orphan', Value: '' }]);
expect(mockEC2Client).toHaveReceivedCommandWith(CreateTagsCommand, {
Resources: [runner.instanceId],
Tags: [{ Key: 'ghr:orphan', Value: 'truer' }],
Tags: [{ Key: 'ghr:orphan', Value: 'true' }],
});
await untag(runner.instanceId, [{ Key: 'ghr:orphan', Value: 'true' }]);
expect(mockEC2Client).toHaveReceivedCommandWith(DeleteTagsCommand, {
Resources: [runner.instanceId],
Tags: [{ Key: 'ghr:orphan', Value: 'true' }],
});
});
});
Expand Down
8 changes: 8 additions & 0 deletions lambdas/functions/control-plane/src/aws/runners.ts
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@ import {
CreateFleetCommand,
CreateFleetResult,
CreateTagsCommand,
DeleteTagsCommand,
DescribeInstancesCommand,
DescribeInstancesResult,
EC2Client,
Expand Down Expand Up @@ -91,6 +92,7 @@ function getRunnerInfo(runningInstances: DescribeInstancesResult) {
repo: i.Tags?.find((e) => e.Key === 'ghr:Repo')?.Value as string,
org: i.Tags?.find((e) => e.Key === 'ghr:Org')?.Value as string,
orphan: i.Tags?.find((e) => e.Key === 'ghr:orphan')?.Value === 'true',
runnerId: i.Tags?.find((e) => e.Key === 'ghr:github_runner_id')?.Value as string,
});
}
}
Expand All @@ -112,6 +114,12 @@ export async function tag(instanceId: string, tags: Tag[]): Promise<void> {
await ec2.send(new CreateTagsCommand({ Resources: [instanceId], Tags: tags }));
}

export async function untag(instanceId: string, tags: Tag[]): Promise<void> {
logger.debug(`Untagging '${instanceId}'`, { tags });
const ec2 = getTracedAWSV3Client(new EC2Client({ region: process.env.AWS_REGION }));
await ec2.send(new DeleteTagsCommand({ Resources: [instanceId], Tags: tags }));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you test untagging via an actual deployment. I do not see any update to the permisison of the lmabda role. So assuming the call wil fail.

}

function generateFleetOverrides(
subnetIds: string[],
instancesTypes: string[],
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ import nock from 'nock';

import { RunnerInfo, RunnerList } from '../aws/runners.d';
import * as ghAuth from '../github/auth';
import { listEC2Runners, terminateRunner, tag } from './../aws/runners';
import { listEC2Runners, terminateRunner, tag, untag } from './../aws/runners';
import { githubCache } from './cache';
import { newestFirstStrategy, oldestFirstStrategy, scaleDown } from './scale-down';
import { describe, it, expect, beforeEach, vi } from 'vitest';
Expand Down Expand Up @@ -33,6 +33,7 @@ vi.mock('./../aws/runners', async (importOriginal) => {
return {
...actual,
tag: vi.fn(),
untag: vi.fn(),
terminateRunner: vi.fn(),
listEC2Runners: vi.fn(),
};
Expand Down Expand Up @@ -62,6 +63,7 @@ const mockedInstallationAuth = vi.mocked(ghAuth.createGithubInstallationAuth);
const mockCreateClient = vi.mocked(ghAuth.createOctokitClient);
const mockListRunners = vi.mocked(listEC2Runners);
const mockTagRunners = vi.mocked(tag);
const mockUntagRunners = vi.mocked(untag);
const mockTerminateRunners = vi.mocked(terminateRunner);

export interface TestData {
Expand Down Expand Up @@ -312,7 +314,7 @@ describe('Scale down runners', () => {
checkNonTerminated(runners);
});

it(`Should terminate orphan.`, async () => {
it(`Should terminate orphan (Non JIT)`, async () => {
// setup
const orphanRunner = createRunnerTestData('orphan-1', type, MINIMUM_BOOT_TIME + 1, false, false, false);
const idleRunner = createRunnerTestData('idle-1', type, MINIMUM_BOOT_TIME + 1, true, false, false);
Expand All @@ -334,6 +336,7 @@ describe('Scale down runners', () => {
Value: 'true',
},
]);

expect(mockTagRunners).not.toHaveBeenCalledWith(idleRunner.instanceId, expect.anything());

// next cycle, update test data set orphan to true and terminate should be true
Expand All @@ -348,6 +351,58 @@ describe('Scale down runners', () => {
checkNonTerminated(runners);
});

it('Should test if orphaned runner, untag if online and busy, else terminate (JIT)', async () => {
// arrange
const orphanRunner = createRunnerTestData(
'orphan-jit',
type,
MINIMUM_BOOT_TIME + 1,
false,
true,
false,
undefined,
1234567890,
);
const runners = [orphanRunner];

mockGitHubRunners([]);
mockAwsRunners(runners);

if (type === 'Repo') {
mockOctokit.actions.getSelfHostedRunnerForRepo.mockResolvedValueOnce({
data: { id: 1234567890, name: orphanRunner.instanceId, busy: true, status: 'online' },
});
} else {
mockOctokit.actions.getSelfHostedRunnerForOrg.mockResolvedValueOnce({
data: { id: 1234567890, name: orphanRunner.instanceId, busy: true, status: 'online' },
});
}

// act
await scaleDown();

// assert
expect(mockUntagRunners).toHaveBeenCalledWith(orphanRunner.instanceId, [{ Key: 'ghr:orphan', Value: 'true' }]);
expect(mockTerminateRunners).not.toHaveBeenCalledWith(orphanRunner.instanceId);

// arrange
if (type === 'Repo') {
mockOctokit.actions.getSelfHostedRunnerForRepo.mockResolvedValueOnce({
data: { runnerId: 1234567890, name: orphanRunner.instanceId, busy: true, status: 'offline' },
});
} else {
mockOctokit.actions.getSelfHostedRunnerForOrg.mockResolvedValueOnce({
data: { runnerId: 1234567890, name: orphanRunner.instanceId, busy: true, status: 'offline' },
});
}

// act
await scaleDown();

// assert
expect(mockTerminateRunners).toHaveBeenCalledWith(orphanRunner.instanceId);
});

it(`Should ignore errors when termination orphan fails.`, async () => {
// setup
const orphanRunner = createRunnerTestData('orphan-1', type, MINIMUM_BOOT_TIME + 1, false, true, true);
Expand Down Expand Up @@ -625,6 +680,7 @@ function createRunnerTestData(
orphan: boolean,
shouldBeTerminated: boolean,
owner?: string,
runnerId?: number,
): RunnerTestItem {
return {
instanceId: `i-${name}-${type.toLowerCase()}`,
Expand All @@ -638,5 +694,6 @@ function createRunnerTestData(
registered,
orphan,
shouldBeTerminated,
runnerId: runnerId !== undefined ? String(runnerId) : undefined,
};
}
67 changes: 56 additions & 11 deletions lambdas/functions/control-plane/src/scale-runners/scale-down.ts
Original file line number Diff line number Diff line change
@@ -1,9 +1,10 @@
import { Octokit } from '@octokit/rest';
import { Endpoints } from '@octokit/types';
import { createChildLogger } from '@aws-github-runner/aws-powertools-util';
import moment from 'moment';

import { createGithubAppAuth, createGithubInstallationAuth, createOctokitClient } from '../github/auth';
import { bootTimeExceeded, listEC2Runners, tag, terminateRunner } from './../aws/runners';
import { bootTimeExceeded, listEC2Runners, tag, untag, terminateRunner } from './../aws/runners';
import { RunnerInfo, RunnerList } from './../aws/runners.d';
import { GhRunners, githubCache } from './cache';
import { ScalingDownConfig, getEvictionStrategy, getIdleRunnerCount } from './scale-down-config';
Expand All @@ -12,6 +13,10 @@ import { getGitHubEnterpriseApiUrl } from './scale-up';

const logger = createChildLogger('scale-down');

type OrgRunnerList = Endpoints['GET /orgs/{org}/actions/runners']['response']['data']['runners'];
type RepoRunnerList = Endpoints['GET /repos/{owner}/{repo}/actions/runners']['response']['data']['runners'];
type RunnerState = OrgRunnerList[number] | RepoRunnerList[number];

async function getOrCreateOctokit(runner: RunnerInfo): Promise<Octokit> {
const key = runner.owner;
const cachedOctokit = githubCache.clients.get(key);
Expand Down Expand Up @@ -46,7 +51,11 @@ async function getOrCreateOctokit(runner: RunnerInfo): Promise<Octokit> {
return octokit;
}

async function getGitHubRunnerBusyState(client: Octokit, ec2runner: RunnerInfo, runnerId: number): Promise<boolean> {
async function getGitHubSelfHostedRunnerState(
client: Octokit,
ec2runner: RunnerInfo,
runnerId: number,
): Promise<RunnerState> {
const state =
ec2runner.type === 'Org'
? await client.actions.getSelfHostedRunnerForOrg({
Expand All @@ -58,12 +67,24 @@ async function getGitHubRunnerBusyState(client: Octokit, ec2runner: RunnerInfo,
owner: ec2runner.owner.split('/')[0],
repo: ec2runner.owner.split('/')[1],
});

logger.info(`Runner '${ec2runner.instanceId}' - GitHub Runner ID '${runnerId}' - Busy: ${state.data.busy}`);

metricGitHubAppRateLimit(state.headers);

return state.data.busy;
return {
id: state.data.id,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it not simpler to re-use either the github state object or simpley explode return the data object instead of this 1-1 mapping.

name: state.data.name,
busy: state.data.busy,
status: state.data.status,
os: state.data.os,
labels: state.data.labels,
runner_group_id: state.data.runner_group_id,
ephemeral: state.data.ephemeral,
};
}

async function getGitHubRunnerBusyState(client: Octokit, ec2runner: RunnerInfo, runnerId: number): Promise<boolean> {
const state = await getGitHubSelfHostedRunnerState(client, ec2runner, runnerId);
logger.info(`Runner '${ec2runner.instanceId}' - GitHub Runner ID '${runnerId}' - Busy: ${state.busy}`);
return state.busy;
}

async function listGitHubRunners(runner: RunnerInfo): Promise<GhRunners> {
Expand Down Expand Up @@ -200,18 +221,42 @@ async function markOrphan(instanceId: string): Promise<void> {
}
}

async function lastChanceCheckOrphanRunner(runner: RunnerList): Promise<void> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me the method is confusing, it looks terminationg is now down on two palces. Makes the code flow hard to follow. Also lastChanceCheckOrphanRunner sounds like a last check, not terinating.

const client = await getOrCreateOctokit(runner as RunnerInfo);
const runnerId = parseInt(runner.runnerId || '0');
const ec2Instance = runner as RunnerInfo;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why casting here? Does the input object not contain the right information?

const state = await getGitHubSelfHostedRunnerState(client, ec2Instance, runnerId);
logger.debug(`Runner is currently '${runner.instanceId}' state: ${JSON.stringify(state)}`);
if (state.status === 'online' && state.busy) {
logger.info(`Runner '${runner.instanceId}' is orphan, but is online and busy.`);
await untag(runner.instanceId, [{ Key: 'ghr:orphan', Value: 'true' }]);
} else if (state.status === 'offline') {
logger.warn(`Runner '${runner.instanceId}' is orphan.`);
logger.info(`Terminating orphan runner '${runner.instanceId}'`);
await terminateRunner(runner.instanceId).catch((e) => {
logger.error(`Failed to terminate orphan runner '${runner.instanceId}'`, { error: e });
});
}
}

async function terminateOrphan(environment: string): Promise<void> {
try {
const orphanRunners = await listEC2Runners({ environment, orphan: true });

for (const runner of orphanRunners) {
logger.info(`Terminating orphan runner '${runner.instanceId}'`);
await terminateRunner(runner.instanceId).catch((e) => {
logger.error(`Failed to terminate orphan runner '${runner.instanceId}'`, { error: e });
});
// do we have a valid runnerId? then we are in a Jit Runner scenario else, use old method
if (runner.runnerId) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the code remains simpler if you update fhe code as follow.

pseudeo code

if (runner.id) {
   checkRunnerStillOrphan ? : terminate() : untag()
else 
   terminate
}

or similar this wil keep the main control flow in one view. And avoiding termination is invoked in two differnet functions.

logger.debug(`Runner '${runner.instanceId}' is orphan, but has a runnerId.`);
await lastChanceCheckOrphanRunner(runner);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

untagging can lead to an exception, exception looks like not handled at all. Would suggest to handle a potential exception inside the untag function and return a boolean.

} else {
logger.info(`Terminating orphan runner '${runner.instanceId}'`);
await terminateRunner(runner.instanceId).catch((e) => {
logger.error(`Failed to terminate orphan runner '${runner.instanceId}'`, { error: e });
});
}
}
} catch (e) {
logger.warn(`Failure during orphan runner termination.`, { error: e });
logger.warn(`Failure during orphan termination processing.`, { error: e });
}
}

Expand Down
Loading