Description
Checks
- I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- I am using charts that are officially provided
Controller Version
0.6.1
Deployment Method
Helm
Checks
- This isn't a question or user support case (For Q&A and community support, go to Discussions).
- I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
- Create a runner in kubernetes mode by the book.
Runner template
```yaml
template:
  metadata:
    labels:
      app: myarc
  spec:
    initContainers:
      - name: init-k8s-volume-permissions
        image: ghcr.io/actions/actions-runner:latest
        command: ["/bin/sh", "-c"]
        args:
          - |
            sudo chown -R 1001:123 /home/runner/_work
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "true"
          - name: ACTIONS_RUNNER_CONTAINER_HOOKS
            value: /home/runner/k8s/index.js
          - name: ACTIONS_RUNNER_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/pod-templates/default.yaml
        securityContext:
          runAsUser: 1001
          runAsGroup: 123
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            memory: 512Mi
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: pod-templates
            mountPath: /home/runner/pod-templates
            readOnly: true
    volumes:
      - name: pod-templates
        configMap:
          name: pod-templates
```

Setting `minRunners: 0` and `maxRunners: 5` also helps highlight this issue in my example.
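For completeness, here is a minimal sketch of the Helm values I mean (assuming the official gha-runner-scale-set chart; the URL and secret name are placeholders from my setup):

```yaml
# Sketch of the scale set values -- only minRunners/maxRunners matter here.
githubConfigUrl: https://github.com/my-org/my-repo   # placeholder URL
githubConfigSecret: gh-arc-secret                    # placeholder secret name
minRunners: 0   # scale to zero runners when idle
maxRunners: 5
template:
  # ... runner template as shown above ...
```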
- Make sure you are using an autoscaling node pool (one that can scale down to very few, or even zero, nodes); see the sketch below.
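For example, on EKS with eksctl this could look like the following (just a sketch; names and sizes are hypothetical, and any provider whose node pools scale to zero behaves the same way):

```yaml
# Sketch: a managed node group that the cluster autoscaler can shrink to zero.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ci-cluster          # hypothetical cluster name
  region: eu-west-1
managedNodeGroups:
  - name: arc-runners       # hypothetical node group name
    instanceType: m5.xlarge # ~16GB of memory, as in this reproduction
    minSize: 0              # allows scale-to-zero
    maxSize: 5
```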
- Make sure the workflow pod has large memory/CPU requests (configured via the PodTemplate below).
PodTemplate ConfigMap
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-templates
  namespace: gh-arc
data:
  default.yaml: |
    ---
    apiVersion: v1
    kind: PodTemplate
    metadata:
      name: runner-pod-template
      namespace: gh-arc
      labels:
        app: runner-pod-template
    spec:
      securityContext:
        runAsUser: 1001
        runAsGroup: 123
      containers:
        - name: $job
          resources:
            requests:
              cpu: 1000m
              memory: 8Gi
            limits:
              memory: 8Gi
```
- Run a simple pipeline that spawns multiple jobs, and hence multiple runners. Here I'm using `strategy:` to make sure they all spawn at nearly the same time, and `container:` to make sure a `-workflow` pod is created.
Actions CI
```yaml
name: GitHub Actions Test
run-name: Test
on: [push]
jobs:
  foo:
    runs-on: myarc
    container: debian
    strategy:
      matrix:
        package:
          - 'common'
          - 'utils'
          - 'ui'
          - 'billing'
    steps:
      - run: echo "Running for ${{ matrix.package }}"
      - name: Check out repository code
        uses: actions/checkout@v3
      - run: sleep 300
```

- Observe your CI jobs failing.
Describe the bug
The runner-scale-set controller receives the workflow run and scales the runner set up to 4 runners.
Each runner pod gets scheduled onto a node. Since each one only requests 512Mi of memory, they all fit on the same 16GB node.
After initializing, each runner spawns a `-workflow` pod next to itself. Kubernetes now tries to schedule 4 pods, each requesting 8Gi of memory, onto that same single node. This fails, since the node does not have enough memory.
On GitHub Actions you can see your job failing at the "Initialize containers" step, with:
```
Run '/home/runner/k8s/index.js'
Warning: Skipping name override: name can't be overwritten
Error: Error: failed to create job pod: HttpError: HTTP request failed
Error: Process completed with exit code 1.
Error: Executing the custom container implementation failed. Please contact your self hosted runner administrator.
```
Describe the expected behavior
The `-workflow` pod should be scheduled by kube-scheduler onto a different node.
I suspect the container hook does not rely on kube-scheduler's normal placement, and every workflow pod necessarily ends up next to its runner pod, maybe due to a constraint I haven't spotted (volumes, maybe?); see the sketch below.
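To illustrate my guess (a sketch of my understanding, not verified against the hook source): in kubernetes mode the runner's `_work` directory is typically backed by a claim like the one below, and if the hook mounts that same claim into the `-workflow` pod, a `ReadWriteOnce` volume would force both pods onto the same node:

```yaml
# Hypothetical sketch of the shared work volume that could pin the pods together.
volumes:
  - name: work
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]   # attachable to a single node only
          resources:
            requests:
              storage: 1Gi
```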
I've seen a lot of people setting resource requests on the runner container, but I fail to see how that solves my issue, since the job actually runs in the workflow pod.
Additional Context
All configs are in the reproduction steps.
We are mostly trying to reduce our Kubernetes node costs, and we don't find it acceptable to keep a few large nodes idle just in case a CI pipeline gets triggered. Hence the requirement to scale the node pool to zero and to set the runner minimum to zero as well.
We also provide multiple runner sizes, so our developers can simply pick the desired runner size with labels like memory-xl, cpu-s, etc.
Controller Logs
https://gist.github.com/guillaumevillemont/9d6bb8cd62ef5c1dd5b78f30b225a182

Runner Pod Logs
https://gist.github.com/guillaumevillemont/9d6bb8cd62ef5c1dd5b78f30b225a182