Executing the custom container implementation failed due to Node out of resources #112

@guillaumevillemont

Description

Controller Version

0.6.1

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

  1. Create a runner in kubernetes mode by the book.
Runner template
template:
  metadata:
    labels:
      app: myarc
  spec:
    initContainers:
    - name: init-k8s-volume-permissions
      image: ghcr.io/actions/actions-runner:latest
      command: ["/bin/sh", "-c"]
      args:
        - |
          sudo chown -R 1001:123 /home/runner/_work
      volumeMounts:
        - name: work
          mountPath: /home/runner/_work
    containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest
      command: ["/home/runner/run.sh"]
      env:
        - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
          value: "true"
        - name: ACTIONS_RUNNER_CONTAINER_HOOKS
          value: /home/runner/k8s/index.js
        - name: ACTIONS_RUNNER_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
          value: /home/runner/pod-templates/default.yaml
      securityContext:
        runAsUser: 1001
        runAsGroup: 123
      resources:
        requests:
          cpu: 500m
          memory: 512Mi
        limits:
          memory: 512Mi
      volumeMounts:
        - name: work
          mountPath: /home/runner/_work
        - name: pod-templates
          mountPath: /home/runner/pod-templates
          readOnly: true
    volumes:
      - name: pod-templates
        configMap:
          name: pod-templates

Setting minRunners: 0 and maxRunners: 5 also helps highlight this issue in my example.
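
For reference, this is roughly how the scale set values look with those settings (a minimal sketch; the GitHub URL and secret name are placeholders, and template: is the runner template shown above):

Scale set values (sketch)
githubConfigUrl: https://github.com/my-org      # placeholder org/repo URL
githubConfigSecret: gh-arc-github-secret        # placeholder, pre-created secret
runnerScaleSetName: myarc                       # matches runs-on: myarc in the workflow below
minRunners: 0
maxRunners: 5
# template: the runner pod template from step 1 goes here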

  2. Make sure you are using an autoscaling node pool (one that can scale down to very few or even zero nodes)

  3. Make sure the workflow pod has large memory/CPU requests (set via the PodTemplate below)

PodTemplate ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-templates
  namespace: gh-arc
data:
  default.yaml: |
    ---
    apiVersion: v1
    kind: PodTemplate
    metadata:
      name: runner-pod-template
      namespace: gh-arc
      labels:
        app: runner-pod-template
    spec:
      securityContext:
        runAsUser: 1001
        runAsGroup: 123
      containers:
      - name: $job
        resources:
          requests:
            cpu: 1000m
            memory: 8Gi
          limits:
            memory: 8Gi
  4. Run a simple pipeline that spawns multiple jobs, hence multiple runners.
    Here I'm using strategy: to make sure they all spawn at nearly the same time, and container: to make sure a -workflow pod gets created.
Actions CI
name: GitHub Actions Test
run-name: Test
on: [push]
jobs:
  foo:
    runs-on: myarc
    container: debian
    strategy:
      matrix:
        package:
          - 'common'
          - 'utils'
          - 'ui'
          - 'billing'
    steps:
      - run: echo "Running for ${{ matrix.package }}"
      - name: Check out repository code
        uses: actions/checkout@v3
      - run: sleep 300
  5. Observe your CI jobs failing

Describe the bug

The rs-controller receives the pipeline request and scales the runner set up to 4 runners.

Each runner pod gets scheduled onto the same node. Since they only request 512Mi of memory each, they all fit on that 16GB node.

After initializing, each runner spawns a -workflow pod next to it. Kubernetes now tries to schedule 4 pods that each request 8Gi of memory (32Gi in total) on that same single 16GB node. It fails because the node is out of resources.

On GitHub Actions you can see your job failed at the "Initialize containers" step, with:

Run '/home/runner/k8s/index.js'
Warning: Skipping name override: name can't be overwritten
Error: Error: failed to create job pod: HttpError: HTTP request failed
Error: Process completed with exit code 1.
Error: Executing the custom container implementation failed. Please contact your self hosted runner administrator.

Describe the expected behavior

The -workflow pod should be scheduled by kube-scheduler on a different node.

I think the container hook does not rely on kube-scheduler, and every workflow pod is necessarily spawned next to its runner pod, maybe due to a constraint I haven't seen (the shared volumes, maybe?).

I've seen a lot of people setting resource requests on the runner container, but I fail to see how that can solve my issue, since the job actually runs in the workflow pod.

Additional Context

All configs are in the reproduction steps above.

We are mostly trying to reduce our Kubernetes node costs and don't find it acceptable to keep a few large nodes idle just in case a CI pipeline gets triggered. Hence the requirement to scale the node pool to zero and set the runner minimum to zero too.

We also provide multiple runners by size, so our developers can just pick the desired runner size with labels like memory-xl, cpu-s, etc.
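
For illustration, a job then just picks its size with runs-on (a sketch; memory-xl is a hypothetical scale set name and the step is a placeholder):

jobs:
  build:
    runs-on: memory-xl
    steps:
      - run: echo "running on a memory-xl runner"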

Controller Logs

https://gist.github.com/guillaumevillemont/9d6bb8cd62ef5c1dd5b78f30b225a182

Runner Pod Logs

https://gist.github.com/guillaumevillemont/9d6bb8cd62ef5c1dd5b78f30b225a182


Labels

bug, enhancement, k8s
