CML Kubernetes self-hosted runner is registered to GitHub but the workflow never continues #1415

Open

Description

Hi CML team,

I'm facing an issue with CML when creating a self-hosted runner for GitHub on a Google Cloud Kubernetes cluster.

The runner is created and appears to register with GitHub. However, the workflow never continues and hangs on:

"***"level":"info","message":"iterative_cml_runner.runner: Still creating... [hhmsss elapsed]"***"

I'm using the following steps to create the runner:

  1. Create a personal access token (PAT) with repo scope.
  2. Store the PAT in a GitHub repository secret named CML_PAT.
  3. Create a Google Service Account key to allow access to the Kubernetes cluster.
  4. Store the Service Account Key in a GitHub repository secret named GCP_SERVICE_ACCOUNT_KEY.
  5. Create a GitHub Workflow file with the following content:
    name: Workflow from actions
    
    on:
      push:
    
    jobs:
      setup-runner:
        runs-on: ubuntu-latest
        steps:
          - name: Checkout repository
            uses: actions/checkout@v3
          ## Google Cloud
          - name: Login to Google Cloud
            uses: 'google-github-actions/auth@v1'
            with:
              credentials_json: '${{ secrets.GCP_SERVICE_ACCOUNT_KEY }}'
          - name: Get Google Cloud's Kubernetes credentials
            uses: 'google-github-actions/get-gke-credentials@v1'
            with:
              cluster_name: 'mlops-workshop'
              location: 'europe-west6-a'
          ## CML
          - name: Setup Node
            uses: actions/setup-node@v3
            with:
              node-version: '16'
          - name: Setup CML
            uses: iterative/setup-cml@v1
          - name: Initialize runner on Kubernetes
            env:
              REPO_TOKEN: ${{ secrets.CML_PAT }}
            run: |
              export KUBERNETES_CONFIGURATION=$(cat $KUBECONFIG)
              # https://cml.dev/doc/ref/runner
              # https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task#machine-type
              # https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task#{cpu}-{memory}
              cml runner launch \
                --labels="cml-runner-from-actions" \
                --cloud="kubernetes" \
                --cloud-type="s"
    
      use-runner:
        needs: setup-runner
        runs-on: [self-hosted, cml-runner-from-actions]
        steps:
          - name: Checkout repository
            uses: actions/checkout@v3
          # Node is required to run CML
          - name: Setup Node
            uses: actions/setup-node@v3
            with:
              node-version: '16'
          - name: Setup CML
            uses: iterative/setup-cml@v1
          - name: Create CML report
            env:
              REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
            run: |
              echo "It's a success!" >> report.md
              cml comment update --publish report.md
  6. Create a commit to trigger the workflow.

Here are some logs that might help you:

Logs of the runner just after the start
$ kubectl logs cml-hothxdswe6-5u5j99rk-34bk17gn-gdmc8
Failed to get unit file state for cml.service: No such file or directory
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 84.5M  100 84.5M    0     0  27.8M      0  0:00:03  0:00:03 --:--:-- 41.1M
bash: line 23: lsof: command not found
{"level":"info","message":"POST /repos/csia-pme/cml-with-tpi-from-sources/actions/runners/registration-token - 201 in 290ms"}
{"level":"info","message":"GET /repos/csia-pme/cml-with-tpi-from-sources/actions/runners?per_page=100 - 200 in 249ms"}
{"level":"warn","message":"Github Actions timeout has been updated from 72h to 35 days. Update your workflow accordingly to be able to restart it automatically."}
{"level":"info","message":"Preparing workdir /home/runner..."}
{"level":"info","message":"Launching github runner"}
{"level":"info","message":"Terraform 1.5.5"}
{"level":"info","message":"Plan: 0 to add, 0 to change, 0 to destroy."}
{"level":"info","message":"Apply complete! Resources: 0 added, 0 changed, 0 destroyed."}
{"level":"info","message":"Outputs: 0"}
{"level":"warn","message":"Error connecting to ACPI socket: connect ENOENT /var/run/acpid.socket. The acpid.service helps with instance termination detection."}
{"level":"info","message":"POST /repos/csia-pme/cml-with-tpi-from-sources/actions/runners/registration-token - 201 in 309ms"}
{"date":"2023-08-16T14:14:13.820Z","level":"info","message":"runner status","repo":"https://github.com/csia-pme/cml-with-tpi-from-sources","status":"ready"}
Logs of the runner after some time
$ kubectl logs cml-hothxdswe6-5u5j99rk-34bk17gn-gdmc8
Failed to get unit file state for cml.service: No such file or directory
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 84.5M  100 84.5M    0     0  27.8M      0  0:00:03  0:00:03 --:--:-- 41.1M
bash: line 23: lsof: command not found
{"level":"info","message":"POST /repos/csia-pme/cml-with-tpi-from-sources/actions/runners/registration-token - 201 in 290ms"}
{"level":"info","message":"GET /repos/csia-pme/cml-with-tpi-from-sources/actions/runners?per_page=100 - 200 in 249ms"}
{"level":"warn","message":"Github Actions timeout has been updated from 72h to 35 days. Update your workflow accordingly to be able to restart it automatically."}
{"level":"info","message":"Preparing workdir /home/runner..."}
{"level":"info","message":"Launching github runner"}
{"level":"info","message":"Terraform 1.5.5"}
{"level":"info","message":"Plan: 0 to add, 0 to change, 0 to destroy."}
{"level":"info","message":"Apply complete! Resources: 0 added, 0 changed, 0 destroyed."}
{"level":"info","message":"Outputs: 0"}
{"level":"warn","message":"Error connecting to ACPI socket: connect ENOENT /var/run/acpid.socket. The acpid.service helps with instance termination detection."}
{"level":"info","message":"POST /repos/csia-pme/cml-with-tpi-from-sources/actions/runners/registration-token - 201 in 309ms"}
{"date":"2023-08-16T14:14:13.820Z","level":"info","message":"runner status","repo":"https://github.com/csia-pme/cml-with-tpi-from-sources","status":"ready"}
{"level":"info","message":"Unregistering runner cml-hothxdswe6-5u5j99rk-34bk17gn..."}
{"level":"info","message":"GET /repos/csia-pme/cml-with-tpi-from-sources/actions/runners?per_page=100 - 200 in 301ms"}
{"level":"info","message":"DELETE /repos/csia-pme/cml-with-tpi-from-sources/actions/runners/5 - 204 in 410ms"}
{"level":"info","message":"\tSuccess"}
{"level":"info","message":"Waiting 10 seconds to destroy"}
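The runner log above mixes plain-text lines (curl progress, bash errors) with JSON records, which makes the runner's lifecycle hard to scan. Here is a minimal Python sketch for pulling out the key events, assuming the `kubectl logs` output is available as a string. The helper names `json_events` and `lifecycle_messages` are hypothetical, and the message prefixes are simply taken from the logs in this issue, not from any CML specification.

```python
import json

# Message prefixes taken from the runner logs in this issue (an assumption,
# not an exhaustive list of what CML can emit).
LIFECYCLE_PREFIXES = (
    "Launching github runner",
    "runner status",
    "Unregistering runner",
)


def json_events(log_text):
    """Yield each log line that parses as a JSON object."""
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # curl progress output, bash errors, etc.
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(event, dict):
            yield event


def lifecycle_messages(log_text):
    """Return the messages that mark the runner's start, readiness and removal."""
    return [
        event["message"]
        for event in json_events(log_text)
        if str(event.get("message", "")).startswith(LIFECYCLE_PREFIXES)
    ]


# Shortened sample built from lines in the log above.
sample = "\n".join([
    "bash: line 23: lsof: command not found",
    '{"level":"info","message":"Launching github runner"}',
    '{"level":"info","message":"Outputs: 0"}',
    '{"level":"info","message":"runner status","status":"ready"}',
    '{"level":"info","message":"Unregistering runner cml-hothxdswe6-5u5j99rk-34bk17gn..."}',
])
print(lifecycle_messages(sample))
```

On the sample, this keeps only the launch, "runner status" and unregistering messages. That matches the sequence in the full log: the runner reports ready and is later unregistered.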
Logs of the GitHub workflow
***"level":"info","message":"POST /repos/csia-pme/cml-with-tpi-from-sources/actions/runners/registration-token - 201 in 133ms"***
***"level":"info","message":"GET /repos/csia-pme/cml-with-tpi-from-sources/actions/runners?per_page=100 - 200 in 130ms"***
***"level":"warn","message":"Github Actions timeout has been updated from 72h to 35 days. Update your workflow accordingly to be able to restart it automatically."***
***"level":"warn","message":"ignoring RUNNER_NAME environment variable, use CML_RUNNER_NAME or --name instead"***
***"level":"info","message":"Preparing workdir /home/runner/.cml/hothxdswe6..."***
***"level":"info","message":"Deploying cloud runner plan..."***
***"level":"info","message":"Terraform apply..."***
***"level":"info","message":"Terraform 1.5.4"***
***"level":"info","message":"iterative_cml_runner.runner: Plan to create"***
***"level":"info","message":"Plan: 1 to add, 0 to change, 0 to destroy."***
***"level":"info","message":"iterative_cml_runner.runner: Creating..."***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [1m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [1m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [1m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [1m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [1m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [1m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [2m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [2m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [2m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [2m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [2m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [2m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [3m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [3m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [3m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [3m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [3m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [3m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [4m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [4m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [4m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [4m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [4m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [4m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [5m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [5m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [5m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [5m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [5m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [5m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [6m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [6m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [6m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [6m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [6m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [6m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [7m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [7m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [7m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [7m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [7m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [7m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [8m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [8m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [8m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [8m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [8m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [8m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [9m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [9m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [9m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [9m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [9m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [9m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [10m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [10m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [10m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [10m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [10m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [10m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [11m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [11m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [11m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [11m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [11m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [11m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [12m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [12m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [12m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [12m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [12m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [12m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [13m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [13m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [13m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [13m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [13m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [13m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [14m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [14m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [14m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [14m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [14m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [14m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [15m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [15m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [15m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [15m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [15m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [15m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [16m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [16m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [16m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [16m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [16m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [16m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [17m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [17m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [17m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [17m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [17m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [17m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [18m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [18m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [18m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [18m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [18m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [18m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [19m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [19m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [19m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [19m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [19m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [19m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [20m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Creation errored after 20m3s"***

I was able to confirm that the runner registered with GitHub by running the following command (from the GitHub API documentation):

curl -L \
    -H "Accept: application/vnd.github+json" \
    -H "Authorization: Bearer MY_CML_PAT" \
    -H "X-GitHub-Api-Version: 2022-11-28" \
    https://api.github.com/repos/csia-pme/cml-with-tpi-from-sources/actions/runners
Output of the cURL command
$ curl -L \
    -H "Accept: application/vnd.github+json" \
    -H "Authorization: Bearer MY_CML_PAT" \
    -H "X-GitHub-Api-Version: 2022-11-28" \
    https://api.github.com/repos/csia-pme/cml-with-tpi-from-sources/actions/runners
{
  "total_count": 1,
  "runners": [
    {
      "id": 5,
      "name": "cml-hothxdswe6-5u5j99rk-34bk17gn",
      "os": "Linux",
      "status": "online",
      "busy": false,
      "labels": [
        {
          "id": 1,
          "name": "self-hosted",
          "type": "read-only"
        },
        {
          "id": 2,
          "name": "Linux",
          "type": "read-only"
        },
        {
          "id": 3,
          "name": "X64",
          "type": "read-only"
        },
        {
          "id": 5,
          "name": "cml-runner-from-actions",
          "type": "custom"
        }
      ]
    }
  ]
}
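
The response above can also be checked programmatically rather than by eye. Here is a minimal Python sketch, assuming the JSON body returned by the runners endpoint has been captured as text. The function name `runners_with_label` is hypothetical, and the sample is a shortened copy of the output shown above.

```python
import json


def runners_with_label(response_text, label):
    """Return (name, status, busy) for each runner that carries the given label."""
    data = json.loads(response_text)
    matches = []
    for runner in data.get("runners", []):
        label_names = {item["name"] for item in runner.get("labels", [])}
        if label in label_names:
            matches.append((runner["name"], runner["status"], runner["busy"]))
    return matches


# Shortened copy of the API response shown above.
SAMPLE = """
{
  "total_count": 1,
  "runners": [
    {
      "id": 5,
      "name": "cml-hothxdswe6-5u5j99rk-34bk17gn",
      "os": "Linux",
      "status": "online",
      "busy": false,
      "labels": [
        {"id": 1, "name": "self-hosted", "type": "read-only"},
        {"id": 5, "name": "cml-runner-from-actions", "type": "custom"}
      ]
    }
  ]
}
"""

print(runners_with_label(SAMPLE, "cml-runner-from-actions"))
# prints [('cml-hothxdswe6-5u5j99rk-34bk17gn', 'online', False)]
```

A result with status `online` and `busy` set to `False` means GitHub sees the runner as registered and idle, which matches the behavior described in this issue: the runner exists, but no job is picked up.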

You can find a repository with the code used to reproduce this issue here.

I created two workflows to test the runner:

You can find the execution of the two workflows here and here.

I tried many things to make it work, but I was not able to find a solution. I tried to:

  • Use a different runner (I tried with a runner with different specs and on a different GitHub repository)
  • Set a PAT with all the scopes
  • Set a PAT with only the repo scope
  • Add permissions to the GitHub workflow file
  • Try older versions of CML (0.18.x)
  • Use the hidden --cloud-image="iterativeai/cml:0-dvc3-base1-gpu", --tpi-version="= 0.11.18" and --cml-version="0.19.0" arguments to pin older versions of CML and TPI
  • Build from sources

Please let me know if I can be of any help, and thank you!
