Description
Opened on Aug 16, 2023
Hi CML team,
I'm facing an issue with CML when creating a self-hosted runner for GitHub on a Google Cloud Kubernetes cluster.
The runner is created and seems to register to GitHub. However, the workflow never continues and hangs on
"***"level":"info","message":"iterative_cml_runner.runner: Still creating... [hhmsss elapsed]"***"
I'm using the following steps to create the runner:
- Create a personal access token (PAT) with the `repo` scope.
- Store the PAT in a GitHub repository secret named `CML_PAT`.
- Create a Google Service Account Key to allow access to the Kubernetes cluster.
- Store the Service Account Key in a GitHub repository secret named `GCP_SERVICE_ACCOUNT_KEY`.
- Create a GitHub Workflow file with the following content:
name: Workflow from actions

on:
  push:

jobs:
  setup-runner:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
      ## Google Cloud
      - name: Login to Google Cloud
        uses: 'google-github-actions/auth@v1'
        with:
          credentials_json: '${{ secrets.GCP_SERVICE_ACCOUNT_KEY }}'
      - name: Get Google Cloud's Kubernetes credentials
        uses: 'google-github-actions/get-gke-credentials@v1'
        with:
          cluster_name: 'mlops-workshop'
          location: 'europe-west6-a'
      ## CML
      - name: Setup Node
        uses: actions/setup-node@v3
        with:
          node-version: '16'
      - name: Setup CML
        uses: iterative/setup-cml@v1
      - name: Initialize runner on Kubernetes
        env:
          REPO_TOKEN: ${{ secrets.CML_PAT }}
        run: |
          export KUBERNETES_CONFIGURATION=$(cat $KUBECONFIG)
          # https://cml.dev/doc/ref/runner
          # https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task#machine-type
          # https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task#{cpu}-{memory}
          cml runner launch \
            --labels="cml-runner-from-actions" \
            --cloud="kubernetes" \
            --cloud-type="s"

  use-runner:
    needs: setup-runner
    runs-on: [self-hosted, cml-runner-from-actions]
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
      # Node is required to run CML
      - name: Setup Node
        uses: actions/setup-node@v3
        with:
          node-version: '16'
      - name: Setup CML
        uses: iterative/setup-cml@v1
      - name: Create CML report
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          echo "It's a success!" >> report.md
          cml comment update --publish report.md
- Create a commit to trigger the workflow.
Here are some logs that might help you:
Logs of the runner just after the start
$ kubectl logs cml-hothxdswe6-5u5j99rk-34bk17gn-gdmc8
Failed to get unit file state for cml.service: No such file or directory
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 84.5M 100 84.5M 0 0 27.8M 0 0:00:03 0:00:03 --:--:-- 41.1M
bash: line 23: lsof: command not found
{"level":"info","message":"POST /repos/csia-pme/cml-with-tpi-from-sources/actions/runners/registration-token - 201 in 290ms"}
{"level":"info","message":"GET /repos/csia-pme/cml-with-tpi-from-sources/actions/runners?per_page=100 - 200 in 249ms"}
{"level":"warn","message":"Github Actions timeout has been updated from 72h to 35 days. Update your workflow accordingly to be able to restart it automatically."}
{"level":"info","message":"Preparing workdir /home/runner..."}
{"level":"info","message":"Launching github runner"}
{"level":"info","message":"Terraform 1.5.5"}
{"level":"info","message":"Plan: 0 to add, 0 to change, 0 to destroy."}
{"level":"info","message":"Apply complete! Resources: 0 added, 0 changed, 0 destroyed."}
{"level":"info","message":"Outputs: 0"}
{"level":"warn","message":"Error connecting to ACPI socket: connect ENOENT /var/run/acpid.socket. The acpid.service helps with instance termination detection."}
{"level":"info","message":"POST /repos/csia-pme/cml-with-tpi-from-sources/actions/runners/registration-token - 201 in 309ms"}
{"date":"2023-08-16T14:14:13.820Z","level":"info","message":"runner status","repo":"https://github.com/csia-pme/cml-with-tpi-from-sources","status":"ready"}
Logs of the runner after some time
$ kubectl logs cml-hothxdswe6-5u5j99rk-34bk17gn-gdmc8
Failed to get unit file state for cml.service: No such file or directory
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 84.5M 100 84.5M 0 0 27.8M 0 0:00:03 0:00:03 --:--:-- 41.1M
bash: line 23: lsof: command not found
{"level":"info","message":"POST /repos/csia-pme/cml-with-tpi-from-sources/actions/runners/registration-token - 201 in 290ms"}
{"level":"info","message":"GET /repos/csia-pme/cml-with-tpi-from-sources/actions/runners?per_page=100 - 200 in 249ms"}
{"level":"warn","message":"Github Actions timeout has been updated from 72h to 35 days. Update your workflow accordingly to be able to restart it automatically."}
{"level":"info","message":"Preparing workdir /home/runner..."}
{"level":"info","message":"Launching github runner"}
{"level":"info","message":"Terraform 1.5.5"}
{"level":"info","message":"Plan: 0 to add, 0 to change, 0 to destroy."}
{"level":"info","message":"Apply complete! Resources: 0 added, 0 changed, 0 destroyed."}
{"level":"info","message":"Outputs: 0"}
{"level":"warn","message":"Error connecting to ACPI socket: connect ENOENT /var/run/acpid.socket. The acpid.service helps with instance termination detection."}
{"level":"info","message":"POST /repos/csia-pme/cml-with-tpi-from-sources/actions/runners/registration-token - 201 in 309ms"}
{"date":"2023-08-16T14:14:13.820Z","level":"info","message":"runner status","repo":"https://github.com/csia-pme/cml-with-tpi-from-sources","status":"ready"}
{"level":"info","message":"Unregistering runner cml-hothxdswe6-5u5j99rk-34bk17gn..."}
{"level":"info","message":"GET /repos/csia-pme/cml-with-tpi-from-sources/actions/runners?per_page=100 - 200 in 301ms"}
{"level":"info","message":"DELETE /repos/csia-pme/cml-with-tpi-from-sources/actions/runners/5 - 204 in 410ms"}
{"level":"info","message":"\tSuccess"}
{"level":"info","message":"Waiting 10 seconds to destroy"}
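For anyone triaging this, the pod logs above mix plain-text noise (curl progress, bash errors) with JSON lines. Below is a minimal sketch, not part of CML, that pulls out only the runner lifecycle events so the "ready" then "Unregistering" sequence is easier to see. The function names and the keyword list are my own assumptions; the sample lines are copied from the logs above.

```python
import json

# Hypothetical keyword prefixes chosen for this sketch; not a CML convention.
LIFECYCLE_PREFIXES = ("Launching", "Unregistering", "Waiting")

SAMPLE = """\
bash: line 23: lsof: command not found
{"level":"info","message":"Launching github runner"}
{"date":"2023-08-16T14:14:13.820Z","level":"info","message":"runner status","repo":"https://github.com/csia-pme/cml-with-tpi-from-sources","status":"ready"}
{"level":"info","message":"Unregistering runner cml-hothxdswe6-5u5j99rk-34bk17gn..."}
{"level":"info","message":"Waiting 10 seconds to destroy"}
"""


def parse_log_lines(text):
    """Return the JSON objects found in the log, skipping non-JSON lines."""
    events = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # curl progress output, bash errors, etc.
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # truncated or malformed line
    return events


def lifecycle(events):
    """Keep only the messages that describe the runner's lifecycle."""
    timeline = []
    for event in events:
        message = event.get("message", "")
        if message == "runner status":
            timeline.append(f"runner status: {event.get('status')}")
        elif message.startswith(LIFECYCLE_PREFIXES):
            timeline.append(message)
    return timeline


if __name__ == "__main__":
    for entry in lifecycle(parse_log_lines(SAMPLE)):
        print(entry)
```

In practice the input would be the saved output of `kubectl logs` for the runner pod rather than the inline sample.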
Logs of the GitHub workflow
***"level":"info","message":"POST /repos/csia-pme/cml-with-tpi-from-sources/actions/runners/registration-token - 201 in 133ms"***
***"level":"info","message":"GET /repos/csia-pme/cml-with-tpi-from-sources/actions/runners?per_page=100 - 200 in 130ms"***
***"level":"warn","message":"Github Actions timeout has been updated from 72h to 35 days. Update your workflow accordingly to be able to restart it automatically."***
***"level":"warn","message":"ignoring RUNNER_NAME environment variable, use CML_RUNNER_NAME or --name instead"***
***"level":"info","message":"Preparing workdir /home/runner/.cml/hothxdswe6..."***
***"level":"info","message":"Deploying cloud runner plan..."***
***"level":"info","message":"Terraform apply..."***
***"level":"info","message":"Terraform 1.5.4"***
***"level":"info","message":"iterative_cml_runner.runner: Plan to create"***
***"level":"info","message":"Plan: 1 to add, 0 to change, 0 to destroy."***
***"level":"info","message":"iterative_cml_runner.runner: Creating..."***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [1m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [1m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [1m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [1m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [1m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [1m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [2m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [2m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [2m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [2m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [2m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [2m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [3m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [3m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [3m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [3m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [3m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [3m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [4m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [4m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [4m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [4m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [4m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [4m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [5m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [5m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [5m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [5m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [5m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [5m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [6m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [6m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [6m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [6m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [6m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [6m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [7m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [7m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [7m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [7m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [7m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [7m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [8m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [8m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [8m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [8m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [8m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [8m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [9m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [9m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [9m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [9m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [9m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [9m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [10m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [10m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [10m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [10m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [10m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [10m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [11m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [11m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [11m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [11m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [11m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [11m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [12m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [12m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [12m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [12m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [12m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [12m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [13m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [13m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [13m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [13m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [13m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [13m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [14m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [14m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [14m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [14m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [14m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [14m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [15m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [15m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [15m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [15m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [15m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [15m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [16m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [16m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [16m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [16m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [16m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [16m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [17m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [17m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [17m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [17m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [17m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [17m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [18m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [18m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [18m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [18m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [18m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [18m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [19m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [19m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [19m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [19m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [19m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [19m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [20m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Creation errored after 20m3s"***
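Since the workflow log above is mostly repeated "Still creating..." lines, here is a small sketch that reduces it to the longest wait that Terraform reported. It assumes the elapsed time is always printed as `[<m>m<s>s elapsed]` or `[<s>s elapsed]`, as in the log above; durations with an hours component are not handled. The function names are hypothetical.

```python
import re

# Matches "[10s elapsed]" and "[1m10s elapsed]"; the minutes part is optional.
ELAPSED = re.compile(r"Still creating\.\.\. \[(?:(\d+)m)?(\d+)s elapsed\]")


def elapsed_seconds(line):
    """Return the elapsed time in seconds, or None if the line has no match."""
    match = ELAPSED.search(line)
    if match is None:
        return None
    minutes = int(match.group(1) or 0)
    seconds = int(match.group(2))
    return minutes * 60 + seconds


def longest_wait(log_text):
    """Return the largest elapsed value found in the log, or None."""
    values = [
        value
        for value in map(elapsed_seconds, log_text.splitlines())
        if value is not None
    ]
    return max(values) if values else None


if __name__ == "__main__":
    sample = "\n".join(
        [
            'iterative_cml_runner.runner: Still creating... [10s elapsed]',
            'iterative_cml_runner.runner: Still creating... [1m10s elapsed]',
            'iterative_cml_runner.runner: Still creating... [20m0s elapsed]',
            'iterative_cml_runner.runner: Creation errored after 20m3s',
        ]
    )
    print(longest_wait(sample))
```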
I was able to confirm that the runner registered with GitHub successfully by running the following command (from the GitHub API documentation):
curl -L \
-H "Accept: application/vnd.github+json" \
-H "Authorization: Bearer MY_CML_PAT" \
-H "X-GitHub-Api-Version: 2022-11-28" \
https://api.github.com/repos/csia-pme/cml-with-tpi-from-sources/actions/runners
Output of the cURL command
$ curl -L \
-H "Accept: application/vnd.github+json" \
-H "Authorization: Bearer MY_CML_PAT" \
-H "X-GitHub-Api-Version: 2022-11-28" \
https://api.github.com/repos/csia-pme/cml-with-tpi-from-sources/actions/runners
{
  "total_count": 1,
  "runners": [
    {
      "id": 5,
      "name": "cml-hothxdswe6-5u5j99rk-34bk17gn",
      "os": "Linux",
      "status": "online",
      "busy": false,
      "labels": [
        {
          "id": 1,
          "name": "self-hosted",
          "type": "read-only"
        },
        {
          "id": 2,
          "name": "Linux",
          "type": "read-only"
        },
        {
          "id": 3,
          "name": "X64",
          "type": "read-only"
        },
        {
          "id": 5,
          "name": "cml-runner-from-actions",
          "type": "custom"
        }
      ]
    }
  ]
}
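To make the check above repeatable, here is a minimal sketch that takes the JSON returned by the runners endpoint and tests whether any runner carries every label listed in the job's `runs-on` and is online and idle. The function names are hypothetical, and comparing labels case-insensitively is an assumption on my side. The sample payload is the response shown above.

```python
import json

SAMPLE = """
{
  "total_count": 1,
  "runners": [
    {
      "id": 5,
      "name": "cml-hothxdswe6-5u5j99rk-34bk17gn",
      "os": "Linux",
      "status": "online",
      "busy": false,
      "labels": [
        {"id": 1, "name": "self-hosted", "type": "read-only"},
        {"id": 2, "name": "Linux", "type": "read-only"},
        {"id": 3, "name": "X64", "type": "read-only"},
        {"id": 5, "name": "cml-runner-from-actions", "type": "custom"}
      ]
    }
  ]
}
"""


def matching_runners(payload, required_labels):
    """Return the runners that carry every label in required_labels."""
    # Assumption: labels are compared case-insensitively.
    required = {label.lower() for label in required_labels}
    matches = []
    for runner in payload.get("runners", []):
        names = {label["name"].lower() for label in runner.get("labels", [])}
        if required <= names:
            matches.append(runner)
    return matches


def is_available(runner):
    """A runner can pick up a job when it is online and not busy."""
    return runner.get("status") == "online" and not runner.get("busy", False)


if __name__ == "__main__":
    payload = json.loads(SAMPLE)
    # Labels taken from `runs-on` in the `use-runner` job of the workflow.
    for runner in matching_runners(payload, ["self-hosted", "cml-runner-from-actions"]):
        print(runner["name"], "available:", is_available(runner))
```

With the response above, the runner matches both labels and is reported as available, which is consistent with the registration having succeeded even though the workflow keeps waiting.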
You can find a repository with the code used to reproduce this issue here.
I created two workflows to test the runner:
- `workflow-from-actions.yml` using the official CML GitHub Actions
- `workflow-from-sources.yml` using CML and TPI from sources
You can find the execution of the two workflows here and here.
I tried all sorts of things to make it work, but I was not able to find a solution. I tried to:
- Use a different runner (with different specs and on a different GitHub repository)
- Set a PAT with all the scopes
- Set a PAT with only the `repo` scope
- Add `permissions` to the GitHub workflow file
- Use older versions of CML (`0.18.x`)
- Use the hidden `--cloud-image="iterativeai/cml:0-dvc3-base1-gpu"`, `--tpi-version="= 0.11.18"` and `--cml-version="0.19.0"` arguments to set older versions of CML and TPI
- Build from sources
Please let me know if I can be of any help and thank you!