GCP VM instance does not terminate after idle timeout #1386

Open

Description

Related to issue #834

I am facing this issue with cml v0.19.0.

I'm starting a GCP instance with the following command, run inside a Docker container that uses the iterativeai/cml:latest image in a Buildkite pipeline. I also pass GOOGLE_APPLICATION_CREDENTIALS_DATA as an environment variable. The instance starts and registers the self-hosted runner without any problem.

I configured gcloud with:

echo "$GOOGLE_APPLICATION_CREDENTIALS_DATA" >~/gcloud-service-account-key.json
gcloud -q auth activate-service-account --key-file ~/gcloud-service-account-key.json
gcloud -q config set project PROJECT
gcloud -q auth configure-docker

Then I run cml runner launch with:

cml runner launch \
--cloud=gcp \
--cloud-region=us-central1-a \
--cloud-type=m+t4 \
--name="$CML_RUNNER_NAME" \
--labels=gcp-snapshot-creation \
--token="$BUILDKITE_GITHUB_TOKEN" \
--repo=https://github.com/ORG/REPO.git \
--cloud-hdd-size=60 \
--idle-timeout=120

However, after the idle timeout passes, the GCP VM doesn't terminate. The output of journalctl --unit cml --no-pager is as follows:

-- Logs begin at Fri 2023-06-09 10:33:16 UTC, end at Fri 2023-06-09 11:28:01 UTC. --
Jun 09 10:38:43 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 systemd[1]: Started cml.service.
Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"POST /repos/ORG/REPO/actions/runners/registration-token - 201 in 287ms"}
Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"GET /repos/ORG/REPO/actions/runners?per_page=100 - 200 in 234ms"}
Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"warn","message":"Github Actions timeout has been updated from 72h to 35 days. Update your workflow accordingly to be able to restart it automatically."}
Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Preparing workdir /tmp/tmp.SpVQaKjpnL/.cml/cml-REPO-snapshot-t4-wi89wx1i-4src3nn4..."}
Jun 09 10:38:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Launching github runner"}
Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Terraform 1.4.6"}
Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Plan: 0 to add, 0 to change, 0 to destroy."}
Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Apply complete! Resources: 0 added, 0 changed, 0 destroyed."}
Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Outputs: 0"}
Jun 09 10:38:48 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Connected to acpid service."}
Jun 09 10:38:58 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"POST /repos/ORG/REPO/actions/runners/registration-token - 201 in 307ms"}
Jun 09 10:39:01 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:39:01.672Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"ready"}
Jun 09 10:39:20 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:39:20.270Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_started"}
Jun 09 10:43:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:43:44.840Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_ended","success":true}
Jun 09 10:43:47 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T10:43:47.170Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_started"}
Jun 09 11:02:44 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"date":"2023-06-09T11:02:44.588Z","level":"info","message":"runner status","repo":"https://github.com/ORG/REPO","status":"job_ended","success":true}
Jun 09 11:04:45 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Unregistering runner cml-REPO-snapshot-t4-wi89wx1i-4src3nn4..."}
Jun 09 11:04:45 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"GET /repos/ORG/REPO/actions/runners?per_page=100 - 403 in 159ms"}
Jun 09 11:04:45 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Retrying because of rate limit in 990 seconds"}
Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"GET /repos/ORG/REPO/actions/runners?per_page=100 - 200 in 260ms"}
Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"DELETE /repos/ORG/REPO/actions/runners/314 - 204 in 371ms"}
Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"\tSuccess"}
Jun 09 11:21:15 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"Waiting 10 seconds to destroy"}
Jun 09 11:21:26 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 cml.sh[40734]: {"level":"info","message":"runner status","reason":"timeout:120","status":"terminated"}
Jun 09 11:21:26 cml-REPO-snapshot-t4-wi89wx1i-4src3nn4 systemd[1]: cml.service: Succeeded
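
To make the timeline in the log above easier to read, the "runner status" events can be pulled out with a small filter. This is only a sketch: the function name cml_status_events is hypothetical, and it assumes the journal line layout shown above (month, day, time, host, then a JSON payload containing a "status" field).

```shell
# Hypothetical helper (not part of cml): reads journalctl output on stdin
# and prints "HH:MM:SS status" for every line whose JSON payload has a
# "status" field. Lines without such a field are dropped.
cml_status_events() {
  sed -n 's/^[A-Z][a-z][a-z] [0-9 ][0-9] \([0-9:]\{8\}\) .*"status":"\([a-z_]*\)".*$/\1 \2/p'
}

# Usage on the VM (assumes the same unit name as in this report):
#   journalctl --unit cml --no-pager | cml_status_events
```

Lines such as "Unregistering runner" and the rate-limit retry carry no "status" field, so they are not printed; they still need to be read from the full log.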

I am not facing this issue with AWS EC2 instances; those terminate after the idle timeout.


Labels: bug, cloud-gcp, cml-runner