First off, thanks for adding GCP support to cml-runner.
While playing around with it, I noticed that my Compute Engine instances weren't being shut down/terminated (i.e., VM powered off) or deleted. I experimented with --idle-timeout and --single, yet neither made a difference: the instance stays alive. This is unexpected given the following sentence from the docs on self-hosted runners:
After the job runs, the instance automatically shuts down.
And indeed, that's what I've observed with AWS instances. Those seem to terminate correctly.
Here's my GitHub workflow for extra context:
name: 'Train-in-the-cloud-GCP'
on:
  workflow_dispatch:
jobs:
  deploy-runner:
    runs-on: [ubuntu-latest]
    steps:
      - uses: iterative/setup-cml@v1
      - uses: actions/checkout@v2
      - name: 'Deploy runner on GCP'
        shell: bash
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          # Notice use of `GOOGLE_APPLICATION_CREDENTIALS_DATA` instead of
          # `GOOGLE_APPLICATION_CREDENTIALS`. Contrary to what docs suggest, the
          # latter causes problems for terraform.
          GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.GOOGLE_APPLICATION_CREDENTIALS }}
        run: |
          cml-runner \
            --cloud gcp \
            --cloud-region europe-west1-b \
            --cloud-type=n1-standard-1 \
            --labels=cml-runner
  model-training:
    needs: deploy-runner
    runs-on: [self-hosted, cml-runner]
    container: docker://dvcorg/cml-py3:latest
    steps:
      - uses: actions/checkout@v2
      - name: 'Train my dummy model'
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
        run: |
          echo "Training a super awesome model"
          sleep 5
          echo "Training complete"
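For reference, the --idle-timeout and --single variations mentioned above would slot into the deploy step roughly as in the fragment below. This is only a sketch of that step, not the exact commands that were run: the timeout value of 60 (seconds) is an assumption chosen for illustration, and combining both flags in one invocation is likewise an assumption.

```yaml
# Fragment of the 'Deploy runner on GCP' step only (not a complete workflow).
run: |
  # --idle-timeout: seconds to wait for jobs before shutting down
  #                 (60 is an illustrative value, not taken from this report)
  # --single:       exit after running a single job
  cml-runner \
    --cloud gcp \
    --cloud-region europe-west1-b \
    --cloud-type=n1-standard-1 \
    --labels=cml-runner \
    --idle-timeout=60 \
    --single
```

With either flag set, the expectation is that the runner exits and the instance is shut down once the job finishes or the timeout elapses; on GCP the instance stayed up regardless.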
Hope there's a way to ensure the same auto-shutdown behavior on GCP. As it is, the risk of getting smacked with an expensive bill for an idle GPU is just too real =p