Closed
Description
openedon Nov 11, 2022
- Patch
cml runner
idleness detection logic #1299 - Add missing
await
and increase polling delay #1300 - Verify that machines are being terminated on failure (to be confirmed with this run)
Traces
1
$ journalctl --unit cml
-- Logs begin at Sat 2022-10-08 21:59:41 UTC, end at Thu 2022-11-17 14:42:27 UTC. --
Nov 17 10:53:35 ip-172-31-81-177 systemd[1]: Started cml.service.
Nov 17 10:53:35 ip-172-31-81-177 cml.sh[2214]: % Total % Received % Xferd Average Speed Time Time Time Current
Nov 17 10:53:35 ip-172-31-81-177 cml.sh[2214]: Dload Upload Total Spent Left Speed
Nov 17 10:53:35 ip-172-31-81-177 cml.sh[2214]: [158B blob data]
Nov 17 10:53:36 ip-172-31-81-177 cml.sh[2214]: {"level":"warn","message":"Github Actions timeout has been updated from 72h to 35 days. Update your workflow accordingly to be able to restart it automatically."}
Nov 17 10:53:36 ip-172-31-81-177 cml.sh[2214]: {"level":"info","message":"Preparing workdir /tmp/tmp.FFQVuNxrbe/.cml/cml-679w0e9r8o..."}
Nov 17 10:53:36 ip-172-31-81-177 cml.sh[2214]: {"level":"info","message":"Launching github runner"}
Nov 17 10:53:41 ip-172-31-81-177 cml.sh[2214]: {"level":"info","message":"Terraform 1.3.2"}
Nov 17 10:53:41 ip-172-31-81-177 cml.sh[2214]: {"level":"info","message":"Plan: 0 to add, 0 to change, 0 to destroy."}
Nov 17 10:53:41 ip-172-31-81-177 cml.sh[2214]: {"level":"info","message":"Apply complete! Resources: 0 added, 0 changed, 0 destroyed."}
Nov 17 10:53:41 ip-172-31-81-177 cml.sh[2214]: {"level":"info","message":"Outputs: 0"}
Nov 17 10:53:41 ip-172-31-81-177 cml.sh[2214]: {"level":"info","message":"Connected to acpid service."}
Nov 17 10:53:51 ip-172-31-81-177 cml.sh[2214]: {"date":"2022-11-17T10:53:51.690Z","level":"info","message":"runner status","repo":"https://github.com/iterative/cml-textual-inversion","status":"ready"}
Nov 17 10:58:23 ip-172-31-81-177 cml.sh[2214]: {"level":"info","message":"Retrying after 153 seconds!"}
Nov 17 11:00:56 ip-172-31-81-177 cml.sh[2214]: {"level":"info","message":"Retrying after 0 seconds!"}
Nov 17 11:00:56 ip-172-31-81-177 cml.sh[2214]: {"level":"info","message":"Retrying after 0 seconds!"}
Nov 17 11:00:56 ip-172-31-81-177 cml.sh[2214]: {"level":"info","message":"Retrying after 0 seconds!"}
Nov 17 11:00:56 ip-172-31-81-177 cml.sh[2214]: {"level":"info","message":"Retrying after 0 seconds!"}
Nov 17 11:04:40 ip-172-31-81-177 cml.sh[2214]: {"level":"info","message":"Retrying after 3377 seconds!"}
Nov 17 12:00:57 ip-172-31-81-177 cml.sh[2214]: {"date":"Thu Nov 17 2022 12:00:57 GMT+0000 (Coordinated Universal Time)","error":{"name":"HttpError","request":{"headers":{"accept":"application/vnd.github.v3+json","authorization":"to
Nov 17 12:00:57 ip-172-31-81-177 cml.sh[2214]: {"level":"info","message":"Unregistering runner cml-679w0e9r8o..."}
Nov 17 12:00:57 ip-172-31-81-177 cml.sh[2214]: {"level":"info","message":"Retrying after 0 seconds!"}
Nov 17 12:00:57 ip-172-31-81-177 cml.sh[2214]: {"level":"info","message":"Retrying after 0 seconds!"}
Nov 17 12:00:57 ip-172-31-81-177 cml.sh[2214]: {"level":"info","message":"Retrying after 0 seconds!"}
Nov 17 12:00:57 ip-172-31-81-177 cml.sh[2214]: {"level":"info","message":"Retrying after 0 seconds!"}
Nov 17 12:00:57 ip-172-31-81-177 cml.sh[2214]: {"level":"info","message":"Retrying after 0 seconds!"}
Nov 17 12:00:58 ip-172-31-81-177 cml.sh[2214]: {"level":"error","message":"\tFailed: Bad request - Runner \"cml-679w0e9r8o\" is still running a job\""}
Nov 17 12:00:58 ip-172-31-81-177 cml.sh[2214]: {"level":"info","message":"Waiting 10 seconds to destroy"}
Nov 17 12:01:00 ip-172-31-81-177 systemd[1]: cml.service: Main process exited, code=exited, status=1/FAILURE
Nov 17 12:01:02 ip-172-31-81-177 systemd[1]: cml.service: Failed with result 'exit-code'.
2
-- Logs begin at Sat 2022-10-08 21:59:41 UTC, end at Fri 2022-12-30 16:03:43 UTC. --
Dec 30 04:51:17 ip-172-31-80-146 systemd[1]: Started cml.service.
Dec 30 04:51:17 ip-172-31-80-146 cml.sh[2897]: % Total % Received % Xferd Average Speed Time Time Time Current
Dec 30 04:51:17 ip-172-31-80-146 cml.sh[2897]: Dload Upload Total Spent Left Speed
Dec 30 04:51:17 ip-172-31-80-146 cml.sh[2897]: [158B blob data]
Dec 30 04:51:18 ip-172-31-80-146 cml.sh[2897]: {"level":"warn","message":"Github Actions timeout has been updated from 72h to 35 days. Update your workflow accordingly to be able to restart it automatically."}
Dec 30 04:51:18 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Preparing workdir /tmp/tmp.DLtgmegfm3/.cml/cml-r6af2ku8y0-joxcs4iu-5bwnb9sg..."}
Dec 30 04:51:18 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Launching github runner"}
Dec 30 04:51:23 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Terraform 1.3.2"}
Dec 30 04:51:24 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Plan: 0 to add, 0 to change, 0 to destroy."}
Dec 30 04:51:24 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Apply complete! Resources: 0 added, 0 changed, 0 destroyed."}
Dec 30 04:51:24 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Outputs: 0"}
Dec 30 04:51:24 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Connected to acpid service."}
Dec 30 04:51:34 ip-172-31-80-146 cml.sh[2897]: {"date":"2022-12-30T04:51:34.742Z","level":"info","message":"runner status","repo":"https://github.com/iterative/cml-textual-inversion","status":"ready"}
Dec 30 05:01:35 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Unregistering runner cml-r6af2ku8y0-joxcs4iu-5bwnb9sg..."}
Dec 30 05:01:35 ip-172-31-80-146 cml.sh[2897]: {"level":"warn","message":"\tCancelling shutdown: Bad request - Runner \"cml-r6af2ku8y0-joxcs4iu-5bwnb9sg\" is still running a job\""}
Dec 30 05:11:37 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Unregistering runner cml-r6af2ku8y0-joxcs4iu-5bwnb9sg..."}
Dec 30 05:11:37 ip-172-31-80-146 cml.sh[2897]: {"level":"warn","message":"\tCancelling shutdown: Bad request - Runner \"cml-r6af2ku8y0-joxcs4iu-5bwnb9sg\" is still running a job\""}
Dec 30 05:21:38 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Unregistering runner cml-r6af2ku8y0-joxcs4iu-5bwnb9sg..."}
Dec 30 05:21:39 ip-172-31-80-146 cml.sh[2897]: {"level":"warn","message":"\tCancelling shutdown: Bad request - Runner \"cml-r6af2ku8y0-joxcs4iu-5bwnb9sg\" is still running a job\""}
Dec 30 05:31:40 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Unregistering runner cml-r6af2ku8y0-joxcs4iu-5bwnb9sg..."}
Dec 30 05:31:40 ip-172-31-80-146 cml.sh[2897]: {"level":"warn","message":"\tCancelling shutdown: Bad request - Runner \"cml-r6af2ku8y0-joxcs4iu-5bwnb9sg\" is still running a job\""}
Dec 30 05:41:41 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Unregistering runner cml-r6af2ku8y0-joxcs4iu-5bwnb9sg..."}
Dec 30 05:41:42 ip-172-31-80-146 cml.sh[2897]: {"level":"warn","message":"\tCancelling shutdown: Bad request - Runner \"cml-r6af2ku8y0-joxcs4iu-5bwnb9sg\" is still running a job\""}
Dec 30 05:51:43 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Unregistering runner cml-r6af2ku8y0-joxcs4iu-5bwnb9sg..."}
Dec 30 05:51:43 ip-172-31-80-146 cml.sh[2897]: {"level":"warn","message":"\tCancelling shutdown: Bad request - Runner \"cml-r6af2ku8y0-joxcs4iu-5bwnb9sg\" is still running a job\""}
Dec 30 06:01:45 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Unregistering runner cml-r6af2ku8y0-joxcs4iu-5bwnb9sg..."}
Dec 30 06:01:45 ip-172-31-80-146 cml.sh[2897]: {"level":"warn","message":"\tCancelling shutdown: Bad request - Runner \"cml-r6af2ku8y0-joxcs4iu-5bwnb9sg\" is still running a job\""}
Dec 30 06:11:46 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Unregistering runner cml-r6af2ku8y0-joxcs4iu-5bwnb9sg..."}
Dec 30 06:11:47 ip-172-31-80-146 cml.sh[2897]: {"level":"warn","message":"\tCancelling shutdown: Bad request - Runner \"cml-r6af2ku8y0-joxcs4iu-5bwnb9sg\" is still running a job\""}
Dec 30 06:21:48 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Unregistering runner cml-r6af2ku8y0-joxcs4iu-5bwnb9sg..."}
Dec 30 06:21:48 ip-172-31-80-146 cml.sh[2897]: {"level":"warn","message":"\tCancelling shutdown: Bad request - Runner \"cml-r6af2ku8y0-joxcs4iu-5bwnb9sg\" is still running a job\""}
Dec 30 06:24:10 ip-172-31-80-146 cml.sh[2897]: {"date":"2022-12-30T06:24:10.677Z","level":"info","message":"runner status","repo":"https://github.com/iterative/cml-textual-inversion","status":"job_ended","success":true}
Dec 30 06:31:49 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Unregistering runner cml-r6af2ku8y0-joxcs4iu-5bwnb9sg..."}
Dec 30 06:31:50 ip-172-31-80-146 cml.sh[2897]: {"level":"warn","message":"\tCancelling shutdown: Bad request - Runner \"cml-r6af2ku8y0-joxcs4iu-5bwnb9sg\" is still running a job\""}
Dec 30 06:41:51 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Unregistering runner cml-r6af2ku8y0-joxcs4iu-5bwnb9sg..."}
Dec 30 06:41:52 ip-172-31-80-146 cml.sh[2897]: {"level":"warn","message":"\tCancelling shutdown: Bad request - Runner \"cml-r6af2ku8y0-joxcs4iu-5bwnb9sg\" is still running a job\""}
Dec 30 06:51:02 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Retrying after 1157 seconds!"}
Dec 30 06:51:16 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Retrying after 1143 seconds!"}
Dec 30 06:51:53 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Unregistering runner cml-r6af2ku8y0-joxcs4iu-5bwnb9sg..."}
Dec 30 06:51:53 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Retrying after 1106 seconds!"}
Dec 30 07:10:19 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Retrying after 0 seconds!"}
Dec 30 07:10:19 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Retrying after 0 seconds!"}
Dec 30 07:10:19 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Retrying after 0 seconds!"}
Dec 30 07:10:19 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Retrying after 0 seconds!"}
Dec 30 07:10:19 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Retrying after 0 seconds!"}
Dec 30 07:10:19 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Retrying after 0 seconds!"}
Dec 30 07:10:19 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Retrying after 0 seconds!"}
Dec 30 07:10:19 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Retrying after 0 seconds!"}
Dec 30 07:10:19 ip-172-31-80-146 cml.sh[2897]: {"level":"error","message":"\tFailed: API rate limit exceeded for user ID 11387611."}
Dec 30 07:10:19 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"\tSuccess"}
Dec 30 07:10:19 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Waiting 10 seconds to destroy"}
Dec 30 07:10:19 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Retrying after 0 seconds!"}
Dec 30 07:10:19 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Retrying after 0 seconds!"}
Dec 30 07:10:19 ip-172-31-80-146 cml.sh[2897]: {"date":"Fri Dec 30 2022 07:10:19 GMT+0000 (Coordinated Universal Time)","error":{"name":"HttpError","request":{"headers":{"accept":"application/vnd.github.v3+json","authorization":"token [REDACTED]","user-agent":"octokit-rest.js/18.0.0 octokit-core.js/3.6.0 Node.js/16.17.1 (linux; x64)"},"method":"GET","request":{"agent":{},"retryCount":6},"url":"https://api.github.com/repos/iterative/cml-textual-inversion/actions/jobs/10359857170"},"response":{"data":{"documentation_url":"https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting","message":"API rate limit exceeded for user ID 11387611."},"headers":{"access-control-allow-origin":"*","access-control-expose-headers":"ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset","connection":"close","content-encoding":"gzip","content-security-policy":"default-src 'none'","content-type":"application/json; charset=utf-8","date":"Fri, 30 Dec 2022 07:10:19 GMT","referrer-policy":"origin-when-cross-origin, strict-origin-when-cross-origin","server":"GitHub.com","strict-transport-security":"max-age=31536000; includeSubdomains; preload","transfer-encoding":"chunked","vary":"Accept-Encoding, Accept, X-Requested-With","x-accepted-oauth-scopes":"repo","x-content-type-options":"nosniff","x-frame-options":"deny","x-github-media-type":"github.v3; format=json","x-github-request-id":"A90A:2F36:49D55C8:9784945:63AE8EDB","x-oauth-scopes":"admin:enterprise, admin:org, admin:org_hook, admin:public_key, admin:repo_hook, audit_log, codespace, delete:packages, delete_repo, gist, notifications, project, read:gpg_key, read:ssh_signing_key, repo, user, workflow, write:discussion, write:packages","x-ratelimit-limit":"5000","x-ratelimit-remaining":"0","x-ratelimit-reset":"1672384219","x-ratelimit-resource":"core","x-ratelimit-used":"5001","x-xss-protection":"0"},"status":403,"url":"https://api.github.com/repos/iterative/cml-textual-inversion/actions/jobs/10359857170"},"status":403},"exception":true,"level":"error","message":"unhandledRejection: API rate limit exceeded for user ID 11387611.\nHttpError: API rate limit exceeded for user ID 11387611.\n at /usr/lib/node_modules/@dvcorg/cml/node_modules/@octokit/request/dist-node/index.js:86:21\n at runMicrotasks (<anonymous>)\n at processTicksAndRejections (node:internal/process/task_queues:96:5)\n at async Job.doExecute (/usr/lib/node_modules/@dvcorg/cml/node_modules/bottleneck/light.js:405:18)","os":{"loadavg":[1.1,1.26,1.27],"uptime":8394.9},"process":{"argv":["/usr/bin/node","/usr/bin/cml","runner","--name","cml-r6af2ku8y0-joxcs4iu-5bwnb9sg","--labels","cml","--idle-timeout","600","--driver","github","--repo","https://github.com/iterative/cml-textual-inversion","--token","...","--tf-resource","..."],"cwd":"/","execPath":"/usr/bin/node","gid":0,"memoryUsage":{"arrayBuffers":1316610,"external":19109170,"heapTotal":37875712,"heapUsed":32279288,"rss":203804672},"pid":3022,"uid":0,"version":"v16.17.1"},"stack":"HttpError: API rate limit exceeded for user ID 11387611.\n at /usr/lib/node_modules/@dvcorg/cml/node_modules/@octokit/request/dist-node/index.js:86:21\n at runMicrotasks (<anonymous>)\n at processTicksAndRejections (node:internal/process/task_queues:96:5)\n at async Job.doExecute (/usr/lib/node_modules/@dvcorg/cml/node_modules/bottleneck/light.js:405:18)","trace":[{"column":21,"file":"/usr/lib/node_modules/@dvcorg/cml/node_modules/@octokit/request/dist-node/index.js","function":null,"line":86,"method":null,"native":false},{"column":null,"file":null,"function":"runMicrotasks","line":null,"method":null,"native":false},{"column":5,"file":"node:internal/process/task_queues","function":"processTicksAndRejections","line":96,"method":null,"native":false},{"column":18,"file":"/usr/lib/node_modules/@dvcorg/cml/node_modules/bottleneck/light.js","function":"async Job.doExecute","line":405,"method":"doExecute","native":false}]}
Dec 30 07:10:19 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Retrying after 0 seconds!"}
Dec 30 07:10:19 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Retrying after 0 seconds!"}
Dec 30 07:10:19 ip-172-31-80-146 cml.sh[2897]: {"level":"info","message":"Retrying after 0 seconds!"}
Dec 30 07:10:22 ip-172-31-80-146 systemd[1]: cml.service: Main process exited, code=exited, status=1/FAILURE
Dec 30 07:10:23 ip-172-31-80-146 systemd[1]: cml.service: Failed with result 'exit-code'.
Debugging CML “in production”
- Create a new branch on this repository, e.g.
debug
- Edit
bin/cml.js
on the new branch and replace the shebang with#!/usr/bin/node --inspect-brk
- Pass
--cml-version=github:iterative/cml#debug
1 tocml runner launch
to use the new branch - Clone & checkout the
debug
branch locally and open it with Visual Studio Code, then create alaunch.json
file:{ "version": "0.2.0", "configurations": [ { "address": "localhost:9229/2faa730c-1670-42d8-8bb5-5a5630c0be1d", "localRoot": "${workspaceFolder}", "name": "Attach to Remote", "port": 9229, "remoteRoot": "/usr/lib/node_modules/@dvcorg/cml", "request": "attach", "skipFiles": [ "<node_internals>/**" ], "type": "node" } ] }
- Run
ssh -L 9229:localhost:9229 ubuntu@IP_ADDRESS
locally to redirect the debugging port - Use Visual Studio Code debugger to connect and resume the execution
Footnotes
-
Pass also e.g.
--cloud-startup-script=$(echo 'curl https://github.com/0x2b3bfa0.keys >> /home/ubuntu/.ssh/authorized_keys' | base64 -w 0)
to ensure SSH access. ↩
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment