
The operation was canceled. #2468

Open
magnetnation opened this issue Mar 1, 2023 · 28 comments
Labels
awaiting-customer-response bug Something isn't working

Comments

@magnetnation

Describe the bug
Since last week, Actions workflows in several of our repositories have started to fail with a similar error:

##[error]The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.

To Reproduce
Steps to reproduce the behavior:

  1. Go to Actions
  2. Run or re-run any job
  3. See error mentioned above

Expected behavior
Workflow should be running without cancellations.

Runner Version and Platform

Image: ubuntu-22.04
Version: 20230219.1
Included Software: https://github.com/actions/runner-images/blob/ubuntu22/20230219.1/images/linux/Ubuntu2204-Readme.md
Image Release: https://github.com/actions/runner-images/releases/tag/ubuntu22%2F20230219.1

OS of the machine running the runner?
Linux

What's not working?

The workflow fails without any reason stated in the logs, other than that it has been canceled.

Job Log Output

2023-03-01T06:49:56.3959504Z > @mgnation/mgdata@1.4.212 _bundle
2023-03-01T06:49:56.3960408Z > node build/bundle.js
2023-03-01T06:49:56.3960710Z
2023-03-01T06:49:56.8884234Z Bundling has started
2023-03-01T06:49:56.9282005Z Copy results completed
2023-03-01T06:49:56.9290747Z Allow publish completed
2023-03-01T06:49:56.9295155Z build: 39.141ms
2023-03-01T06:49:56.9302039Z Bundling has finished
2023-03-01T06:49:57.2702808Z
2023-03-01T06:49:57.2708891Z > @mgnation/mgdata@1.4.212 test
2023-03-01T06:49:57.2709823Z > node ./node_modules/nyc/bin/nyc.js node ./tmp/spec/runner.js
2023-03-01T06:49:57.2710190Z
2023-03-01T06:50:56.1865343Z ##[error]The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
2023-03-01T06:50:56.2969758Z ##[debug]Re-evaluate condition on job cancellation for step: 'npm install, build, and test'.
2023-03-01T06:50:56.2973354Z ##[debug]Skip Re-evaluate condition on runner shutdown.
2023-03-01T06:50:56.5597757Z ----------|---------|----------|---------|---------|-------------------
2023-03-01T06:50:56.5612354Z File | % Stmts | % Branch | % Funcs | % Lines | Uncovered Line #s
2023-03-01T06:50:56.5616821Z ----------|---------|----------|---------|---------|-------------------
2023-03-01T06:50:56.5618821Z All files | 0 | 0 | 0 | 0 |
2023-03-01T06:50:56.5620688Z ----------|---------|----------|---------|---------|-------------------
2023-03-01T06:50:56.6038836Z ##[error]The operation was canceled.
2023-03-01T06:50:56.6052060Z ##[debug]System.OperationCanceledException: The operation was canceled.
2023-03-01T06:50:56.6060563Z ##[debug] at System.Threading.CancellationToken.ThrowOperationCanceledException()
2023-03-01T06:50:56.6063113Z ##[debug] at GitHub.Runner.Sdk.ProcessInvoker.ExecuteAsync(String workingDirectory, String fileName, String arguments, IDictionary`2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Channel`1 redirectStandardIn, Boolean inheritConsoleHandler, Boolean keepStandardInOpen, Boolean highPriorityProcess, CancellationToken cancellationToken)
2023-03-01T06:50:56.6065859Z ##[debug] at GitHub.Runner.Common.ProcessInvokerWrapper.ExecuteAsync(String workingDirectory, String fileName, String arguments, IDictionary`2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Channel`1 redirectStandardIn, Boolean inheritConsoleHandler, Boolean keepStandardInOpen, Boolean highPriorityProcess, CancellationToken cancellationToken)
2023-03-01T06:50:56.6069900Z ##[debug] at GitHub.Runner.Worker.Handlers.DefaultStepHost.ExecuteAsync(IExecutionContext context, String workingDirectory, String fileName, String arguments, IDictionary`2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Boolean inheritConsoleHandler, String standardInInput, CancellationToken cancellationToken)
2023-03-01T06:50:56.6078953Z ##[debug] at GitHub.Runner.Worker.Handlers.ScriptHandler.RunAsync(ActionRunStage stage)
2023-03-01T06:50:56.6079410Z ##[debug] at GitHub.Runner.Worker.ActionRunner.RunAsync()
2023-03-01T06:50:56.6086736Z ##[debug] at GitHub.Runner.Worker.StepsRunner.RunStepAsync(IStep step, CancellationToken jobCancellationToken)
2023-03-01T06:50:56.6102555Z ##[debug]Finishing: npm install, build, and test
2023-03-01T06:50:56.6376954Z ##[debug]Evaluating condition for step: 'Post Use Node.js 16.x'
2023-03-01T06:50:56.6379186Z ##[debug]Skip evaluate condition on runner shutdown.
2023-03-01T06:50:56.6392406Z ##[debug]Evaluating condition for step: 'Post Run actions/checkout@v3'
2023-03-01T06:50:56.6392834Z ##[debug]Skip evaluate condition on runner shutdown.
2023-03-01T06:50:56.6821240Z ##[debug]Starting: Complete job
2023-03-01T06:50:56.6833228Z Uploading runner diagnostic logs
2023-03-01T06:50:56.7043348Z ##[debug]Starting diagnostic file upload.
2023-03-01T06:50:56.7046324Z ##[debug]Setting up diagnostic log folders.
2023-03-01T06:50:56.7306655Z ##[debug]Creating diagnostic log files folder.
2023-03-01T06:50:56.7419356Z ##[debug]Copying 1 worker diagnostic logs.
2023-03-01T06:50:56.7470424Z ##[debug]Copying 1 runner diagnostic logs.
2023-03-01T06:50:56.7539403Z ##[debug]Zipping diagnostic files.
2023-03-01T06:50:56.8067735Z ##[debug]Uploading diagnostic metadata file.
2023-03-01T06:50:56.8441811Z ##[debug]Diagnostic file upload complete.
2023-03-01T06:50:56.8445882Z Completed runner diagnostic log upload
2023-03-01T06:50:56.8450728Z Cleaning up orphan processes
2023-03-01T06:50:56.9100720Z ##[debug]Finishing: Complete job
2023-03-01T06:50:56.9301578Z ##[debug]Finishing: build (16.x)

@magnetnation magnetnation added the bug Something isn't working label Mar 1, 2023
@jarreds

jarreds commented Mar 2, 2023

This has recently started happening at an onerous frequency in our Bazel monorepo build action, on a self-hosted runner. Happy to provide any troubleshooting info I can.

Summary:

Bazel Build
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
Bazel Build
The operation was canceled.

[2023-03-02 03:46:14Z INFO Runner] Received Ctrl-C signal, stop Runner.Listener and Runner.Worker.
[2023-03-02 03:46:14Z INFO HostContext] Runner will be shutdown for UserCancelled
[2023-03-02 03:48:24Z INFO Worker] Cancellation/Shutdown message received.
[2023-03-02 03:48:24Z INFO HostContext] Runner will be shutdown for UserCancelled
[2023-03-02 03:48:24Z INFO StepsRunner] Cancel current running step.

These events are happening despite us not clicking cancel on the action.

@ruvceskistefan
Contributor

Hey all,

@magnetnation are you running a hosted runner or using our image for a self-hosted runner?
Also, could you provide us with job URLs (it's OK if the repo is private) so we can check the logs?

@magnetnation
Author

magnetnation commented Mar 2, 2023

Hi,

We are using hosted runners, and here is an example of a failed run.
And a failed run from another repository, and another one.

@matsest

matsest commented Mar 6, 2023

Also seeing this a lot in various PowerShell commands that run on GitHub runners. It seems very unpredictable, but we are definitely seeing it across multiple jobs and commands. We had not seen this to this extent before.

@mathemaphysics

mathemaphysics commented Mar 8, 2023

I'm seeing the same thing here with a fairly large CMake C++ project build. It behaves as if it has no swap and runs out of memory. I've had this happen in a WSL Docker container before, and I ended up needing to allocate more RAM to the virtual machine. I have no idea how that would translate to this situation.

@magnetnation
Author

Any update on this issue?
None of our actions have been able to go through since then.

@matsest

matsest commented Mar 14, 2023

Same here, still seeing these issues @ruvceskistefan

@igordrnobrega

Any update on this? I've had this issue for a week now and nothing seems to solve it.

@nuhkoca

nuhkoca commented Mar 20, 2023

We have also been facing this issue very frequently for weeks now, even with an 8-core larger runner. How can I upload a shutdown log before the runner dies? It could actually give me some insight.
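
The closest workaround I can think of is attaching whatever is already on disk from a step that runs even when the job fails. This is only a sketch, assuming the interesting logs land under the workspace or ${{ runner.temp }} (the artifact name and paths are placeholders); it will not help if the whole VM is killed before the step gets a chance to run:

      # Hypothetical cleanup step; artifact name and paths are placeholders.
      - name: Upload logs even if the job fails
        if: always()
        uses: actions/upload-artifact@v3
        with:
          name: run-logs
          path: |
            ${{ runner.temp }}/*.log
            **/build/reports/**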

@credativ-dar

credativ-dar commented Mar 20, 2023

@ruvceskistefan I think the "awaiting-customer-response" label is no longer appropriate; could you remove it?

I think I have the same issue:
https://github.com/credativ/vali/actions/runs/4467520060/jobs/7847403788

 make: *** [Makefile:245: lint] Killed
Error: The operation was canceled.

Could this be an OOM killer?

Update:
For me it was 100% an out-of-memory situation. If you are using golangci-lint and upgraded to Go 1.20, this is your bug: golangci/golangci-lint#3470

Feature request for better OOM-reporting: https://github.com/orgs/community/discussions/50571
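
One way to check for this (a sketch; it only works when the runner survives long enough to run the step) is to grep the kernel log from a step that runs even after a failure:

      # Hypothetical diagnostic step: look for OOM-killer messages after a failed build.
      - name: Check for OOM kills
        if: always()
        run: sudo dmesg | grep -iE 'killed process|out of memory' || true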

@edergillian-eeg

This issue started happening after I transferred a repository from one organization to another. It was working when the repo was in the other organization, and now all jobs are being cancelled, no exception.

There's no timeout set on the workflow file and the cancellation can happen anytime between ~30s and ~1m.

Both organizations have exactly the same configuration and paid plan. Other repositories' workflows in the organization run just fine, including other transferred repositories.

When I re-ran the failed jobs with debug logging enabled, the only error I get in the runner logs is this:

[2023-03-22 02:35:37Z ERR  GitHubActionsService] GET request to https://pipelines.actions.githubusercontent.com/<big_hash>/_apis/distributedtask/pools/2/messages?sessionId=<big_session_id>&lastMessageId=597&status=Online&runnerVersion=2.303.0 failed. HTTP Status: Forbidden, AFD Ref: Ref A: BFA7EEED124B494B83E44B4AB4DCA200 Ref B: BN3EDGE0703 Ref C: 2023-03-22T02:35:37Z
[2023-03-22 02:35:37Z INFO MessageListener] Runner OAuth token has been revoked. Unable to pull message.

@Milamary

Having a similar issue with runner version '2.303.0' on 'ubuntu-latest-16-cores'.
It randomly fails builds with: Gradle build daemon disappeared unexpectedly (it may have been killed or may have crashed)


@jsjoeio

jsjoeio commented Apr 28, 2023

This is happening to us as well on a monorepo (Turborepo) in the Lint job (running with ESLint).

@nuhkoca

nuhkoca commented Apr 29, 2023

We fixed it by limiting workers on an 8-core machine with:

org.gradle.workers.max=4
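
The same limit can also be passed on the command line instead of gradle.properties; a sketch, assuming a standard Gradle wrapper and a build task:

      # Equivalent sketch using Gradle's --max-workers flag.
      - name: Build with limited workers
        run: ./gradlew build --max-workers=4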

@mathemaphysics

@nuhkoca I'm confused, because my CMake build should only run a single thread unless the -j flag is given (when the build system is make or ninja). That's why I assumed it wasn't the issue.

Maybe the default setting changed.

@nuhkoca

nuhkoca commented May 1, 2023

@mathemaphysics but Gradle uses all cores by default, no? The official description of the flag:

org.gradle.workers.max=(max # of worker processes)
When configured, Gradle will use a maximum of the given number of workers. See also performance command-line options. Default is number of CPU processors.

@sundarvenkata-EBI

sundarvenkata-EBI commented May 7, 2023

We are having a similar issue as well where we have a process running for a really long time (like 6 hours) before it fails for no reason. See here

@netsgnut

We are having a similar issue as well where we have a process running for a really long time (like 6 hours) before it fails for no reason. See here

I think your case may be different from others here. It is related to the 6-hour job execution limit instead. From the docs,

Job execution time - Each job in a workflow can run for up to 6 hours of execution time. If a job reaches this limit, the job is terminated and fails to complete.
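
If that is the case, setting an explicit job timeout below the platform cap at least makes the failure reason obvious; a sketch, assuming a job named build:

    jobs:
      build:
        runs-on: ubuntu-latest
        # Hypothetical value: fail fast with a clear timeout instead of hitting the 6-hour cap.
        timeout-minutes: 300
        steps:
          - uses: actions/checkout@v3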

@deathemperor

deathemperor commented Jun 1, 2023

We're constantly having issues with this; it's unusable. Until it's fixed we have to run it manually every time. Link to the action run: https://github.com/papaya-insurtech/berry/actions/runs/5131060157/jobs/9230656685

@bvdmitri

bvdmitri commented Jun 2, 2023

We were able to fix it on our side: we had a mistake in our codebase that caused an enormous amount of unnecessary allocations. We no longer have any issues since the mistake was fixed. It looks like the "The operation was canceled" error was definitely an OOM error, but a better error message would be nice.

@nuhkoca

nuhkoca commented Jun 2, 2023

We were able to fix it on our side: we had a mistake in our codebase that caused an enormous amount of unnecessary allocations. We no longer have any issues since the mistake was fixed. It looks like the "The operation was canceled" error was definitely an OOM error, but a better error message would be nice.

Hey @bvdmitri, what was the fix exactly? Maybe it can shed some light on ours, too 🙂

@bvdmitri

bvdmitri commented Jun 2, 2023

@nuhkoca
Well, we refactored our code so that it allocates and uses less memory.
Nothing specific to the GitHub Actions runner.

@a1300

a1300 commented Jul 26, 2023

For me, roughly every third GitHub Actions run failed. Increasing the swap space to 10 GB on the ubuntu-latest runner with the GitHub Action pierotofy/set-swap-space fixed the problem for me:

      - name: Set Swap Space
        uses: pierotofy/set-swap-space@master
        with:
          swap-size-gb: 10
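
To confirm the swap is actually in place, a quick check step can follow (a sketch using standard Linux tools on the Ubuntu runner):

      # Print available memory and the active swap devices.
      - name: Show memory and swap
        run: free -h && swapon --show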

@nuhkoca

nuhkoca commented Jul 26, 2023

We also switched over to a different garbage collector, and it fixed most of our OOM problems.

Instead of -XX:+UseParallelGC, we now use:

-XX:+UseConcMarkSweepGC

This flag is needed to activate the CMS Collector in the first place. By default, HotSpot uses the Throughput Collector instead.

-XX:+UseParNewGC

When the CMS collector is used, this flag activates the parallel execution of young generation GCs using multiple threads. It may seem surprising at first that we cannot simply reuse the flag -XX:+UseParallelGC known from the Throughput Collector, because conceptually the young generation GC algorithms used are the same. However, since the interplay between the young generation GC algorithm and the old generation GC algorithm is different with the CMS collector, there are two different implementations of young generation GC and thus two different flags.

https://www.codecentric.de/wissens-hub/blog/useful-jvm-flags-part-7-cms-collector
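
For Gradle builds, the daemon JVM flags usually live in gradle.properties; a sketch of wiring that up from the workflow, assuming an older JDK where CMS is still available (it was removed in JDK 14) and a placeholder heap size:

      # Hypothetical step: append daemon JVM flags before the build runs (-Xmx4g is a placeholder).
      - name: Configure Gradle daemon GC
        run: echo "org.gradle.jvmargs=-Xmx4g -XX:+UseParNewGC -XX:+UseConcMarkSweepGC" >> gradle.properties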

rgrewe added a commit to greenbone/gos-ci that referenced this issue Aug 9, 2023
Current runners got a newer Ubuntu, but this breaks our piuparts run.
The program lsof runs at 100% CPU for ~15 minutes, and the runner cancels the job after that.

* actions/runner#2468
* actions/runner-images#7188
r-bk added a commit to r-bk/rsdns that referenced this issue Oct 22, 2023
miri runs have been failing lately. Checking whether this is related to higher RAM
consumption of miri in recent releases. See [1] for the proposed
solution.

---
[1] actions/runner#2468 (comment)
ukd1 added a commit to ukd1/pcbflow that referenced this issue Jan 16, 2024
@kaanx022

For me, nothing works. I couldn't even find an image or an example that works. I tried a matrix of Java and SDK combinations, all images, all platforms, all the different things.

This is so frustrating. Why won't you give us an example that just works? Every single time I have to build a CI from scratch, every few years, I have to go through this rabbit hole.

https://github.com/kaanx022/kaan/actions/runs/9042209780/job/24848448703

@kaanx022

OK, on Ubuntu some of them are actually running with the right setup; the big matrix table is here:

https://github.com/kaanx022/kaan/actions/runs/9042456492/job/24848994152

But we shouldn't have to run a matrix of ALL POSSIBLE COMBINATIONS just to figure out the one that works.

@wffurr

wffurr commented Sep 11, 2024

This seems to still be happening, e.g. with sudo apt-get install -y libtinfo5. See also actions/runner-images#9959. Removing the needrestart service seems to work around the issue.
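
For reference, a sketch of that workaround as a workflow step (removing needrestart before the apt-get install mentioned above):

      # Sketch of the workaround: drop needrestart so apt-get installs don't hang the job.
      - name: Remove needrestart before installing packages
        run: |
          sudo apt-get remove -y needrestart
          sudo apt-get install -y libtinfo5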
