wake_by_ref() has no effect when racing with a spurious wake #7589

@frostyplanet

Version
tokio 1.47.1

Platform

  1. The latest nightly Miri, from 1.89.0-nightly (e703dff8f 2025-06-11) to nightly-2025-06-04

  2. my laptop: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz

uname -a
Linux x1 6.8.0-62-generic #65-Ubuntu SMP PREEMPT_DYNAMIC Mon May 19 17:15:03 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
  3. GitHub runner macos-15 (arm)

Description
crossfire is an async MPMC channel built on top of crossbeam.

It looks like a future is not woken by Waker::wake_by_ref() on weakly ordered platforms. It may be related to spurious wake behavior.
The only thing I know for sure is that both the multi-thread runtime and the current-thread runtime have the issue.

I cannot reproduce the problem on x86_64; it only happens on weak-memory platforms.

1. Miri

The problem was first observed under weak-memory emulation with Miri.

Tracked by: frostyplanet/crossfire-rs#36

I was trying to integrate Miri into my workflow to detect deadlocks. I have a branch "trace_log" for this (logging added on top of codebase 2.0.26).
Because Miri is a single-threaded interpreter, I can record the log (written to disk without any locking) under Miri.
The reproduction rate of the deadlock is very high (>20%).
The log shows that wake_by_ref has no effect, possibly when some future gets spuriously woken. Since the waker is not consumed from the registry, I check that will_wake() returns true and return Poll::Pending; at the same time, the other side of the channel sees the waker and attempts to wake it, but the future never runs again. (A minimal sketch of this pattern follows below.)
Strangely, the problem is not reproducible in the arm workflow of crossfire 2.0.26: https://github.com/frostyplanet/crossfire-rs/actions/workflows/cron_master_arm.yml
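
For context, here is a minimal sketch of the pattern the log describes. This is my own simplification, not crossfire's actual code; Registry, Recv and notify are made-up names. On a spurious poll the receiver keeps the already-registered waker because will_wake() returns true, and the sender later calls wake_by_ref() on that stored waker, which is expected to make the runtime poll the future again.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Waker};

// Hypothetical registry: one stored waker plus a readiness flag.
struct Registry {
    ready: AtomicBool,
    waker: Mutex<Option<Waker>>,
}

// Receiver-side future: on a spurious poll it keeps the previously
// registered waker when will_wake() says it is still valid.
struct Recv(Arc<Registry>);

impl Future for Recv {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0.ready.load(Ordering::Acquire) {
            return Poll::Ready(());
        }
        let mut slot = self.0.waker.lock().unwrap();
        match slot.as_ref() {
            // Spurious wake: the stored waker would wake this task, so keep it.
            Some(w) if w.will_wake(cx.waker()) => {}
            _ => *slot = Some(cx.waker().clone()),
        }
        Poll::Pending
    }
}

// Sender side (possibly another worker thread): publish readiness,
// then wake whoever is registered.
fn notify(reg: &Registry) {
    reg.ready.store(true, Ordering::Release);
    if let Some(w) = reg.waker.lock().unwrap().as_ref() {
        w.wake_by_ref(); // expected to make the runtime poll Recv again
    }
}
```

With the waker slot under a lock and the flag using acquire/release ordering, my expectation is that the receiver either observes readiness on its next poll or receives a wake afterwards; the logs below suggest the wake is delivered but the task is never polled again.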

The following is the last part of the log.

[2025-09-02 13:34:10.535895][INFO] wake rx LockedWaker(seq=16) 0
[2025-09-02 13:34:10.647736][INFO] rx: None recv
[2025-09-02 13:34:10.751009][INFO] tx: None send
[2025-09-02 13:34:10.927185][INFO] rx: new waker
[2025-09-02 13:34:10.945428][INFO] tx: None send
[2025-09-02 13:34:11.109501][INFO] rx: reg LockedWaker(seq=17)
[2025-09-02 13:34:11.244530][INFO] tx: new waker
[2025-09-02 13:34:11.296098][INFO] rx: cancel_waker true
[2025-09-02 13:34:11.422699][INFO] tx: reg LockedWaker(seq=15)
[2025-09-02 13:34:11.623061][INFO] tx: LockedWaker(seq=15) commit true
// spurious wake, channel is still full
[2025-09-02 13:34:11.796264][INFO] tx: will_wake LockedWaker(seq=15), return true 
[2025-09-02 13:34:11.935039][INFO] tx: LockedWaker(seq=15) commit true
[2025-09-02 13:34:12.042935][INFO] wake tx LockedWaker(seq=15) 1
// bug: message received and wake issued, but tx is not run afterwards as it is supposed to be
[2025-09-02 13:34:12.222726][INFO] rx: None recv
[2025-09-02 13:34:12.348995][INFO] rx: new waker
[2025-09-02 13:34:12.449376][INFO] rx: reg LockedWaker(seq=18)
[2025-09-02 13:34:12.555072][INFO] rx: LockedWaker(seq=18) commit true
// rx spuriously woken, channel is still empty
[2025-09-02 13:34:12.676574][INFO] rx: will_wake LockedWaker(seq=18), return true 
[2025-09-02 13:34:12.783323][INFO] rx: LockedWaker(seq=18) commit true
// report deadlock

2. On Arm platform

Deadlock is observed in the GitHub arm (macOS-15) workflow.

I'm preparing to release a refactored version of crossfire (2.1); the code is a lot more complex than 2.0, but it follows a similar concept.
Deadlock is observed in the arm workflow https://github.com/frostyplanet/crossfire-rs/actions/workflows/cron_dev_arm.yml on every run.
It always happens in the blocking->async test case, because the probability of a spurious wake is higher when only one receiver future runs in the runtime (the rough shape of the test is sketched below).
Tracked in issue: frostyplanet/crossfire-rs#35
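
Roughly, the failing test has the shape sketched below. To keep the snippet self-contained I use tokio::sync::mpsc as a stand-in channel; the real tests use crossfire's channel, and the suspected problem is in the interaction between crossfire's waker handling and the runtime, not in tokio's mpsc. One OS thread sends through a blocking API while a single receiver future runs on the runtime, so almost every receiver poll races with a sender-side wake.

```rust
use std::thread;

fn main() {
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(2)
        .enable_all()
        .build()
        .unwrap();

    // Stand-in bounded channel; the real test uses crossfire.
    let (tx, mut rx) = tokio::sync::mpsc::channel::<usize>(8);

    // Blocking producer on a plain OS thread.
    let producer = thread::spawn(move || {
        for i in 0..100_000 {
            // Parks the OS thread whenever the bounded channel is full.
            tx.blocking_send(i).unwrap();
        }
    });

    // Single async receiver future in the runtime.
    rt.block_on(async move {
        let mut received = 0usize;
        while rx.recv().await.is_some() {
            received += 1;
        }
        assert_eq!(received, 100_000);
    });

    producer.join().unwrap();
}
```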

In this scenario, I keep the log in memory and trigger a dump to disk on SIGINT/SIGTERM, and I already get complete log messages. (One example with an intact log is in the comment frostyplanet/crossfire-rs#35 (comment).)

Here are two examples of wakers that do not actually wake the future:

[2025-09-08 01:33:27.201211][DEBUG][ThreadId(2)] rx: reg waker(75 state=0)
[2025-09-08 01:33:27.201212][DEBUG][ThreadId(2)] rx: commit_waiting waker(75 state=1) 1
// spuriously woken
[2025-09-08 01:33:27.201212][DEBUG][ThreadId(2)] rx: will_wake waker(75 state=1)
// the waker is seen and woken by the sender
[2025-09-08 01:33:27.201217][DEBUG][ThreadId(5)] wake rx waker(75 state=3) Waked
// no later log entry mentions waker 75
...
[2025-09-08 01:33:27.272342][DEBUG][ThreadId(2)] rx: reg waker(2117 state=0)
[2025-09-08 01:33:27.272344][DEBUG][ThreadId(2)] rx: commit_waiting waker(2117 state=1) 1
// spuriously woken
[2025-09-08 01:33:27.272346][DEBUG][ThreadId(2)] rx: will_wake waker(2117 state=1)
// the waker is seen and woken by the sender
[2025-09-08 01:33:27.272355][DEBUG][ThreadId(6)] wake rx waker(2117 state=3) Waked
// no later log entry mentions waker 2117

3. Thoughts on how to analyze the problem

This looks like a serious problem that I have to resolve.
I'm not familiar with the internals of the tokio worker. Perhaps I should fork the tokio repo and write a minimal test, with some logging to print the internal state of the waker? (A rough starting point is sketched below.)
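
As a possible starting point for such a minimal test (everything here, including the Shared and Probe names, is my own sketch and not based on tokio internals): a hand-rolled future stores its waker and returns Poll::Pending, while another thread calls wake_by_ref() on the stored waker; if the wake is delivered correctly, the runtime must poll the future a second time.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Waker};
use std::time::Duration;

struct Shared {
    polls: AtomicUsize,
    done: AtomicBool,
    waker: Mutex<Option<Waker>>,
}

struct Probe(Arc<Shared>);

impl Future for Probe {
    type Output = usize;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<usize> {
        let n = self.0.polls.fetch_add(1, Ordering::SeqCst) + 1;
        if self.0.done.load(Ordering::Acquire) {
            return Poll::Ready(n);
        }
        // Store the waker and wait to be woken from another thread.
        *self.0.waker.lock().unwrap() = Some(cx.waker().clone());
        Poll::Pending
    }
}

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let shared = Arc::new(Shared {
        polls: AtomicUsize::new(0),
        done: AtomicBool::new(false),
        waker: Mutex::new(None),
    });
    let task = tokio::spawn(Probe(shared.clone()));

    let s2 = shared.clone();
    std::thread::spawn(move || {
        // Wait until the first poll has stored a waker.
        while s2.waker.lock().unwrap().is_none() {
            std::thread::sleep(Duration::from_millis(1));
        }
        s2.done.store(true, Ordering::Release);
        s2.waker.lock().unwrap().as_ref().unwrap().wake_by_ref();
    });

    let polls = task.await.unwrap();
    println!("future completed after {polls} polls"); // expect 2
}
```

If a test of this kind (or a crossfire-shaped variant of it) ever hangs on task.await under Miri or on arm, that would point at wake delivery in the runtime; if it never hangs, the problem is more likely in how crossfire publishes and consumes its registered wakers.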
