Console may show "ghost" tasks when the event buffer is at capacity #230

Closed
asonix opened this issue Dec 17, 2021 · 5 comments
asonix commented Dec 17, 2021

Version

tokio v1.15.0
├── console-subscriber v0.1.0
│   └── current v0.1.0 (/home/asonix/Development/rust/current)
├── current v0.1.0 (/home/asonix/Development/rust/current)
├── h2 v0.3.9
│   ├── hyper v0.14.16
│   │   ├── hyper-timeout v0.4.1
│   │   │   └── tonic v0.6.2
│   │   │       ├── console-api v0.1.0
│   │   │       │   └── console-subscriber v0.1.0 (*)
│   │   │       └── console-subscriber v0.1.0 (*)
│   │   └── tonic v0.6.2 (*)
│   └── tonic v0.6.2 (*)
├── hyper v0.14.16 (*)
├── hyper-timeout v0.4.1 (*)
├── tokio-io-timeout v1.1.1
│   └── hyper-timeout v0.4.1 (*)
├── tokio-stream v0.1.8
│   ├── console-subscriber v0.1.0 (*)
│   ├── tonic v0.6.2 (*)
│   └── tower v0.4.11
│       └── tonic v0.6.2 (*)
├── tokio-util v0.6.9
│   ├── h2 v0.3.9 (*)
│   ├── tonic v0.6.2 (*)
│   └── tower v0.4.11 (*)
├── tonic v0.6.2 (*)
└── tower v0.4.11 (*)

Platform
Linux graystripe 5.13.0-22-generic #22~20.04.1-Ubuntu SMP Tue Nov 9 15:07:24 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Description
When observing tokio with tokio-console, it seems that some tasks are never reaped. I first noticed this in an actix-web application, and I've created a minimal reproduction using a plain tokio multi-threaded runtime.

I tried this code:

Cargo.toml:

[package]
name = "current"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
console-subscriber = "0.1"
tokio = { version = "1.15", features = ["full", "tracing"] }

main.rs:

#[tokio::main]
async fn main() -> std::io::Result<()> {
    console_subscriber::init();

    println!("Hello");
    tokio::signal::ctrl_c().await?;

    for _ in 0..10000 {
        tokio::task::spawn_blocking(move || {
            println!("Spawned");
        })
        .await?;
    }

    tokio::signal::ctrl_c().await
}

Here's a screenshot of my console session:
[Screenshot of the console session, 2021-12-17 15:40:47]

It seems like this blocking task is marked as "running" and it does not go away, no matter how long I wait. I wonder if this is an issue with tokio's reporting, or if a blocking task truly has not been reaped.

Here's another screenshot:
[Screenshot of the console session, 2021-12-17 15:38:07]

In this one, we have a number of blocking tasks that have not been reaped, two of which seem to never have been polled.

asonix changed the title Dec 17, 2021
hawkw commented Dec 18, 2021

It's possible that this is just a console issue. What happens if you dramatically increase the value of the console-subscriber layer's maximum event buffer capacity? Do you still see apparently-dangling tasks?
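
A minimal sketch of that configuration, assuming the console-subscriber 0.1 Builder API (event_buffer_capacity is the relevant setting; the value below is arbitrary and workload-dependent):

// Sketch only: replace `console_subscriber::init()` with the builder,
// raising the event buffer capacity. The exact value here is a guess.
use console_subscriber::ConsoleLayer;

fn init_console() {
    ConsoleLayer::builder()
        // The default capacity is much smaller; tune for your workload.
        .event_buffer_capacity(1024 * 1024)
        .init();
}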

asonix commented Dec 18, 2021

It does seem like increasing the event buffer capacity is the solution here, thanks!

asonix closed this as completed Dec 18, 2021
hawkw commented Dec 18, 2021

This should probably be a bug against the tokio-rs/console repository, rather than Tokio. I think what's happening here is that a bunch of tasks are being spawned, and they are eventually completing, but the event that records their completion is dropped because the console's event buffer is full. In that case, we never notice that the task has completed.

We might want to implement something where the console's aggregator task actively checks if task spans still exist, rather than just relying on a completion event. If we check for task existence and find that some task span is no longer active, but we missed the notification about its completion, we won't necessarily be able to record the correct time of completion, but we can at least remove it from the list of currently active tasks, so that we no longer show "ghost tasks" that should have completed.
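
A rough sketch of that sweep, using hypothetical names rather than the actual console-subscriber internals:

use std::collections::HashMap;

// Hypothetical aggregator state: task ID -> whether a completion event was
// seen. `span_exists` stands in for asking the subscriber whether the
// task's span is still alive.
fn sweep_ghost_tasks(tasks: &mut HashMap<u64, bool>, span_exists: impl Fn(u64) -> bool) {
    tasks.retain(|id, completed| {
        // If the span is gone but no completion event was recorded, the
        // drop event was probably lost at capacity; stop showing the task.
        *completed || span_exists(*id)
    });
}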

hawkw transferred this issue from tokio-rs/tokio Dec 18, 2021
hawkw reopened this Dec 18, 2021
hawkw self-assigned this Dec 18, 2021
hawkw added the A-instrumentation (Area: instrumentation), C-subscriber (Crate: console-subscriber), E-medium (Effort: medium), and S-bug (Severity: bug) labels Dec 18, 2021
hawkw changed the title from "Tokio seemingly leaves dangling tasks" to "Console may show "ghost" tasks when the event buffer is at capacity" Dec 18, 2021
hawkw added a commit that referenced this issue Dec 20, 2021
It seems like a number of people using the console with real-life
applications are running into issues related to the console's event
buffer being at capacity (e.g. #230, #234, etc.). This commit bumps up
the default capacity a bit as a quick fix.

In the future, I think we should probably do things to avoid reaching
the event buffer capacity as often --- I think we can change the
aggregator/layer design a bit to avoid creating as many events. But, for
now, this should hopefully give people a better default configuration.
hawkw added a commit that referenced this issue Dec 21, 2021
hawkw commented Dec 21, 2021

Working on a change (#238) that should hopefully fix this!

hawkw added a commit that referenced this issue Jan 12, 2022
…hannel (#238)

## Motivation

Currently, there are some rather bad issues that occur when the event
buffer is at capacity and events are dropped.

Completely *losing* data due to buffer capacity is relatively okay: if
we set a bound on how much memory the console can use, and we don't
record new things that occur when we've reached that limit, this is
correct and acceptable behavior. However, the current design can result
in *incorrect* data when events are lost due to the buffer being at
capacity.

This is because we currently record things like starting to poll a
task/resource, ending a poll, dropping a task/resource/async op, and
waker operations, as individual events in the buffer. This means that we
can have a situation where the creation of a task was recorded, because
there was buffer capacity at the time, but then when the task ended, the
buffer was full, so we never recorded its termination. That results in
the tasks appearing to run forever and never terminate --- see issue
#230. Similarly, if we record the beginning of a poll, but drop the
end-of-poll event because we're at capacity, this means we will
(incorrectly) record a poll that goes on forever, which is obviously
incorrect. I think this may also be the cause of the false positives
with the lost waker lint (#149), if we record a waker drop but missed a
waker clone that previously occurred.

The change in #212 fixed one category of issue that occurs due to event
buffer capacity --- when a task, resource, or async op's _creation_
event is dropped due to buffer capacity, we skip any subsequent events
related to that task/resource/op. However, this doesn't fix issues where
the subsequent events are the ones that are dropped.

## Solution

This branch proposes a solution to this whole category of event buffer
capacity related issues. Unfortunately, this requires rewriting a *lot*
of `console-subscriber`'s internals.

In the new approach, we now _only_ send events over the channel when
creating a new task, resource, or async op. Those events now contain an
`Arc` holding the stats for that entity. Another clone of the `Arc` is
stored in the `tracing_subscriber::Registry`'s [span extensions] for the
span corresponding to that entity. When the `ConsoleLayer` records
subsequent events for a particular entity, such as starting/ending a
poll, it looks up the span by ID, and updates the stats type stored in
its extensions. The aggregator stores its clone of the `Arc` in a map of
entities, just like it does currently, but no longer handles actually
updating the stats; just building wire format updates from any tracked
entities whose data was updated by the layer.
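
A minimal sketch of that pattern, using illustrative types rather than the real console-subscriber internals (the span extensions come from tracing-subscriber's Registry):

use std::sync::{
    atomic::{AtomicU64, Ordering},
    mpsc::SyncSender,
    Arc,
};
use tracing::{span, Subscriber};
use tracing_subscriber::{layer::Context, registry::LookupSpan, Layer};

#[derive(Default)]
struct TaskStats {
    polls: AtomicU64,
}

struct SketchLayer {
    // Bounded channel to the aggregator; only *new* tasks are sent here.
    new_tasks: SyncSender<Arc<TaskStats>>,
}

impl<S> Layer<S> for SketchLayer
where
    S: Subscriber + for<'a> LookupSpan<'a>,
{
    fn on_new_span(&self, _attrs: &span::Attributes<'_>, id: &span::Id, ctx: Context<'_, S>) {
        let stats = Arc::new(TaskStats::default());
        // If the channel is full, we never start tracking this task, so
        // everything we *do* track stays internally consistent.
        if self.new_tasks.try_send(stats.clone()).is_ok() {
            if let Some(span) = ctx.span(id) {
                span.extensions_mut().insert(stats);
            }
        }
    }

    fn on_enter(&self, id: &span::Id, ctx: Context<'_, S>) {
        // The start of a poll is recorded by mutating the shared stats in
        // place, not by pushing another event into the channel.
        if let Some(span) = ctx.span(id) {
            if let Some(stats) = span.extensions().get::<Arc<TaskStats>>() {
                stats.polls.fetch_add(1, Ordering::Relaxed);
            }
        }
    }
}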

This should fix all issues where dropping something due to event buffer
capacity results in incorrect data. Once we have successfully recorded
the *creation* of a task, resource, or async op, any subsequent updates
to its stats are *guaranteed* to be reliable. If the channel is at
capacity and we fail to record a new resource/task/op, we never create a
stats extension for it, and we won't record anything for it at all.
Otherwise, it will always have correct data recorded.

When possible, the `Arc`ed stats are updated atomically. In some cases,
this isn't easily possible, and some fields of the stats types are
stored in a mutex. In particular, this is required for storing
timestamps. I don't really love that, but these mutices should be
contended very infrequently. Stats aren't marked as having unsent
updates until after the stats inside the mutices have been updated, so
the aggregator will not try to lock the mutex if the layer is currently
updating it; instead, it will simply be included in the next update once
the layer is no longer touching it. Mutices here will only be contended
when multiple threads are updating a task's stats at the same time,
which should occur very rarely...and in most cases, they *still* won't
have to contend a mutex, since access to most of the mutices is guarded
by an atomic variable for e.g. determining which thread actually was the
last to complete a concurrent poll. The biggest performance downside of
the mutices is probably not actually contention, but the additional heap
allocation required when using `std::sync::Mutex`. However, since we
have conditional `parking_lot` support, `parking_lot` can be used to
avoid requiring additional allocations.

In the future, it's probably possible to make more of this atomic by
converting timestamps into integers and storing them in atomic
variables. I haven't done this yet because both the protobuf timestamps
and `std::time` timestamps are larger than a single 64-bit number and it
might take a little extra work to ensure we can nicely fit them in an
`AtomicUsize`...but we can probably do that later.
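
As an illustration of that future direction (not the actual code): a timestamp can be packed into an AtomicU64 as nanoseconds since a shared process-start Instant, with zero meaning "unset", which avoids the mutex entirely:

use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

#[derive(Default)]
struct AtomicTimestamp {
    // Nanoseconds since a shared base `Instant`; 0 means "not recorded".
    nanos: AtomicU64,
}

impl AtomicTimestamp {
    fn record(&self, base: Instant) {
        let nanos = base.elapsed().as_nanos() as u64;
        // Clamp to at least 1 so a real timestamp is never confused with "unset".
        self.nanos.store(nanos.max(1), Ordering::Release);
    }

    fn get(&self, base: Instant) -> Option<Instant> {
        match self.nanos.load(Ordering::Acquire) {
            0 => None,
            n => Some(base + Duration::from_nanos(n)),
        }
    }
}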

[span extensions]: https://docs.rs/tracing-subscriber/latest/tracing_subscriber/registry/struct.SpanRef.html#method.extensions

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
hawkw commented Jan 12, 2022

Fixed by #238.

hawkw closed this as completed Jan 12, 2022