ConcurrentBag.WorkStealingQueue.LocalPush() crash due to OverflowException #114817

Open
@antiduh

Description

The following exception was observed from multithreaded software that uses ConcurrentBag:

System.OverflowException: Arithmetic operation resulted in an overflow.
    at System.Collections.Concurrent.ConcurrentBag`1.WorkStealingQueue.LocalPush(T item, Int64& emptyToNonEmptyListTransitionCount)

The software continuously processes RF samples and uses ConcurrentBag to cache empty buffers used by the various stages of the RF processing software. A high-level overview of how ConcurrentBag is used by the software is as follows:

  • The read thread allocates a buffer from ConcurrentBag (CB), reads samples from the hardware, and then enqueues the buffer to the ingress queue.
  • The ingress thread reads from the ingress queue, allocates 6 buffers from CB, copies the ingress buffer's contents to the 6 new buffers, queues the 6 new buffers to the 6 processing threads (one buffer per thread), and returns the input buffer to CB.
  • The 6 processing threads read their buffers and process them, writing their buffers to their individual processing-complete queues.
  • The egress thread allocates an egress buffer from CB, collects the 6 buffers from the processing-complete queues, sums the 6 buffers into the egress buffer, returns the 6 processing buffers to CB, and enqueues the egress buffer to the transmit queue.
  • The transmit thread reads from the transmit queue, writes to the hardware, and returns the egress buffer to CB.
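
As a rough illustration of the caching pattern described in the list above, here is a minimal sketch of a buffer pool built on ConcurrentBag. The class name `BufferPool`, its members, and the `float[]` sample type are assumptions made for the example; the actual software's types are not shown in this report.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical buffer cache; names and the float[] element type are assumptions.
public sealed class BufferPool
{
    private readonly ConcurrentBag<float[]> _bag = new ConcurrentBag<float[]>();
    private readonly int _samplesPerBuffer;

    public BufferPool(int samplesPerBuffer)
    {
        _samplesPerBuffer = samplesPerBuffer;
    }

    // "Allocate from CB": reuse a cached buffer if one is available,
    // otherwise create a fresh one.
    public float[] Rent()
    {
        return _bag.TryTake(out float[] buffer) ? buffer : new float[_samplesPerBuffer];
    }

    // "Return to CB": hand the buffer back so another stage can reuse it.
    public void Return(float[] buffer)
    {
        _bag.Add(buffer);
    }

    // Number of buffers currently cached (not counting buffers in flight).
    public int CachedCount => _bag.Count;
}
```

In this sketch, a thread such as the transmit thread would only ever call `Return`, while the read thread would only ever call `Rent`; the ingress and egress threads call both.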

Samples arrive at 30 MHz and each buffer holds 8192 samples, so each buffer spans 273.1 microseconds and the system reads a buffer from the hardware roughly 3662 times a second.

Thus, each logical 'step' of the software requires 8 allocations from and 8 deallocations to ConcurrentBag, involves 10 threads (4 of which touch CB), and occurs 3662 times a second. At steady state, there are roughly 56 buffers in play at any time.
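
The figures above can be re-derived with a few lines of arithmetic. The sketch below only restates the numbers given in this report; the class and member names are made up for the example.

```csharp
using System;

// Back-of-envelope rates from the description; all names are hypothetical.
public static class PipelineRates
{
    public const double SampleRateHz = 30e6;
    public const int SamplesPerBuffer = 8192;
    public const int BagTakesPerStep = 8; // 1 (read) + 6 (ingress) + 1 (egress)
    public const int BagAddsPerStep = 8;  // 1 (ingress) + 6 (egress) + 1 (transmit)

    // Time covered by one buffer, in microseconds (about 273.1).
    public static double BufferDurationMicroseconds()
    {
        return SamplesPerBuffer / SampleRateHz * 1e6;
    }

    // Pipeline steps per second (about 3662).
    public static double StepsPerSecond()
    {
        return SampleRateHz / SamplesPerBuffer;
    }

    // Combined ConcurrentBag takes and adds per second across all threads.
    public static double BagOperationsPerSecond()
    {
        return (BagTakesPerStep + BagAddsPerStep) * StepsPerSecond();
    }
}
```

With 16 bag operations per step, this works out to a little under 59,000 ConcurrentBag operations per second in total.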

This software ran continuously for 8 days before the crash was observed.
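
For scale, the following sketch compares the number of pipeline steps in 8 days against `int.MaxValue`, the boundary that the LocalPush() excerpt later in this report checks for. It assumes a hypothetical per-thread counter that advances by one per push and never moves back; whether ConcurrentBag's internal indices behave that way under this workload is not established here, so treat this as a plausibility check and not a diagnosis.

```csharp
using System;

// Rough overflow-horizon estimate; names are hypothetical.
public static class OverflowHorizon
{
    // 30 MHz sample rate divided by 8192 samples per buffer.
    public const double StepsPerSecond = 30e6 / 8192;
    private const double SecondsPerDay = 86400;

    // Total pipeline steps executed over the given number of days.
    public static double StepsInDays(double days)
    {
        return days * SecondsPerDay * StepsPerSecond;
    }

    // Days until a counter incremented `pushesPerStep` times per step
    // (and never decremented: an assumption) reaches int.MaxValue.
    public static double DaysToReachIntMax(int pushesPerStep)
    {
        return int.MaxValue / (pushesPerStep * StepsPerSecond * SecondsPerDay);
    }
}
```

Under these assumptions, 8 days of operation is about 2.5 billion steps, which exceeds `int.MaxValue` (about 2.1 billion). A thread that pushes once per step, as the transmit thread does, would reach that count after a little under 7 days.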

...

Looking at the current code for LocalPush() on GitHub, it appears to have some handling for overflow, but perhaps there is a corner case in this logic that is still buggy.

            internal void LocalPush(T item, ref long emptyToNonEmptyListTransitionCount)
            {
...
                    // Rare corner case (at most once every 2 billion pushes on this thread):
                    // We're going to increment the tail; if we'll overflow, then we need to reset our counts
                    if (tail == int.MaxValue)
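
For context on the exception type itself: System.OverflowException with the message shown in the stack trace is what C# raises when integer arithmetic overflows in a checked context. The sketch below is not the ConcurrentBag source and makes no claim about which statement in LocalPush() throws; it only demonstrates, with made-up names, how a checked increment at `int.MaxValue` differs from an unchecked one.

```csharp
using System;

// Illustration of checked vs. unchecked increment; not the runtime's code.
public static class OverflowDemo
{
    // Returns true if incrementing `tail` in a checked context throws
    // OverflowException.
    public static bool CheckedIncrementThrows(int tail)
    {
        try
        {
            _ = checked(tail + 1);
            return false;
        }
        catch (OverflowException)
        {
            return true;
        }
    }

    // In an unchecked context the same increment silently wraps around.
    public static int UncheckedIncrement(int tail)
    {
        return unchecked(tail + 1);
    }
}
```

Here `OverflowDemo.CheckedIncrementThrows(int.MaxValue)` reports a throw, while `OverflowDemo.UncheckedIncrement(int.MaxValue)` wraps to `int.MinValue`.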

Reproduction Steps

Discussed above. If I have time in the next few days, I'll see if I can write a program that reproduces.
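
A possible starting point for such a program is sketched below. It mimics only one aspect of the workload: one thread that exclusively adds to the bag (like the transmit thread) and one thread that exclusively takes (like the read thread), with a bounded number of items in flight. The names `BagStress`, `Run`, and `MaxInFlight` are invented for the example, and it has not been confirmed that this pattern triggers the crash.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical stress harness; not verified to reproduce the reported crash.
public static class BagStress
{
    private const int MaxInFlight = 56; // roughly the steady-state buffer count

    // Adds `pushes` items on one thread while another thread takes them.
    // Returns the number of items taken; rethrows the first bag failure.
    public static long Run(long pushes)
    {
        var bag = new ConcurrentBag<byte[]>();
        var failures = new ConcurrentQueue<Exception>();
        var buffer = new byte[1];
        long taken = 0;

        var producer = new Thread(() =>
        {
            try
            {
                for (long i = 0; i < pushes && failures.IsEmpty; i++)
                {
                    // Keep the number of outstanding items bounded.
                    while (i - Volatile.Read(ref taken) >= MaxInFlight && failures.IsEmpty)
                        Thread.Yield();
                    bag.Add(buffer);
                }
            }
            catch (Exception ex) { failures.Enqueue(ex); }
        });

        var consumer = new Thread(() =>
        {
            try
            {
                while (taken < pushes && failures.IsEmpty)
                {
                    if (bag.TryTake(out _)) Volatile.Write(ref taken, taken + 1);
                    else Thread.Yield();
                }
            }
            catch (Exception ex) { failures.Enqueue(ex); }
        });

        producer.Start();
        consumer.Start();
        producer.Join();
        consumer.Join();

        if (failures.TryDequeue(out Exception first))
            throw new InvalidOperationException("ConcurrentBag operation failed", first);
        return taken;
    }
}
```

To probe the `int.MaxValue` boundary, `Run` would need to be called with more than `int.MaxValue` pushes (for example `BagStress.Run(3_000_000_000L)`), which may take a long time to complete.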

Expected behavior

Doesn't crash with sustained use.

Actual behavior

Crashes with sustained use.

Regression?

No response

Known Workarounds

No response

Configuration

  • Program target: dotnet8 x64
  • Dotnet runtime: dotnet 8.0.14
  • Operating system: Ubuntu 24.10.1 x64
  • CPU: AMD 9950X - 16 cores, 32 threads.

I would expect that this bug requires a multicore CPU to reproduce, likely with at least 4 cores.

Other information

I'm not able to release memory dumps from this software when the crash occurs because the memory contains ITAR-controlled information.
