Description
TLDR
As of rustc 1.22.0-nightly (981ce7d8d 2017-09-03), cloning a `mpsc::Sender` speeds up spsc workloads. My benchmarking seems to indicate that this is due to:
1. Contention on shared counters in the node cache (push, pop).
2. False sharing between the producer and consumer fields in the underlying queue.
3. False sharing or contention in the wakeup code.
2 can be fixed by simply changing the alignment of the offending members so that the producer and consumer parts are on separate cache lines. 1 can be fixed with a small rewrite so that the queue only tracks its cache size at the consumer (a version of this code can be found here). 3 can be mitigated by reworking the alignment, but I am not sure that is a full fix; rewriting the send and recv logic so that no counter is needed may also fix the issue (a version of the code that does this can be found here and here, but it is not complete).
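As a sketch of the alignment fix for 2, assuming a 64-byte cache line (typical on x86_64, though not guaranteed), an aligned wrapper can force the producer and consumer halves onto separate lines. The `Queue`/`Node` shapes below are simplified stand-ins of my own, not std's actual internals:

```rust
use std::sync::atomic::AtomicPtr;

// Assumed cache-line size; real hardware may differ (e.g. 128 bytes on
// some ARM and Apple cores).
#[repr(align(64))]
struct CacheAligned<T>(T);

// Simplified stand-in for the spsc queue's layout: the producer-only and
// consumer-only fields each get their own cache line, so writes by one
// thread no longer invalidate the line the other thread is reading.
#[allow(dead_code)]
struct Queue<T> {
    producer: CacheAligned<AtomicPtr<Node<T>>>, // written by the sender
    consumer: CacheAligned<AtomicPtr<Node<T>>>, // written by the receiver
}

#[allow(dead_code)]
struct Node<T> {
    value: Option<T>,
    next: AtomicPtr<Node<T>>,
}

fn main() {
    // Each wrapped field starts on its own 64-byte boundary and occupies
    // at least a full line, so the two halves can never share one.
    assert_eq!(std::mem::align_of::<CacheAligned<AtomicPtr<Node<u64>>>>(), 64);
    assert_eq!(std::mem::size_of::<Queue<u64>>(), 128);
    println!("queue size: {} bytes", std::mem::size_of::<Queue<u64>>());
}
```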
In the following benchmark:
```rust
fn bench_mpsc_stream() -> f64 {
    let (sender, receiver) = channel();
    bench_spsc(sender, receiver)
}

fn bench_mpsc_shared() -> f64 {
    let (sender, receiver) = channel();
    // this clone forces the queue into shared mode and makes the benchmark faster
    let _clone = sender.clone();
    bench_spsc(sender, receiver)
}
```
```rust
const COUNT: u64 = 10_000_000;

fn bench_spsc(tx: Sender<u64>, rx: Receiver<u64>) -> f64 {
    // ensure that the channel is not in Once mode
    tx.send(0).unwrap();
    tx.send(0).unwrap();
    rx.recv().unwrap();
    rx.recv().unwrap();

    let start = ::std::time::Instant::now();
    scope(|scope| {
        scope.spawn(move || {
            for x in 0..(COUNT * 2) {
                let _ = black_box(tx.send(x));
            }
        });
        for _ in 0..(COUNT * 2) {
            let _ = black_box(rx.recv().unwrap());
        }
    });
    let d = start.elapsed();
    nanos(d) / ((COUNT * 2) as f64)
}
```
(derived from crossbeam, full code can be found in this repo)
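For completeness, `scope` above comes from crossbeam and `nanos` is a small helper from the benchmark code; a plausible stand-in for it (assumed, since the repo's exact version isn't reproduced here) is:

```rust
use std::time::Duration;

// Assumed helper: total nanoseconds in a Duration, as f64, so the
// benchmark can report fractional ns/send.
fn nanos(d: Duration) -> f64 {
    d.as_secs() as f64 * 1e9 + d.subsec_nanos() as f64
}

fn main() {
    assert_eq!(nanos(Duration::new(1, 500)), 1_000_000_500.0);
    assert_eq!(nanos(Duration::new(0, 0)), 0.0);
    println!("ok");
}
```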
on an early 2014 Intel i5 running macOS, using rustc 1.22.0-nightly (981ce7d8d 2017-09-03) or rustc 1.20.0 (f3d6973f4 2017-08-27), `stream` runs at roughly 201 ns/send while `shared` runs at 134 ns/send.
Running on Linux on EC2 and on a Raspberry Pi shows similar behavior.
The underlying data structures show some difference in performance (spsc queue 75 ns/send, mpsc 59 ns/send), though not enough to fully explain the gap. While I have not yet looked into the `mpsc` code enough to be sure of the cause, I did find potential improvements for `spsc`:
| SPSC Performance | Unaligned | Cache Aligned |
|---|---|---|
| Default Cache | 75 ns/send | 97 ns/send |
| No Cache | 46 ns/send | 47 ns/send |
| Unbounded Cache | 30 ns/send | 12 ns/send |
| Low Contention Cache | 35 ns/send | 9 ns/send |
Where Unaligned is the current struct layout and Cache Aligned puts the consumer fields and producer fields on their own cache lines (code is here).
Default Cache is the current node cache implementation in std, No Cache disables the node cache entirely (producer code, consumer code), and Unbounded Cache should be self-explanatory.
Low Contention Cache rewrites the cache bounding logic so that it is done entirely on the consumer side: instead of keeping a count of how many nodes are in the cache, it keeps track of how many nodes are marked as eligible to be cached, and only and always caches the nodes so marked (code for this can be found here; my experimental implementation stores the eligible flag in the node, though it could also be done by stealing a bit from the pointer).
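As an illustration of the bounding idea, here is a hypothetical single-threaded model (my own sketch, not the lock-free implementation): marked nodes are always cached, and the mark is handed out only while the consumer sees spare capacity, so no counter shared with the producer is needed. In this model the cache length stands in for the eligible count, since nothing is in flight:

```rust
use std::collections::VecDeque;

// Hypothetical model: in the real queue the flag would live in the node
// itself (or a stolen pointer bit) and the queue would be lock-free.
#[derive(Default)]
struct Node {
    cache_eligible: bool,
}

struct Consumer {
    cache: VecDeque<Node>,
    bound: usize,
}

impl Consumer {
    // Called when the consumer finishes with a node. Marked nodes are
    // always cached; unmarked nodes are marked (and cached) only while
    // the cache is below its bound, so the bound holds without any
    // counter the producer also touches.
    fn recycle(&mut self, mut node: Node) {
        if node.cache_eligible {
            self.cache.push_back(node);
        } else if self.cache.len() < self.bound {
            node.cache_eligible = true;
            self.cache.push_back(node);
        }
        // Otherwise the node is simply dropped (freed) here.
    }
}

// Recycle `n` fresh nodes through a consumer with the given bound and
// report how many ended up cached.
fn cached_after(n: usize, bound: usize) -> usize {
    let mut c = Consumer { cache: VecDeque::new(), bound };
    for _ in 0..n {
        c.recycle(Node::default());
    }
    c.cache.len()
}

fn main() {
    // The cache never grows past its bound, however many nodes pass through.
    assert_eq!(cached_after(10, 4), 4);
    assert_eq!(cached_after(2, 4), 2);
    println!("ok");
}
```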
Some of these performance improvements translate to the full `stream` structure:
| Stream Performance | Unaligned | Cache Aligned |
|---|---|---|
| Default Cache | 205 ns/send | 212 ns/send |
| No Cache | 103 ns/send | 91 ns/send |
| Low Contention Cache | 179 ns/send | 102 ns/send |
But to fully see the benefits, contention with the wakeup state needs to be removed:
| Stream2 Performance | Unaligned | Cache Aligned |
|---|---|---|
| Default Cache | 131 ns/send | 112 ns/send |
| No Cache | 91 ns/send | 58 ns/send |
| Low Contention Cache | 103 ns/send | 27 ns/send |
(The numbers were collected from a version of `stream` that does not use a counter at all; I've gotten similar numbers from simply putting every field in `stream` on its own cache line. I think there should be a layout which uses exactly 2 cache lines, one for the producer and one for the consumer, with similar performance, but I have not done enough benchmarking to confirm it yet.)
All the code to reproduce these numbers can be found in this repo, along with the numbers for a Raspberry Pi.
Note that the Raspberry Pi seemed to be undergoing throttling as the benchmark ran, so numbers gathered later in the run are significantly worse than the ones gathered at the beginning.
My apologies for the length and quality of writing.
cc @alexcrichton I believe.