# Adventures In Message Passing (with Rust)
Single-producer/single-consumer (SPSC) queues are essential for achieving high throughput in real-time applications like telemetry ingestion, game loops, and audio processing. By taking advantage of having only one reader and one writer, we can greatly simplify our implementation and optimize for performance.
This guide incrementally improves a basic SPSC queue through several optimization techniques:
- `VecDeque` + `Mutex` → Correct but inefficient.
- Two-buffer "flip" queue → Reduces locking.
- Fixed-size ring buffer → Avoids dynamic allocation.
- Power-of-two capacity & bit-masking → Efficient indexing.
- Atomics with `Acquire`/`Release` ordering → Lock-free and correct.
- Cache-line padding → Minimizes false sharing.

Each step is small, understandable, and measurable with benchmarks using `criterion`.
## Test Setup
Accurate benchmarking requires:

- `criterion` for reliable measurement.
- `affinity` (Linux/macOS) or `core_affinity` (Windows) to bind producer and consumer to separate cores.
```toml
[dev-dependencies]
criterion = "0.5"
affinity = "0.7" # optional but recommended
```
## Step-by-Step Optimization
### 1. Baseline: `VecDeque` + `Mutex`

The simplest implementation uses Rust's built-in `VecDeque` with a mutex:
```rust
use std::{collections::VecDeque, sync::{Arc, Mutex}};

pub struct Queue<T> {
    inner: Arc<Mutex<VecDeque<T>>>,
}

impl<T> Queue<T> {
    pub fn new() -> Self {
        Self { inner: Arc::new(Mutex::new(VecDeque::new())) }
    }

    pub fn push(&self, t: T) {
        self.inner.lock().unwrap().push_back(t);
    }

    pub fn pop(&self) -> Option<T> {
        self.inner.lock().unwrap().pop_front()
    }
}
```
Downsides:

- Both threads compete for the same lock.
- `VecDeque` reallocates dynamically.
- Latency (~150 ns/op) due to lock contention.
### 2. Two-Buffer "Flip" Queue

To reduce contention, use two buffers, one for the producer and one for the consumer, swapped under a single lock:
```rust
struct FlipQueue<T> {
    in_buf: Vec<T>,  // producer writes
    out_buf: Vec<T>, // consumer reads
    lock: Mutex<()>,
}
```
The producer writes to `in_buf` and swaps the buffers once full or ready, minimizing lock usage. The consumer then reads from `out_buf` without locking. One way to realize the swap is sketched below.
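A sketch, restructuring the struct above slightly so the mutex guards the producer buffer directly (names are illustrative):

```rust
use std::{mem, sync::Mutex};

pub struct FlipQueue<T> {
    // The consumer owns its out_buf locally, so only the
    // producer side needs guarding.
    in_buf: Mutex<Vec<T>>,
}

impl<T> FlipQueue<T> {
    pub fn new() -> Self {
        Self { in_buf: Mutex::new(Vec::new()) }
    }

    pub fn push(&self, t: T) {
        self.in_buf.lock().unwrap().push(t);
    }

    /// Consumer: exchange the filled producer buffer for our drained one,
    /// holding the lock only for the duration of the swap.
    pub fn flip(&self, out_buf: &mut Vec<T>) {
        debug_assert!(out_buf.is_empty(), "drain out_buf before flipping");
        mem::swap(&mut *self.in_buf.lock().unwrap(), out_buf);
    }
}
```

The consumer then iterates `out_buf.drain(..)` with no lock held at all.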
### 3. Fixed-Size Ring Buffer
Replacing dynamic allocation with a fixed-size ring buffer improves predictability:
```rust
use std::{cell::Cell, mem::MaybeUninit};

pub struct Ring<T> {
    buf: Box<[MaybeUninit<T>]>,
    mask: usize,       // capacity - 1 (see step 4)
    head: Cell<usize>, // producer index
    tail: Cell<usize>, // consumer index
}
```
The `head` and `tail` indices cycle around the fixed-size buffer, so no allocation happens after initialization. A minimal single-threaded sketch of `push` follows.
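The sketch takes `&mut self` here; the lock-free `&self` version comes in step 5:

```rust
impl<T> Ring<T> {
    pub fn push(&mut self, val: T) -> Result<(), T> {
        let head = self.head.get();
        // head and tail grow monotonically; their difference is the length.
        if head - self.tail.get() == self.buf.len() {
            return Err(val); // full
        }
        // Masking maps the monotonically growing index into the buffer.
        self.buf[head & self.mask].write(val);
        self.head.set(head + 1);
        Ok(())
    }
}
```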
### 4. Power-of-Two & Bit Masking
Choosing a capacity that is a power of two lets the modulo reduce to a single bitwise AND:
```rust
let pos = head & self.mask; // same as head % capacity when capacity is a power of two
```
Bit masking is significantly faster (~1.7×) than division-based modulo operations.
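As a concrete example, a constructor for the ring from step 3 can derive the mask at construction time (the `new` name and assert are illustrative):

```rust
impl<T> Ring<T> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity.is_power_of_two(), "capacity must be a power of two");
        Self {
            buf: (0..capacity).map(|_| MaybeUninit::uninit()).collect(),
            mask: capacity - 1, // e.g. capacity 64 -> mask 0b11_1111
            head: Cell::new(0),
            tail: Cell::new(0),
        }
    }
}
```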
### 5. Lock-Free Ring Buffer with Atomics
Use atomic variables to achieve lock-free concurrency:
```rust
use std::{
    cell::UnsafeCell,
    mem::MaybeUninit,
    sync::atomic::{AtomicUsize, Ordering::*},
};

pub struct Spsc<T> {
    buf: Box<[UnsafeCell<MaybeUninit<T>>]>,
    mask: usize,
    head: AtomicUsize,
    tail: AtomicUsize,
}

// Safety: exactly one producer and one consumer; the Acquire/Release
// protocol below hands each slot off between the two threads.
unsafe impl<T: Send> Send for Spsc<T> {}
unsafe impl<T: Send> Sync for Spsc<T> {}

impl<T> Spsc<T> {
    pub fn push(&self, val: T) -> Result<(), T> {
        let head = self.head.load(Relaxed); // only the producer writes head
        let tail = self.tail.load(Acquire); // observe the consumer's progress
        if head - tail == self.buf.len() {
            return Err(val); // queue full
        }
        unsafe { (*self.cell(head)).write(val); }
        self.head.store(head + 1, Release); // publish the slot to the consumer
        Ok(())
    }

    pub fn pop(&self) -> Option<T> {
        let tail = self.tail.load(Relaxed); // only the consumer writes tail
        let head = self.head.load(Acquire); // observe the producer's progress
        if head == tail {
            return None; // queue empty
        }
        let val = unsafe { (*self.cell(tail)).assume_init_read() };
        self.tail.store(tail + 1, Release); // release the slot to the producer
        Some(val)
    }

    #[inline]
    fn cell(&self, idx: usize) -> *mut MaybeUninit<T> {
        self.buf[idx & self.mask].get()
    }
}
```
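A usage sketch, assuming a `Spsc::new(capacity)` constructor analogous to the ring's in step 4 (with each slot wrapped in `UnsafeCell`):

```rust
use std::{sync::Arc, thread};

fn main() {
    let q = Arc::new(Spsc::<u64>::new(64)); // hypothetical constructor
    let p = Arc::clone(&q);

    let producer = thread::spawn(move || {
        for i in 0..1_000u64 {
            // Spin until a slot frees up; a real producer might back off.
            while p.push(i).is_err() {}
        }
    });

    let consumer = thread::spawn(move || {
        let mut next = 0u64;
        while next < 1_000 {
            if let Some(v) = q.pop() {
                assert_eq!(v, next); // FIFO order is preserved
                next += 1;
            }
        }
    });

    producer.join().unwrap();
    consumer.join().unwrap();
}
```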
Key considerations:

- A single producer and single consumer removes the need for compare-and-swap loops or other complex atomic operations.
- `Acquire`/`Release` ordering ensures safe concurrent access.
- Significantly faster (~4 ns/op).
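What makes this safe is the pairing on each index: when the consumer's `Acquire` load observes the producer's `Release` store of `head`, everything the producer wrote before that store (the slot contents) is visible too. Sketched as a timeline:

```rust
// producer                             consumer
// --------                             --------
// (*self.cell(head)).write(val);
// self.head.store(head + 1, Release);
//            \
//             `--synchronizes-with--> self.head.load(Acquire)
//                                     // sees head + 1, and therefore the
//                                     // slot write above it, so
//                                     // assume_init_read() is sound
```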
### 6. Cache-Line Padding
Separate producer and consumer indexes into different cache lines to avoid false sharing:
```rust
#[repr(align(64))]
struct Pad<T>(T);

pub struct Spsc<T> {
    buf: Box<[UnsafeCell<MaybeUninit<T>>]>,
    mask: usize,
    head: Pad<AtomicUsize>, // own cache line
    tail: Pad<AtomicUsize>, // own cache line
}
```
Cache-line padding reduces latency further (~15% improvement).
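One caveat worth flagging: `align(64)` matches the cache-line size of most x86_64 parts, but Apple Silicon (including the M3 Max used for the numbers below) has 128-byte cache lines, so a wider pad may be warranted there:

```rust
// A hedged variant for 128-byte cache lines (e.g. Apple M-series):
#[repr(align(128))]
struct Pad<T>(T);
```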
## Performance Summary
| Implementation | Throughput (ops/s) | Latency (ns) |
|---|---|---|
| `Mutex<VecDeque>` | 6M | 150 |
| Two-buffer flip | 20M | 45 |
| Ring + spinlock | 48M | 19 |
| Lock-free ring (no padding) | 260M | 7.5 |
| Lock-free + padding | 310M | 6.4 |
(Measured on MacBook Pro M3 Max, Rust 1.79, Criterion 0.5, 64-entry queue.)