The #1 Ben Levi Blog On The Internet

Adventures In Message Passing (with Rust)

Single-producer/single-consumer (SPSC) queues are essential for achieving high throughput in real-time applications like telemetry ingestion, game loops, and audio processing. By taking advantage of having only one reader and one writer, we can greatly simplify our implementation and optimize for performance.

This guide incrementally improves a basic SPSC queue through several optimization techniques:

  1. VecDeque + Mutex — Correct but inefficient.
  2. Two-buffer “flip” queue — Reduces locking.
  3. Fixed-size ring buffer — Avoids dynamic allocation.
  4. Power-of-two capacity & bit-masking — Efficient indexing.
  5. Atomics with Acquire/Release ordering — Lock-free and correct.
  6. Cache-line padding — Minimizes false sharing.

Each step is small, understandable, and measurable with benchmarks using criterion.

Test Setup

Accurate benchmarking requires:

[dev-dependencies]
criterion = "0.5"
affinity   = "0.7"   # optional but recommended

Step-by-Step Optimization

1. Baseline: VecDeque + Mutex

The simplest implementation uses Rust’s built-in VecDeque with a mutex:

use std::{collections::VecDeque, sync::{Arc, Mutex}};

pub struct Queue<T> {
    inner: Arc<Mutex<VecDeque<T>>>,
}

impl<T> Queue<T> {
    pub fn new() -> Self {
        Self { inner: Arc::new(Mutex::new(VecDeque::new())) }
    }

    pub fn push(&self, t: T) {
        self.inner.lock().unwrap().push_back(t);
    }

    pub fn pop(&self) -> Option<T> {
        self.inner.lock().unwrap().pop_front()
    }
}

Downsides:

2. Two-Buffer “Flip” Queue

To reduce contention, use two buffers—one for the producer, one for the consumer—swapped under a single lock:

struct FlipQueue<T> {
    in_buf: Vec<T>,      // producer writes
    out_buf: Vec<T>,     // consumer reads
    lock: Mutex<()>,
}

The producer writes to in_buf and swaps buffers once full or ready, minimizing lock usage. Consumer reads from out_buf without locking.

3. Fixed-Size Ring Buffer

Replacing dynamic allocation with a fixed-size ring buffer improves predictability:

pub struct Ring<T> {
    buf: Box<[MaybeUninit<T>]>,
    mask: usize,
    head: Cell<usize>,  // producer
    tail: Cell<usize>,  // consumer
}

The head and tail indexes cycle around a fixed-size buffer, avoiding allocations after initialization.

4. Power-of-Two & Bit Masking

Choosing a buffer size that is a power of two allows efficient modulo operations:

let pos = head & self.mask;

Bit masking is significantly faster (~1.7×) than division-based modulo operations.

5. Lock-Free Ring Buffer with Atomics

Use atomic variables to achieve lock-free concurrency:

use std::sync::atomic::{AtomicUsize, Ordering::*};

pub struct Spsc<T> {
    buf: Box<[UnsafeCell<MaybeUninit<T>>]>,
    mask: usize,
    head: AtomicUsize,
    tail: AtomicUsize,
}

impl<T> Spsc<T> {
    pub fn push(&self, val: T) -> Result<(), T> {
        let head = self.head.load(Relaxed);
        let tail = self.tail.load(Acquire);

        if head - tail == self.buf.len() {
            return Err(val); // queue full
        }

        unsafe { (*self.cell(head)).write(val); }
        self.head.store(head + 1, Release);
        Ok(())
    }

    pub fn pop(&self) -> Option<T> {
        let tail = self.tail.load(Relaxed);
        let head = self.head.load(Acquire);

        if head == tail {
            return None; // queue empty
        }

        let val = unsafe { (*self.cell(tail)).assume_init_read(); }
        self.tail.store(tail + 1, Release);
        Some(val)
    }

    #[inline]
    fn cell(&self, idx: usize) -> *mut MaybeUninit<T> {
        self.buf[(idx & self.mask)].get()
    }
}

Key considerations:

6. Cache-Line Padding

Separate producer and consumer indexes into different cache lines to avoid false sharing:

#[repr(align(64))]
struct Pad<T>(T);

pub struct Spsc<T> {
    buf: Box<[UnsafeCell<MaybeUninit<T>>]>,
    mask: usize,
    head: Pad<AtomicUsize>,
    tail: Pad<AtomicUsize>,
}

Cache-line padding reduces latency further (~15% improvement).

Performance Summary

Implementation Throughput (ops/s) Latency (ns)
Mutex<VecDeque> 6M 150
Two-buffer flip 20M 45
Ring + spinlock 48M 19
Lock-free ring (no padding) 260M 7.5
Lock-free + padding 310M 6.4

(Measured on MacBook Pro M3 Max, Rust 1.79, Criterion 0.5, 64-entry queue.)