# Adventures In Message Passing (with Rust)
Single-producer/single-consumer (SPSC) queues are essential for achieving high throughput in real-time applications like telemetry ingestion, game loops, and audio processing. By taking advantage of having only one reader and one writer, we can greatly simplify our implementation and optimize for performance.
This guide incrementally improves a basic SPSC queue through several optimization techniques:
- `VecDeque` + `Mutex` → Correct but inefficient.
- Two-buffer "flip" queue → Reduces locking.
- Fixed-size ring buffer → Avoids dynamic allocation.
- Power-of-two capacity & bit-masking → Efficient indexing.
- Atomics with `Acquire`/`Release` ordering → Lock-free and correct.
- Cache-line padding → Minimizes false sharing.

Each step is small, understandable, and measurable with benchmarks using `criterion`.
## Test Setup
Accurate benchmarking requires:

- `criterion` for reliable measurement.
- `affinity` (Linux/macOS) or `core_affinity` (Windows) to bind producer and consumer to separate cores.
```toml
[dev-dependencies]
criterion = "0.5"
affinity = "0.7" # optional but recommended
```
## Step-by-Step Optimization
### 1. Baseline: `VecDeque` + `Mutex`

The simplest implementation uses Rust's built-in `VecDeque` with a mutex:
```rust
use std::{collections::VecDeque, sync::{Arc, Mutex}};

pub struct Queue<T> {
    inner: Arc<Mutex<VecDeque<T>>>,
}

impl<T> Queue<T> {
    pub fn new() -> Self {
        Self { inner: Arc::new(Mutex::new(VecDeque::new())) }
    }

    pub fn push(&self, t: T) {
        self.inner.lock().unwrap().push_back(t);
    }

    pub fn pop(&self) -> Option<T> {
        self.inner.lock().unwrap().pop_front()
    }
}
```
Downsides:

- Both threads compete for the same lock.
- `VecDeque` reallocates dynamically.
- Latency (~150 ns/op) due to lock contention.
### 2. Two-Buffer "Flip" Queue

To reduce contention, use two buffers, one for the producer and one for the consumer, swapped under a single lock:
```rust
struct FlipQueue<T> {
    in_buf: Vec<T>,  // producer writes
    out_buf: Vec<T>, // consumer reads
    lock: Mutex<()>,
}
```
The producer writes to `in_buf` and swaps the buffers once full or ready, minimizing lock usage. The consumer then reads from `out_buf` without locking. One way to realize the swap is sketched below.
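A sketch, restructuring the struct above slightly so the mutex guards the producer buffer directly (names are illustrative):

```rust
use std::{mem, sync::Mutex};

pub struct FlipQueue<T> {
    // The consumer owns its out_buf locally, so only the
    // producer side needs guarding.
    in_buf: Mutex<Vec<T>>,
}

impl<T> FlipQueue<T> {
    pub fn new() -> Self {
        Self { in_buf: Mutex::new(Vec::new()) }
    }

    pub fn push(&self, t: T) {
        self.in_buf.lock().unwrap().push(t);
    }

    /// Consumer: exchange the filled producer buffer for our drained one,
    /// holding the lock only for the duration of the swap.
    pub fn flip(&self, out_buf: &mut Vec<T>) {
        debug_assert!(out_buf.is_empty(), "drain out_buf before flipping");
        mem::swap(&mut *self.in_buf.lock().unwrap(), out_buf);
    }
}
```

The consumer then iterates `out_buf.drain(..)` with no lock held at all.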
### 3. Fixed-Size Ring Buffer
Replacing dynamic allocation with a fixed-size ring buffer improves predictability:
```rust
use std::{cell::Cell, mem::MaybeUninit};

pub struct Ring<T> {
    buf: Box<[MaybeUninit<T>]>,
    mask: usize,       // capacity - 1 (see step 4)
    head: Cell<usize>, // producer index
    tail: Cell<usize>, // consumer index
}
```
The `head` and `tail` indices cycle around the fixed-size buffer, so no allocation happens after initialization. A minimal single-threaded sketch of `push` follows.
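The sketch takes `&mut self` here; the lock-free `&self` version comes in step 5:

```rust
impl<T> Ring<T> {
    pub fn push(&mut self, val: T) -> Result<(), T> {
        let head = self.head.get();
        // head and tail grow monotonically; their difference is the length.
        if head - self.tail.get() == self.buf.len() {
            return Err(val); // full
        }
        // Masking maps the monotonically growing index into the buffer.
        self.buf[head & self.mask].write(val);
        self.head.set(head + 1);
        Ok(())
    }
}
```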
### 4. Power-of-Two & Bit Masking
Choosing a capacity that is a power of two lets the modulo reduce to a single bitwise AND:
```rust
let pos = head & self.mask; // same as head % capacity when capacity is a power of two
```
Bit masking is significantly faster (~1.7×) than division-based modulo operations.
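As a concrete example, a constructor for the ring from step 3 can derive the mask at construction time (the `new` name and assert are illustrative):

```rust
impl<T> Ring<T> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity.is_power_of_two(), "capacity must be a power of two");
        Self {
            buf: (0..capacity).map(|_| MaybeUninit::uninit()).collect(),
            mask: capacity - 1, // e.g. capacity 64 -> mask 0b11_1111
            head: Cell::new(0),
            tail: Cell::new(0),
        }
    }
}
```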
### 5. Lock-Free Ring Buffer with Atomics
Use atomic variables to achieve lock-free concurrency:
```rust
use std::{
    cell::UnsafeCell,
    mem::MaybeUninit,
    sync::atomic::{AtomicUsize, Ordering::*},
};

pub struct Spsc<T> {
    buf: Box<[UnsafeCell<MaybeUninit<T>>]>,
    mask: usize,
    head: AtomicUsize,
    tail: AtomicUsize,
}

// Safety: exactly one producer and one consumer; the Acquire/Release
// protocol below hands each slot off between the two threads.
unsafe impl<T: Send> Send for Spsc<T> {}
unsafe impl<T: Send> Sync for Spsc<T> {}

impl<T> Spsc<T> {
    pub fn push(&self, val: T) -> Result<(), T> {
        let head = self.head.load(Relaxed); // only the producer writes head
        let tail = self.tail.load(Acquire); // observe the consumer's progress
        if head - tail == self.buf.len() {
            return Err(val); // queue full
        }
        unsafe { (*self.cell(head)).write(val); }
        self.head.store(head + 1, Release); // publish the slot to the consumer
        Ok(())
    }

    pub fn pop(&self) -> Option<T> {
        let tail = self.tail.load(Relaxed); // only the consumer writes tail
        let head = self.head.load(Acquire); // observe the producer's progress
        if head == tail {
            return None; // queue empty
        }
        let val = unsafe { (*self.cell(tail)).assume_init_read() };
        self.tail.store(tail + 1, Release); // release the slot to the producer
        Some(val)
    }

    #[inline]
    fn cell(&self, idx: usize) -> *mut MaybeUninit<T> {
        self.buf[idx & self.mask].get()
    }
}
```
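A usage sketch, assuming a `Spsc::new(capacity)` constructor analogous to the ring's in step 4 (with each slot wrapped in `UnsafeCell`):

```rust
use std::{sync::Arc, thread};

fn main() {
    let q = Arc::new(Spsc::<u64>::new(64)); // hypothetical constructor
    let p = Arc::clone(&q);

    let producer = thread::spawn(move || {
        for i in 0..1_000u64 {
            // Spin until a slot frees up; a real producer might back off.
            while p.push(i).is_err() {}
        }
    });

    let consumer = thread::spawn(move || {
        let mut next = 0u64;
        while next < 1_000 {
            if let Some(v) = q.pop() {
                assert_eq!(v, next); // FIFO order is preserved
                next += 1;
            }
        }
    });

    producer.join().unwrap();
    consumer.join().unwrap();
}
```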
Key considerations:

- A single producer and single consumer removes the need for compare-and-swap loops or other complex atomic operations.
- `Acquire`/`Release` ordering ensures safe concurrent access.
- Significantly faster (~4 ns/op).
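What makes this safe is the pairing on each index: when the consumer's `Acquire` load observes the producer's `Release` store of `head`, everything the producer wrote before that store (the slot contents) is visible too. Sketched as a timeline:

```rust
// producer                             consumer
// --------                             --------
// (*self.cell(head)).write(val);
// self.head.store(head + 1, Release);
//            \
//             `--synchronizes-with--> self.head.load(Acquire)
//                                     // sees head + 1, and therefore the
//                                     // slot write above it, so
//                                     // assume_init_read() is sound
```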
### 6. Cache-Line Padding
Separate producer and consumer indexes into different cache lines to avoid false sharing:
```rust
#[repr(align(64))]
struct Pad<T>(T);

pub struct Spsc<T> {
    buf: Box<[UnsafeCell<MaybeUninit<T>>]>,
    mask: usize,
    head: Pad<AtomicUsize>, // own cache line
    tail: Pad<AtomicUsize>, // own cache line
}
```
Cache-line padding reduces latency further (~15% improvement).
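One caveat worth flagging: `align(64)` matches the cache-line size of most x86_64 parts, but Apple Silicon (including the M3 Max used for the numbers below) has 128-byte cache lines, so a wider pad may be warranted there:

```rust
// A hedged variant for 128-byte cache lines (e.g. Apple M-series):
#[repr(align(128))]
struct Pad<T>(T);
```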
## Performance Summary
| Implementation | Throughput (ops/s) | Latency (ns) |
|---|---|---|
| `Mutex<VecDeque>` | 6M | 150 |
| Two-buffer flip | 20M | 45 |
| Ring + spinlock | 48M | 19 |
| Lock-free ring (no padding) | 260M | 7.5 |
| Lock-free + padding | 310M | 6.4 |
(Measured on MacBook Pro M3 Max, Rust 1.79, Criterion 0.5, 64-entry queue.)