I managed to hit a p50 round-trip time of 56.5 ns (for 32-byte payloads) and a throughput of ~13.2M RTT/sec on a standard CPU (i7-12650H).
Here are the primary architectural choices that make this possible:
- Strict SPSC & No CAS: I went with a strict Single-Producer Single-Consumer topology. There are no compare-and-swap loops on the hot path. acquire_tx and acquire_rx are essentially just a load, a mask, and a branch using memory_order_acquire / release.
- Hardware Sympathy: Every control structure (message headers, atomic indices) is padded to 128-byte boundaries. False sharing between the producer and consumer cache lines is structurally impossible.
- Zero-Copy: The hot path is entirely in a memfd shared memory segment after an initial Unix Domain Socket handshake (SCM_RIGHTS).
- Hybrid Wait Strategy: The consumer spins for a bounded threshold using cpu_relax(), then falls back to a sleep via SYS_futex (Linux) or __ulock_wait (macOS) to prevent CPU starvation.
The core is C++23, and it exposes a C ABI to bind the other languages.
I am sharing this here for anyone building high-throughput polyglot architectures and dealing with cross-language ingestion bottlenecks.
> MPSC (multiple-producer single-consumer) requires a compare-and-swap loop on the head pointer so that two producers can each reserve a contiguous slot without overlap.
Martin Thompsons designs, as used in Aerons logbuffer implementation, don't require a CAS retry loop. Multiple producers can reserve and commit concurrently without blocking one another.
The trade off is you must carefully decide on an upper bound for message size and the number of producer threads (in the hundreds typically). A caretaker thread also needs to run periodically to reclaim/zero memory off the hot path. Typically though, this isn't a problem.
Aeron itself, which you compare at ~250ns, I think (not entirely sure) is paying the price for being multi consumer as well as multi producer, and perhaps implementing flow control to pace producers. You can turn off multi producer by using an exclusive publication, which eliminates one atomic RMW operation on reserve. I'm not sure where the other nanos are lost.
For SPSC, doing away with 2 atomic shared counters and moving to a single counter + inline headers is a win for thread to thread latency. The writer only needs to read the readers new position from a shared cache line when it believes the queue is full. The reader can batch writes to this counter, so it doesn't have to write to memory at all most of the time. The writer has one fewer contended cache line to write to in general since the header lives in the first cacheline of the message, which it's writing anyway. This is where the win comes from under contention (when the queue is ~empty)
This means your consumer isn't getting a lot of benefit from caching the producers position. The queue appears empty the majority of the time and it has to re-load the counter (causing it to claim the cacheline).
Meanwhile the producer goes to write message N+1 and update the counter again, and has to claim it back (S to M in MESI), when it could have just set a completion flag in the message header that the consumer hasn't touched in ages (since the ring buffer last lapped). And it's just written data to this line anyway so already has it exclusively.
So when your queue is almost always empty, this counter is just another cache line being ping ponged between cores.
This gets back to Aeron. In Aerons design the reader can get ahead of the writer and it's safe.
It instead embeds a bunch of runtimes onto the same OS thread.
Sure, the “hot path” is probably very fast for all, but what about the slow path?
It’s fairly standard to make the waiting side spin a bit after processing some data, and only issue another wait syscall if no more data arrives during the spin period.
(For instance, io_uring, which does this kind of IPC with a kernel thread on the receiving side, literally lets you configure how long said kernel thread should spin[1].)
I wonder to what extent the performance would be affected with a middle-ground option to spin a few times and then call sched_yield() syscall before spinning again.
I wouldn't be surprised if somebody develops a cross-language framework with this.