Although atomic operations should generally be avoided due to their complexity, they can be crucial in high-performance and low-latency applications. They provide a mechanism for safely manipulating shared data across threads without locks, which can be expensive in terms of performance. Here I use two of my implementations of lock-free SPSC (single producer, single consumer) ring buffers to compare atomics performance between Go and C++.
The first project, fastchan, is a Go-based ring buffer optimized for the SPSC use case. Its counterpart, cpp-fastchan, started off as a closely modeled port of the fastchan project to C++.
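To make the comparison concrete, here is a minimal sketch of what an SPSC ring buffer looks like with default (sequentially consistent) atomics. This is an illustration of the general shape, not the actual fastchan or cpp-fastchan code; the class and member names are my own.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Illustrative SPSC ring buffer; not the cpp-fastchan implementation.
template <typename T>
class SpscRing {
public:
    explicit SpscRing(size_t capacity)
        : buf_(capacity), mask_(capacity - 1) {}  // capacity must be a power of two

    bool try_put(const T& v) {
        uint64_t head = head_.load();                   // seq_cst by default
        uint64_t tail = tail_.load();
        if (head - tail == buf_.size()) return false;   // ring is full
        buf_[head & mask_] = v;
        head_.store(head + 1);                          // publish to the consumer
        return true;
    }

    std::optional<T> try_get() {
        uint64_t tail = tail_.load();
        uint64_t head = head_.load();
        if (head == tail) return std::nullopt;          // ring is empty
        T v = buf_[tail & mask_];
        tail_.store(tail + 1);                          // free the slot for the producer
        return v;
    }

private:
    std::vector<T> buf_;
    const uint64_t mask_;
    std::atomic<uint64_t> head_{0};  // written only by the producer
    std::atomic<uint64_t> tail_{0};  // written only by the consumer
};
```

Because each index has exactly one writer, no CAS loop is needed: the producer owns head, the consumer owns tail, and each only reads the other's index.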
The Go memory model states that its atomic operations have the same semantics as C++'s sequentially consistent memory ordering:
“All the atomic operations executed in a program behave as though executed in some sequentially consistent order. This definition provides the same semantics as C++’s sequentially consistent atomics and Java’s volatile variables.”
Before proceeding, it’s important to note that these results stem from a specific implementation of ring buffers and should not be viewed as microbenchmarks of atomic operations themselves. However, I prefer comparisons based on actual usage rather than synthetic microbenchmarks. YMMV.
Benchmark setup
The benchmarks are designed to measure average latency and the total time to handle a fixed throughput across the ring buffer, as measured from both sides (get and put). The two sides should essentially yield the same results.
The Go concurrency model is very different from C++'s threading model. To replicate C++ multithreading as closely as possible, both sides of the ring buffer in the Go benchmark are locked to OS threads using runtime.LockOSThread().
Both implementations yield the thread rather than busy-spin under contention, because busy spinning does not provide any notable performance improvement in Go. Some differences in the results might therefore be due to differences in scheduling performance between the two runtimes.
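The contention strategy can be sketched as a wait loop that yields on every failed attempt. This is an illustrative helper, not part of either codebase:

```cpp
#include <atomic>
#include <thread>

// Sketch of the shared contention strategy: instead of busy-spinning on a
// failed attempt, yield the thread back to the OS scheduler.
// Roughly analogous in spirit to calling runtime.Gosched() in Go.
template <typename Pred>
void wait_until(Pred ready) {
    while (!ready()) {
        std::this_thread::yield();
    }
}
```

A blocking put is then just `wait_until([&]{ return ring.try_put(v); })`, and likewise for get.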
Go vs C++ seq-cst
First, let’s look at the performance of fastchan’s main branch against cpp-fastchan’s bench-seq-cst branch. This compares fastchan’s Go-based atomic operations with C++’s default sequentially consistent ordering.
Excluding the size 16 ring buffer, the performance of the C++ and Go implementations is largely comparable, although C++ exhibits a slight speed advantage.
C++: Better control
Next, let’s compare the same Go-based fastchan implementation against cpp-fastchan with optimized memory ordering. The branch bench-optimized-mem-order optimizes the memory ordering, taking advantage of the more fine-grained control provided by the C++ memory model.
These results are significantly faster. This demonstrates how the flexibility of C++’s memory model can offer substantial performance benefits in high-performance, low-latency applications.
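To illustrate the kind of relaxation involved: in an SPSC ring, the producer is the only writer of the head index, so it can read its own index relaxed, read the consumer's index with acquire, and publish with release instead of paying for full sequential consistency on every operation. The code below is a sketch of that idea with invented names; the actual orderings chosen in the bench-optimized-mem-order branch may differ.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative producer-side index handling with relaxed/acquire/release
// ordering; names and structure are hypothetical, not cpp-fastchan's.
struct Indices {
    std::atomic<uint64_t> head{0};  // written only by the producer
    std::atomic<uint64_t> tail{0};  // written only by the consumer
};

bool producer_can_write(Indices& ix, uint64_t capacity, uint64_t& slot) {
    // Relaxed is enough for head: only this thread ever writes it.
    uint64_t head = ix.head.load(std::memory_order_relaxed);
    // Acquire synchronizes with the consumer's release store of tail.
    uint64_t tail = ix.tail.load(std::memory_order_acquire);
    if (head - tail == capacity) return false;  // ring is full
    slot = head;
    return true;
}

void producer_publish(Indices& ix) {
    // Release ensures the slot write is visible before the new head.
    ix.head.store(ix.head.load(std::memory_order_relaxed) + 1,
                  std::memory_order_release);
}
```

On x86 the load-side difference is small, but avoiding seq-cst stores (which compile to locked instructions or fences) is where most of the win comes from.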
Bonus C++: Even better control
What’s more, this is just the beginning: further general optimization of the C++ code yields even better results. Here’s another set of benchmarks for the same blocking ring buffer with the thread-yield strategy, including other optimizations I added over time up to this commit.
However, this might seem like an unfair comparison. Why did I not port the same optimizations back to Go? For instance, I replaced CAS with plain stores, simplified the logic, and reused variables.
The simple answer is that I have tried, where possible, and it did not improve the performance of the resulting Go code.
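As one example of the "reuse variables" idea, a common SPSC optimization is for each side to keep a plain, non-atomic cache of the other side's index and only refresh it from the atomic when the cached value suggests the ring is full (or empty). The sketch below shows the producer side; the names are illustrative and this is not necessarily how cpp-fastchan structures it.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative producer that caches the consumer's tail index so the fast
// path performs no atomic load at all; hypothetical names, not cpp-fastchan's.
struct Producer {
    std::atomic<uint64_t>* tail;  // shared index, written by the consumer
    uint64_t cached_tail = 0;     // private copy: no cross-core traffic
    uint64_t head = 0;            // written only by this producer
    uint64_t capacity;

    bool has_space() {
        if (head - cached_tail < capacity) return true;       // fast path
        cached_tail = tail->load(std::memory_order_acquire);  // refresh once
        return head - cached_tail < capacity;
    }
};
```

The fast path touches only thread-local data, so the shared cache line holding tail is read far less often.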
Conclusion
To sum up, both Go and C++ deliver robust capabilities for high-performance atomic operations. Nevertheless, fully exploiting C++’s memory model can yield remarkable performance benefits. This comparison underscores the importance of understanding the subtleties of different programming languages and their memory models when developing performance-sensitive applications.