Go vs C++ Atomics Performance

Although atomic operations should generally be avoided where possible due to their complexity, they can be crucial in high-performance, low-latency applications. They provide a mechanism for safely manipulating shared data across threads without locks, which can be expensive. Here I use two of my implementations of lock-free SPSC (single-producer, single-consumer) ring buffers to compare atomics performance between Go and C++.

The first project, fastchan, is a Go-based ring buffer optimized for the SPSC use case. Its counterpart, cpp-fastchan, started off as a closely modeled port of fastchan to C++.

Go documents its atomic operations as having the same semantics as C++'s sequentially consistent memory ordering:

All the atomic operations executed in a program behave as though executed in some sequentially consistent order. This definition provides the same semantics as C++’s sequentially consistent atomics and Java’s volatile variables.
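
In other words, C++'s std::atomic operations default to memory_order_seq_cst, so plain Go atomics map onto plain C++ atomics. A minimal sketch of that correspondence (illustrative, not code from either project):

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> counter{0};

void example() {
    // Both calls default to std::memory_order_seq_cst, which is the
    // semantics Go documents for sync/atomic's StoreUint64/LoadUint64.
    counter.store(1);             // same as counter.store(1, std::memory_order_seq_cst)
    uint64_t v = counter.load();  // same as counter.load(std::memory_order_seq_cst)
    (void)v;                      // silence the unused-variable warning
}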

Before proceeding, it’s important to note that these results stem from a specific implementation of ring buffers and should not be viewed as microbenchmarks of atomic operations themselves. However, I prefer comparisons based on actual usage rather than synthetic microbenchmarks. YMMV.

Benchmark setup

The benchmarks are designed to measure the average latency and the total time to push a fixed number of items through the ring buffer, measured from both sides (get and put). The two sides should yield essentially the same results.

The C++ benchmark uses the following code:

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <thread>

#include "fastchan.hpp"  // header name assumed; include cpp-fastchan's actual header

template <size_t min_size>
void benchmarkFastChanPut(int n) {
    fastchan::FastChan<uint8_t, min_size> c;

    std::thread reader([&]() {
        for (int i = 0; i < n; i++) {
            c.get();
        }
    });

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < n; i++) {
        c.put(0);
    }
    auto end = std::chrono::steady_clock::now();

    reader.join();

    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << "BenchmarkFastChanPut" << min_size << "\t" << n << "\t" << duration / n << " ns/op" << std::endl;
}

template <size_t min_size>
void benchmarkFastChanGet(int n) {
    fastchan::FastChan<uint8_t, min_size> c;

    std::thread writer([&]() {
        for (int i = 0; i < n; i++) {
            c.put(0);
        }
    });

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < n; i++) {
        c.get();
    }
    auto end = std::chrono::steady_clock::now();

    writer.join();

    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << "BenchmarkFastChanGet" << min_size << "\t" << n << "\t" << duration / n << " ns/op" << std::endl;
}

int main() {
    auto n = 5'000'000;
    benchmarkFastChanPut<16>(n);
    benchmarkFastChanPut<64>(n);
    benchmarkFastChanPut<256>(n);
    benchmarkFastChanPut<1024>(n);
    benchmarkFastChanPut<4096>(n);
    benchmarkFastChanPut<16'384>(n);
    benchmarkFastChanPut<65'536>(n);
    benchmarkFastChanPut<262'144>(n);
    benchmarkFastChanPut<1'048'576>(n);

    benchmarkFastChanGet<16>(n);
    benchmarkFastChanGet<64>(n);
    benchmarkFastChanGet<256>(n);
    benchmarkFastChanGet<1024>(n);
    benchmarkFastChanGet<4096>(n);
    benchmarkFastChanGet<16'384>(n);
    benchmarkFastChanGet<65'536>(n);
    benchmarkFastChanGet<262'144>(n);
    benchmarkFastChanGet<1'048'576>(n);

    return 0;
}

In a similar fashion, the Go benchmark uses the following code:

package main

import (
	"fmt"
	"runtime"
	"time"
	// plus the import of the fastchan package itself (path omitted here)
)

func benchmarkFastChanPut(size uint64, n int) {
	runtime.LockOSThread() // pin this (writer) side to an OS thread too
	c := fastchan.NewFastChan(size)

	go func() {
		runtime.LockOSThread()
		for i := 0; i < n; i++ {
			c.Read()
		}
	}()

	start := time.Now()
	for i := 0; i < n; i++ {
		c.Put(0)
	}

	duration := time.Since(start)
	fmt.Printf("BenchmarkFastChanGet%d\t%d\t%d ns/op\n", size, n, duration.Nanoseconds()/int64(n))
}

func benchmarkFastChanGet(size uint64, n int) {
	runtime.LockOSThread() // pin this (reader) side to an OS thread too
	c := fastchan.NewFastChan(size)

	go func() {
		runtime.LockOSThread()
		for i := 0; i < n; i++ {
			c.Put(0)
		}
	}()

	start := time.Now()
	for i := 0; i < n; i++ {
		c.Read()
	}
	duration := time.Since(start)
	fmt.Printf("BenchmarkFastChanGet%d\t%d\t%d ns/op\n", size, n, duration.Nanoseconds()/int64(n))
}

func main() {
	n := 5_000_000

	benchmarkFastChanPut(16, n)
	benchmarkFastChanPut(64, n)
	benchmarkFastChanPut(256, n)
	benchmarkFastChanPut(1024, n)
	benchmarkFastChanPut(4096, n)
	benchmarkFastChanPut(16_384, n)
	benchmarkFastChanPut(65_536, n)
	benchmarkFastChanPut(262_144, n)
	benchmarkFastChanPut(1_048_576, n)

	benchmarkFastChanGet(16, n)
	benchmarkFastChanGet(64, n)
	benchmarkFastChanGet(256, n)
	benchmarkFastChanGet(1024, n)
	benchmarkFastChanGet(4096, n)
	benchmarkFastChanGet(16_384, n)
	benchmarkFastChanGet(65_536, n)
	benchmarkFastChanGet(262_144, n)
	benchmarkFastChanGet(1_048_576, n)
}

There are a couple of caveats:

  • The Go concurrency model is very different from C++ threads. To replicate C++-style multithreading as closely as possible, both sides of the ring buffer in the Go benchmark are locked to OS threads using runtime.LockOSThread().
  • Both implementations yield the thread rather than busy-spin during contention, because busy spinning does not provide any notable performance improvement in Go. Some differences in the results might therefore come down to differences in scheduling performance between the two runtimes; see the sketch after this list.
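
For reference, the yield-based wait strategy on both sides boils down to a loop like the following (a minimal C++ sketch with illustrative names, not the actual cpp-fastchan code):

#include <atomic>
#include <cstddef>
#include <thread>

// Instead of busy-spinning on the atomic index, hand the CPU back to
// the scheduler between checks. The function and parameter names here
// are illustrative.
inline void wait_until_reaches(const std::atomic<std::size_t>& index, std::size_t target) {
    while (index.load() < target) {
        std::this_thread::yield();
    }
}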

Go vs C++ seq-cst

First, let’s look at the performance of fastchan’s main branch against cpp-fastchan’s bench-seq-cst branch. This compares fastchan’s Go-based atomic operations with C++’s default sequentially consistent ordering.

Go:

BenchmarkFastChanPut16	5000000	883 ns/op
BenchmarkFastChanPut64	5000000	164 ns/op
BenchmarkFastChanPut256	5000000	114 ns/op
BenchmarkFastChanPut1024	5000000	137 ns/op
BenchmarkFastChanPut4096	5000000	125 ns/op
BenchmarkFastChanPut16384	5000000	122 ns/op
BenchmarkFastChanPut65536	5000000	116 ns/op
BenchmarkFastChanPut262144	5000000	118 ns/op
BenchmarkFastChanPut1048576	5000000	106 ns/op
BenchmarkFastChanGet16	5000000	808 ns/op
BenchmarkFastChanGet64	5000000	174 ns/op
BenchmarkFastChanGet256	5000000	140 ns/op
BenchmarkFastChanGet1024	5000000	130 ns/op
BenchmarkFastChanGet4096	5000000	123 ns/op
BenchmarkFastChanGet16384	5000000	127 ns/op
BenchmarkFastChanGet65536	5000000	122 ns/op
BenchmarkFastChanGet262144	5000000	127 ns/op
BenchmarkFastChanGet1048576	5000000	107 ns/op

C++:

BenchmarkFastChanPut16	5000000	119 ns/op
BenchmarkFastChanPut64	5000000	116 ns/op
BenchmarkFastChanPut256	5000000	115 ns/op
BenchmarkFastChanPut1024	5000000	114 ns/op
BenchmarkFastChanPut4096	5000000	108 ns/op
BenchmarkFastChanPut16384	5000000	144 ns/op
BenchmarkFastChanPut65536	5000000	109 ns/op
BenchmarkFastChanPut262144	5000000	110 ns/op
BenchmarkFastChanPut1048576	5000000	131 ns/op
BenchmarkFastChanGet16	5000000	130 ns/op
BenchmarkFastChanGet64	5000000	113 ns/op
BenchmarkFastChanGet256	5000000	113 ns/op
BenchmarkFastChanGet1024	5000000	109 ns/op
BenchmarkFastChanGet4096	5000000	106 ns/op
BenchmarkFastChanGet16384	5000000	104 ns/op
BenchmarkFastChanGet65536	5000000	104 ns/op
BenchmarkFastChanGet262144	5000000	104 ns/op
BenchmarkFastChanGet1048576	5000000	94 ns/op

Excluding the size 16 ring buffer, the performance of the C++ and Go implementations is largely comparable, although C++ exhibits a slight speed advantage.

C++: Better control

Next, let's compare the same Go-based fastchan implementation against cpp-fastchan with optimized memory ordering. The bench-optimized-mem-order branch relaxes the memory ordering where the SPSC invariants allow it, taking advantage of the finer-grained control the C++ memory model provides.

BenchmarkFastChanPut16	5000000	72 ns/op
BenchmarkFastChanPut64	5000000	71 ns/op
BenchmarkFastChanPut256	5000000	92 ns/op
BenchmarkFastChanPut1024	5000000	81 ns/op
BenchmarkFastChanPut4096	5000000	85 ns/op
BenchmarkFastChanPut16384	5000000	87 ns/op
BenchmarkFastChanPut65536	5000000	69 ns/op
BenchmarkFastChanPut262144	5000000	83 ns/op
BenchmarkFastChanPut1048576	5000000	83 ns/op
BenchmarkFastChanGet16	5000000	88 ns/op
BenchmarkFastChanGet64	5000000	67 ns/op
BenchmarkFastChanGet256	5000000	76 ns/op
BenchmarkFastChanGet1024	5000000	80 ns/op
BenchmarkFastChanGet4096	5000000	79 ns/op
BenchmarkFastChanGet16384	5000000	68 ns/op
BenchmarkFastChanGet65536	5000000	75 ns/op
BenchmarkFastChanGet262144	5000000	82 ns/op
BenchmarkFastChanGet1048576	5000000	85 ns/op

These results are significantly better, demonstrating how the flexibility of C++'s memory model can offer substantial performance benefits in high-performance, low-latency applications.
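
What does "optimized memory ordering" mean here? In an SPSC ring buffer each index has exactly one writer, so sequential consistency is stronger than needed: each side only has to publish its own index with release semantics and observe the other's with acquire semantics. A minimal sketch of the idea for a yield-based blocking queue (illustrative code, not the actual cpp-fastchan implementation):

#include <atomic>
#include <cstddef>
#include <thread>

template <typename T, std::size_t N>
class SpscSketch {
    T buffer_[N];
    std::atomic<std::size_t> head_{0};  // advanced only by the producer
    std::atomic<std::size_t> tail_{0};  // advanced only by the consumer

public:
    void put(const T& v) {
        // Our own index: no other thread writes it, so relaxed is enough.
        auto head = head_.load(std::memory_order_relaxed);
        // Acquire pairs with the consumer's release store of tail_,
        // guaranteeing the consumer is done reading the slot we reuse.
        while (head - tail_.load(std::memory_order_acquire) >= N) {
            std::this_thread::yield();
        }
        buffer_[head % N] = v;
        // Release publishes the buffer_ write to the consumer.
        head_.store(head + 1, std::memory_order_release);
    }

    T get() {
        auto tail = tail_.load(std::memory_order_relaxed);
        // Acquire pairs with the producer's release store of head_.
        while (head_.load(std::memory_order_acquire) == tail) {
            std::this_thread::yield();
        }
        T v = buffer_[tail % N];
        // Release tells the producer this slot may be reused.
        tail_.store(tail + 1, std::memory_order_release);
        return v;
    }
};

Relaxed loads of a thread's own index are safe because no other thread ever stores to it; the acquire/release pair on the opposite index is what establishes the happens-before edge for the element data.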

Bonus C++: Even better control

What's more, this is just the beginning: optimizing the C++ code further yields even better results. Here's another set of benchmarks for the same blocking ring buffer with the thread-yield strategy, including other optimizations I added over time, up to this commit.

BenchmarkFastChan-Put16	5000000	25 ns/op
BenchmarkFastChan-Put64	5000000	8 ns/op
BenchmarkFastChan-Put256	5000000	5 ns/op
BenchmarkFastChan-Put1024	5000000	6 ns/op
BenchmarkFastChan-Put4096	5000000	6 ns/op
BenchmarkFastChan-Put16384	5000000	5 ns/op
BenchmarkFastChan-Put65536	5000000	5 ns/op
BenchmarkFastChan-Put262144	5000000	6 ns/op
BenchmarkFastChan-Put1048576	5000000	6 ns/op
BenchmarkFastChan-Get16	5000000	21 ns/op
BenchmarkFastChan-Get64	5000000	8 ns/op
BenchmarkFastChan-Get256	5000000	6 ns/op
BenchmarkFastChan-Get1024	5000000	6 ns/op
BenchmarkFastChan-Get4096	5000000	6 ns/op
BenchmarkFastChan-Get16384	5000000	5 ns/op
BenchmarkFastChan-Get65536	5000000	6 ns/op
BenchmarkFastChan-Get262144	5000000	7 ns/op
BenchmarkFastChan-Get1048576	5000000	6 ns/op

Now that is fast!

However, this might seem like an unfair comparison. Why did I not port the same optimizations back to Go? For instance, I replaced CAS with a plain store, simplified the logic, and reused variables.

The simple answer is that I have tried, where possible, and it did not improve the performance of the resulting Go code.
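
To make the CAS-to-store idea concrete: with a single producer, only one thread ever advances the write index, so a compare-and-swap loop can become a plain load and store, and the last observed consumer index can be cached in a non-atomic variable to skip the atomic load on the fast path. A hedged sketch, extending the SpscSketch above with an illustrative non-atomic cached_tail_ member (not the exact code from the commit):

// Producer-side put() without CAS: the single-producer invariant means
// only this thread writes head_, so a plain load + store suffices.
// Assumes an added member: std::size_t cached_tail_{0};
void put(const T& v) {
    auto head = head_.load(std::memory_order_relaxed);
    if (head - cached_tail_ >= N) {
        // Only refresh the cached consumer index when the buffer looks full.
        while (head - (cached_tail_ = tail_.load(std::memory_order_acquire)) >= N) {
            std::this_thread::yield();
        }
    }
    buffer_[head % N] = v;
    head_.store(head + 1, std::memory_order_release);
}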

Conclusion

To sum up, both Go and C++ offer robust support for high-performance atomic operations. Fully exploiting C++'s memory model, however, can yield remarkable performance benefits. This comparison underscores how important it is to understand the subtleties of different languages and their memory models when building performance-sensitive applications.