Performance impact of C++ Memory Order vs x86 Strong Ordering

Recently, in one of the teams I work closely with, an interesting question came up that went roughly like this: given that x86 has strong ordering, is there any point, from a performance perspective, in using anything other than the C++ default (sequentially consistent) memory ordering?

My answer was: yes, absolutely.

So, as usual, here’s me making a post out of the rest.

The Misconception

The C++ memory model is an abstraction designed to work predictably across a wide variety of hardware. While there’s some relation between it and the x86 ordering guarantees, the two are not the same. Although x86 provides a strong memory model, it doesn’t inherently offer the strict guarantees of sequentially consistent (std::memory_order_seq_cst) memory ordering.

Memory Order in C++

C++ defines several options to specify the order of memory accesses around atomic operations: sequentially consistent (std::memory_order_seq_cst), acquire-release (std::memory_order_acquire / std::memory_order_release), and relaxed (std::memory_order_relaxed). We’re gonna ignore consume (std::memory_order_consume) for now.
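
For reference, sequentially consistent is also what you get when you don’t pass an order at all; the two stores in this minimal sketch are equivalent:

#include <atomic>

std::atomic<int> counter(0);

void store_defaults() {
    counter.store(1);                             // defaults to seq_cst
    counter.store(1, std::memory_order_seq_cst);  // explicit equivalent
}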

These options constrain not just the runtime order of memory accesses (and consequently instructions) on a multiprocessor system but also the order of the instructions generated by the compiler. This is a subtle and important point. The compiler can reorder instructions at compile time, and the CPU can also reorder instructions at runtime; the memory order you specify constrains both kinds of reordering.

Sequential Consistency

Sequential consistency guarantees that all threads observe all modifications in the same order, effectively ensuring a single total order of all operations. This is the strongest form of ordering. x86’s strong memory model doesn’t give you this for free: a later load may still be reordered with an earlier store to a different location (StoreLoad reordering, via the store buffer), which is exactly the reordering that sequential consistency forbids.
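
The classic store-buffering litmus test makes this concrete. In the sketch below (thread1 and thread2 are hypothetical functions assumed to run concurrently on different cores), sequential consistency forbids the outcome r1 == 0 && r2 == 0; with release stores and acquire loads that outcome is allowed, and on x86 it can actually be observed:

#include <atomic>

std::atomic<int> x(0), y(0);
int r1, r2;

void thread1() {
    x.store(1, std::memory_order_seq_cst);
    r1 = y.load(std::memory_order_seq_cst);
}

void thread2() {
    y.store(1, std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_seq_cst);
}

// Under seq_cst, at least one thread must observe the other's store,
// so r1 == 0 && r2 == 0 is impossible. Weaken the stores to release
// and the loads to acquire, and each store can wait in its core's
// store buffer while the following load executes; both loads can
// then read 0.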

Sequentially consistent operations in C++ often compile to lock-prefixed or lock-equivalent instructions or memory fence instructions on x86. These instructions can be significantly more expensive than their acquire-release or relaxed counterparts. Acquire-release semantics are usually enforced by the hardware on x86, meaning the compiler doesn’t need to generate additional instructions.

Consider the following example:

#include <atomic>

std::atomic<int> x(0);

void seq_cst_example() {
    x.store(1, std::memory_order_seq_cst);
}

void acq_rel_example() {
    x.store(1, std::memory_order_release);
}

When compiled with optimizations (-O2), the generated assembly could look something like this:

seq_cst_example():
        mov     eax, 1
        xchg    eax, DWORD PTR x[rip]
        ret
acq_rel_example():
        mov     DWORD PTR x[rip], 1
        ret
x:
        .zero   4

In the above code, the sequentially consistent store is translated to an XCHG instruction, whereas the release store is just a regular MOV. The XCHG instruction automatically asserts LOCK semantics when referencing memory (section 9.1.2.1 of Volume 3 of the Intel SDM); compilers may equivalently emit a MOV followed by an MFENCE. A locked operation historically locked the bus; on modern processors it is usually implemented as a cache lock instead, but either way it is a significantly more expensive operation than a plain store.
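
Relatedly, if you ever need a standalone sequentially consistent fence, std::atomic_thread_fence typically compiles to exactly that MFENCE on x86-64, which is a handy way to look at the fence cost in isolation:

#include <atomic>

void fence_example() {
    // Typically emitted as a single mfence instruction on x86-64
    // by gcc and clang.
    std::atomic_thread_fence(std::memory_order_seq_cst);
}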

Acquire-Release

Acquire-release semantics, on the other hand, ensure that memory operations before a store-release are visible to a load-acquire that reads the value written by that store. In essence, they introduce pairwise synchronization points but don’t dictate a strict total order of all memory operations.

This maps more closely to the x86 ordering model. As in the above example, the release store is a simple MOV; the x86 architecture naturally enforces release semantics for ordinary stores thanks to its strong memory model.
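
The canonical use case for this pairing is message passing: a producer publishes data with a store-release, and a consumer picks it up with a load-acquire. A minimal sketch (the producer/consumer names are mine):

#include <atomic>

int payload = 0;
std::atomic<bool> ready(false);

void producer() {
    payload = 42;                                  // plain write
    ready.store(true, std::memory_order_release);  // publish
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {
        // spin until the flag is set
    }
    // The acquire load that read 'true' synchronizes with the
    // release store, so payload is guaranteed to be 42 here.
    int v = payload;
    (void)v;
}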

Relaxed

So if x86 inherently provides C++ acquire-release equivalent semantics, does it mean there’s no real point to using relaxed semantics on x86 from a performance perspective?

Before I answer that, a quick warning: unless you are sure of what you’re doing (and even when you are), it’s probably best to avoid relaxed as much as possible. It makes the code significantly harder to reason about, and while the acquire-release-like ordering x86 inherently provides will usually mask mistakes, it is very easy to create subtle bugs this way.

That said, even on x86, relaxed semantics can have different performance characteristics compared to acquire-release. Remember my earlier point about the distinction between compiler ordering and x86 ordering? While x86 ordering applies at runtime, specifying relaxed for your memory access provides the compiler with more freedom to reorder instructions at compile time to optimize performance.

Consider the following code which illustrates both the potential performance impact and the subtle bugs that can be introduced when employing relaxed semantics:

#include <atomic>

std::atomic<int> x;
int a = 0;

void test_release() {
  for (auto i = 0; i < 5; i++) {
    a = 0;
    x.store(1, std::memory_order_release);
    a = 1;
  }
}

void test_relaxed() {
  for (auto i = 0; i < 5; i++) {
    a = 0;
    x.store(1, std::memory_order_relaxed);
    a = 1;
  }
}

When compiled with -O2, the code looks like this:

test_release():
        mov     eax, 5
.L2:
        mov     DWORD PTR a[rip], 0
        mov     DWORD PTR x[rip], 1
        mov     DWORD PTR a[rip], 1
        sub     eax, 1
        jne     .L2
        ret
test_relaxed():
        mov     eax, 5
.L6:
        mov     DWORD PTR x[rip], 1
        mov     DWORD PTR a[rip], 1
        sub     eax, 1
        jne     .L6
        ret
a:
        .zero   4
x:
        .zero   4

Or a slightly more obvious version, from a compiler that fully unrolls the loop:

test_release():                      # @test_release()
        mov     dword ptr [rip + a], 0
        mov     dword ptr [rip + x], 1
        mov     dword ptr [rip + a], 0
        mov     dword ptr [rip + x], 1
        mov     dword ptr [rip + a], 0
        mov     dword ptr [rip + x], 1
        mov     dword ptr [rip + a], 0
        mov     dword ptr [rip + x], 1
        mov     dword ptr [rip + a], 0
        mov     dword ptr [rip + x], 1
        mov     dword ptr [rip + a], 1
        ret
test_relaxed():                      # @test_relaxed()
        mov     dword ptr [rip + x], 1
        mov     dword ptr [rip + x], 1
        mov     dword ptr [rip + x], 1
        mov     dword ptr [rip + x], 1
        mov     dword ptr [rip + x], 1
        mov     dword ptr [rip + a], 1
        ret
x:
        .zero   4

a:
        .long   0                               # 0x0

Which one looks faster? Obviously the relaxed version. Once we tell the compiler that the ordering of memory accesses around x doesn’t matter, it eliminates every store to a except the final a = 1.

Hopefully it’s also obvious why this could create a bug: the relaxed version never sets a = 0, so if anything depended on observing this variable change from 0 to 1, that dependency is now broken. While this example is simplistic, it’s important to carefully consider the impact of relaxed usage.
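
For balance, there are situations where relaxed is generally considered safe, such as a standalone event counter that doesn’t publish any other data. A minimal sketch:

#include <atomic>
#include <cstdint>

std::atomic<std::uint64_t> events(0);

void record_event() {
    // Nothing else is synchronized through this counter; only the
    // increment itself needs to be atomic, so relaxed is enough.
    events.fetch_add(1, std::memory_order_relaxed);
}

std::uint64_t snapshot() {
    return events.load(std::memory_order_relaxed);
}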

Correctness Trumps Performance

In real-world scenarios, the proper use of memory ordering should be dictated by the correctness of the multithreaded interaction before anything else. Performance gains, if any, should be a secondary consideration.

While relaxed and acquire-release semantics can provide performance benefits, it’s important to understand the trade-offs. Using these semantics requires an understanding of the memory model and can make the code more complex and harder to reason about. While acquire-release semantics can be more efficient on x86, don’t use them when you actually need sequential consistency.

However, when performance is critical and the code is well-tested, carefully reviewed and can be reasoned about, the performance benefits of using a looser memory order constraint over sequential consistency can be significant.

Conclusion

When it comes to performance, a case-by-case evaluation is essential. While it’s useful to understand the general behavior and performance characteristics, the exact impact can vary depending on the specific code and usage scenario. Profiling is your best friend in these situations. Only thorough measurement will show you the actual performance implications of using different memory orders in your specific context.
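
As a starting point for such measurements, here’s a deliberately naive single-threaded sketch that times the two stores from the first example; the timing helper and iteration count are my own choices, not from any particular benchmark suite:

#include <atomic>
#include <chrono>
#include <cstdio>

std::atomic<int> x(0);

template <typename F>
static long long time_ns(F f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}

int main() {
    constexpr int N = 100000000;
    // In practice compilers don't coalesce atomic stores, so each
    // loop really performs N stores.
    auto seq = time_ns([] {
        for (int i = 0; i < N; i++) x.store(1, std::memory_order_seq_cst);
    });
    auto rel = time_ns([] {
        for (int i = 0; i < N; i++) x.store(1, std::memory_order_release);
    });
    std::printf("seq_cst: %lld ns, release: %lld ns\n", seq, rel);
}

On x86 you should expect the seq_cst loop to be noticeably slower, since each iteration executes an XCHG rather than a plain MOV. A real evaluation would, of course, measure your actual multithreaded workload rather than a tight store loop.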