https://godbolt.org/g/QcLkce

```cpp
#include <atomic>

std::atomic<int> progress{-1};

void combine_writes1() {
    // These should be reduced to a single store(0, release),
    // at least as long as release-sequence includes same-thread
    // relaxed stores. If that is removed, this should just be
    // a single relaxed store.
    progress.store(0, std::memory_order_relaxed);
    progress.store(0, std::memory_order_relaxed);
    progress.store(0, std::memory_order_release);
    progress.store(0, std::memory_order_release);
    progress.store(0, std::memory_order_relaxed);
    progress.store(0, std::memory_order_relaxed);
}

void combine_writes2() {
    // Ditto above, but should store 5.
    progress.store(0, std::memory_order_relaxed);
    progress.store(1, std::memory_order_relaxed);
    progress.store(2, std::memory_order_release);
    progress.store(3, std::memory_order_release);
    progress.store(4, std::memory_order_relaxed);
    progress.store(5, std::memory_order_relaxed);
}

void combine_loads() {
    // These should be reduced to either a single acquire-load
    // or an acquire fence.
    progress.load(std::memory_order_relaxed);
    progress.load(std::memory_order_relaxed);
    progress.load(std::memory_order_acquire);
    progress.load(std::memory_order_acquire);
    progress.load(std::memory_order_relaxed);
    progress.load(std::memory_order_relaxed);
}
```

Generated x86-64 assembly:

```asm
combine_writes1():                      # @combine_writes1()
        mov     dword ptr [rip + progress], 0
        mov     dword ptr [rip + progress], 0
        mov     dword ptr [rip + progress], 0
        mov     dword ptr [rip + progress], 0
        mov     dword ptr [rip + progress], 0
        mov     dword ptr [rip + progress], 0
        ret
combine_writes2():                      # @combine_writes2()
        mov     dword ptr [rip + progress], 0
        mov     dword ptr [rip + progress], 1
        mov     dword ptr [rip + progress], 2
        mov     dword ptr [rip + progress], 3
        mov     dword ptr [rip + progress], 4
        mov     dword ptr [rip + progress], 5
        ret
combine_loads():                        # @combine_loads()
        mov     eax, dword ptr [rip + progress]
        mov     eax, dword ptr [rip + progress]
        mov     eax, dword ptr [rip + progress]
        mov     eax, dword ptr [rip + progress]
        mov     eax, dword ptr [rip + progress]
        mov     eax, dword ptr [rip + progress]
        ret
progress:
        .long   4294967295              # 0xffffffff
```

Possibly related to https://bugs.llvm.org/show_bug.cgi?id=37690
We already lower consecutive fences (https://reviews.llvm.org/D29314), so the request is not unreasonable. If somebody wants to take a stab at this, it should follow the same pattern I used there (i.e., a peephole in InstCombine, or InstSimplify, since this doesn't introduce new instructions). FWIW, the ARM backend has a pass for strength-reducing atomic intrinsics as well, although I think these transformations should be target-independent.
We can definitely do this (see http://wg21.link/n4455), but is this something that you've seen happen in real code? I totally believe it can happen, but I'm not super interested in optimizing code which never happens :-) Also, consider http://wg21.link/P0062 when doing any of this.
This seems to be affecting performance in some real Rust code. The code has an atomic pointer field where the least significant bit is used to store a flag. The pointer value can change, but all writers are guaranteed to preserve the value of the lowest bit. Thus, the function that reads that flag can use a relaxed atomic load. However, the author instead opted to use a non-atomic load, even though that's UB, because "simple benchmarks show up to a 10% slowdown using a `Relaxed` atomic load on x86". This is probably because multiple redundant loads are not being merged. Details at: https://internals.rust-lang.org/t/bit-wise-reasoning-for-atomic-accesses/8853

Also, although this isn't real code, I can easily imagine this coming up with "init once"-type loads. For example, in this C++ code:

```cpp
extern int calculate_magic_number();

static int multiply_by_magic_number(int a) {
    static int magic = calculate_magic_number();
    return a * magic;
}

int get_val() {
    return multiply_by_magic_number(2) + multiply_by_magic_number(3);
}
```

`multiply_by_magic_number` uses an atomic load to test whether `magic` was already initialized. This is then inlined twice into `get_val`. Ideally, we would only need one load of the guard variable, at least in the fast path where it has already been initialized.