Bug 37716 - Poor codegen with atomics
Summary: Poor codegen with atomics
Status: NEW
Alias: None
Product: libraries
Classification: Unclassified
Component: Scalar Optimizations
Version: trunk
Hardware: PC Linux
Importance: P enhancement
Assignee: Unassigned LLVM Bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-06-06 13:12 PDT by Mathias Stearn
Modified: 2018-12-03 15:16 PST
CC List: 8 users

See Also:
Fixed By Commit(s):



Description Mathias Stearn 2018-06-06 13:12:33 PDT
https://godbolt.org/g/QcLkce

#include <atomic>

std::atomic<int> progress{-1};

void combine_writes1() {
    // These should be reduced to a single store(0, release),
    // at least as long as release sequences include same-thread
    // relaxed stores. If that guarantee is removed, this should
    // just be a single relaxed store.
    progress.store(0, std::memory_order_relaxed);
    progress.store(0, std::memory_order_relaxed);
    progress.store(0, std::memory_order_release);
    progress.store(0, std::memory_order_release);
    progress.store(0, std::memory_order_relaxed);
    progress.store(0, std::memory_order_relaxed);
}

void combine_writes2() {
    // Ditto above, but should store 5.
    progress.store(0, std::memory_order_relaxed);
    progress.store(1, std::memory_order_relaxed);
    progress.store(2, std::memory_order_release);
    progress.store(3, std::memory_order_release);
    progress.store(4, std::memory_order_relaxed);
    progress.store(5, std::memory_order_relaxed);
}

void combine_loads() {
    // These should be reduced to either a single acquire-load
    // or an acquire fence. 
    progress.load(std::memory_order_relaxed);
    progress.load(std::memory_order_relaxed);
    progress.load(std::memory_order_acquire);
    progress.load(std::memory_order_acquire);
    progress.load(std::memory_order_relaxed);
    progress.load(std::memory_order_relaxed);
}

combine_writes1(): # @combine_writes1()
  mov dword ptr [rip + progress], 0
  mov dword ptr [rip + progress], 0
  mov dword ptr [rip + progress], 0
  mov dword ptr [rip + progress], 0
  mov dword ptr [rip + progress], 0
  mov dword ptr [rip + progress], 0
  ret
combine_writes2(): # @combine_writes2()
  mov dword ptr [rip + progress], 0
  mov dword ptr [rip + progress], 1
  mov dword ptr [rip + progress], 2
  mov dword ptr [rip + progress], 3
  mov dword ptr [rip + progress], 4
  mov dword ptr [rip + progress], 5
  ret
combine_loads(): # @combine_loads()
  mov eax, dword ptr [rip + progress]
  mov eax, dword ptr [rip + progress]
  mov eax, dword ptr [rip + progress]
  mov eax, dword ptr [rip + progress]
  mov eax, dword ptr [rip + progress]
  mov eax, dword ptr [rip + progress]
  ret
progress:
  .long 4294967295 # 0xffffffff

Possibly related to https://bugs.llvm.org/show_bug.cgi?id=37690
Comment 1 Davide Italiano 2018-06-07 14:38:09 PDT
We already lower consecutive fences (https://reviews.llvm.org/D29314), so the request is not unreasonable. If somebody wants to take a stab at this, it should follow the same pattern I used there (i.e. a peephole in InstCombine, or in InstSimplify, since this doesn't introduce new instructions).

FWIW, the ARM backend has a pass for strength-reducing atomic intrinsics as well, although I think these transformations should be target-independent.
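
A rough, standalone sketch of the merging rule such a peephole would implement (this is a toy model, not LLVM code; all types and names below are invented for illustration):

#include <algorithm>
#include <vector>

// Toy stand-in for an atomic store instruction; in LLVM this would be
// a StoreInst carrying an AtomicOrdering.
enum class Order { Relaxed = 0, Release = 1 };

struct AtomicStore {
    int location;  // stand-in for the pointer operand
    int value;
    Order order;
};

// Collapse runs of immediately adjacent stores to the same location:
// the later value wins and the stronger ordering is kept, so a release
// store still publishes everything that preceded it. A real transform
// would also have to check that nothing observable sits between the
// two stores.
std::vector<AtomicStore> mergeAdjacentStores(const std::vector<AtomicStore>& in) {
    std::vector<AtomicStore> out;
    for (const AtomicStore& s : in) {
        if (!out.empty() && out.back().location == s.location) {
            out.back().value = s.value;
            out.back().order = std::max(out.back().order, s.order);
        } else {
            out.push_back(s);
        }
    }
    return out;
}

Run on the six stores in combine_writes2 above, this collapses them to a single release store of 5, which matches what the report asks for.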
Comment 2 JF Bastien 2018-06-08 00:18:27 PDT
We can definitely do this (see http://wg21.link/n4455), but is this something that you've seen happen in real code? I totally believe it can happen, but I'm not super interested in optimizing code which never happens :-)

Also, consider http://wg21.link/P0062 when doing any of this.
Comment 3 comex 2018-11-20 16:25:27 PST
This seems to be affecting performance in some real Rust code.  The code has an atomic pointer field where the least significant bit is used to store a flag.  The pointer value can change, but all writers are guaranteed to preserve the value of the lowest bit.  Thus, the function that reads that flag can use a relaxed atomic load.  However, the author instead opted to use a non-atomic load, even though that's UB, because "simple benchmarks show up to a 10% slowdown using a `Relaxed` atomic load on x86".  This is probably because multiple redundant loads are not being merged.

Details at:
https://internals.rust-lang.org/t/bit-wise-reasoning-for-atomic-accesses/8853
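
To make the pattern concrete, here is a rough C++ rendering of what the Rust code is described as doing (every identifier here is invented; the real code is Rust and lives in the linked thread):

#include <atomic>
#include <cstdint>

// Atomic word holding a pointer value whose least significant bit is a
// flag that every writer preserves.
std::atomic<std::uintptr_t> tagged{0};

bool flag_is_set() {
    // Only the preserved low bit is inspected, so relaxed is enough.
    return (tagged.load(std::memory_order_relaxed) & 1u) != 0;
}

int count_flag_twice() {
    // After inlining, this performs two relaxed loads of `tagged`.
    // Folding them into a single load is the kind of merge this report
    // asks for, and presumably what the benchmark above is missing.
    return int(flag_is_set()) + int(flag_is_set());
}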

Also, although this isn't real code, I can easily imagine this coming up with "init once"-type loads.  For example, in this C++ code:

extern int calculate_magic_number();
static int multiply_by_magic_number(int a) {
    static int magic = calculate_magic_number();
    return a * magic;
}
int get_val() {
    return multiply_by_magic_number(2) + multiply_by_magic_number(3);
}

`multiply_by_magic_number` uses an atomic load to test whether `magic` was already initialized.  This is then inlined twice into `get_val`.  Ideally, we would only need one load of the guard variable, at least in the fast path where it has already been initialized.
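
For reference, a rough sketch of how the lowering of that local static looks; this models the Itanium-ABI guard variable with a std::atomic<bool> and ignores how racing initializers are serialized (the real lowering calls __cxa_guard_acquire/__cxa_guard_release), so all names below are illustrative:

#include <atomic>

extern int calculate_magic_number();

// Stand-ins for the guard byte and the storage the compiler emits.
static std::atomic<bool> magic_ready{false};
static int magic_storage;

static int multiply_by_magic_number_lowered(int a) {
    // Fast path: an acquire load of the guard on every call.
    if (!magic_ready.load(std::memory_order_acquire)) {
        // Slow path (racing initializers not handled in this sketch).
        magic_storage = calculate_magic_number();
        magic_ready.store(true, std::memory_order_release);
    }
    return a * magic_storage;
}

int get_val_lowered() {
    // Once both calls are inlined, the already-initialized fast path
    // performs two acquire loads of the guard; ideally the second
    // would be folded into the first.
    return multiply_by_magic_number_lowered(2) + multiply_by_magic_number_lowered(3);
}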