This simple test:

// ---- begin
#include <intrin.h>

extern "C" unsigned long long f(unsigned long long a, int b)
{
    return _rotl64(a, b);
}

extern "C" unsigned long long g(unsigned long long a, int b)
{
    return (a << (b & 63)) | (a >> (-b & 63));
}
// ---- end

produces (on x86-64 using clang 6.0 release; only quoting the relevant bits):

# ---- begin
f:                                      # @f
# %bb.0:
        movq    %rcx, %r8
        andl    $63, %edx
        movq    %r8, %rax
        movl    %edx, %ecx
        rolq    %cl, %rax
        testl   %edx, %edx
        cmoveq  %r8, %rax
        retq

g:                                      # @g
# %bb.0:
        movq    %rcx, %rax
        movl    %edx, %ecx
        rolq    %cl, %rax
        retq
# ---- end

The corresponding IR is:

; ---- begin
; Function Attrs: norecurse nounwind readnone sspstrong uwtable
define i64 @f(i64, i32) local_unnamed_addr #0 {
  %3 = and i32 %1, 63
  %4 = zext i32 %3 to i64
  %5 = sub nsw i64 64, %4
  %6 = shl i64 %0, %4
  %7 = lshr i64 %0, %5
  %8 = or i64 %7, %6
  %9 = icmp eq i32 %3, 0
  %10 = select i1 %9, i64 %0, i64 %8
  ret i64 %10
}

; Function Attrs: norecurse nounwind readnone sspstrong uwtable
define i64 @g(i64, i32) local_unnamed_addr #0 {
  %3 = and i32 %1, 63
  %4 = zext i32 %3 to i64
  %5 = shl i64 %0, %4
  %6 = sub nsw i32 0, %1
  %7 = and i32 %6, 63
  %8 = zext i32 %7 to i64
  %9 = lshr i64 %0, %8
  %10 = or i64 %5, %9
  ret i64 %10
}
; ---- end

The problem is the expansion chosen for the rotr/rotl intrinsics in CGBuiltin.cpp's CodeGenFunction::EmitBuiltinExpr, presumably picked to avoid implementation-specific behavior from the right shift by 64 - b. Note that the alternative expansion for rotate-left given in the code for g avoids the problematic select, is well-defined, and already gets matched to ROL (in the x86-64 backend, anyway), so it seems like a good alternative.
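For reference, here is a minimal sketch (not part of the original report; the template name rotl is hypothetical) generalizing the well-defined masked-shift expansion used in g above to other unsigned widths. Masking the shift amounts keeps both shifts strictly smaller than the bit width, so a rotate by 0 never triggers the out-of-range shift that the select in f is guarding against:

// ---- begin (sketch)
#include <climits>

// Intended for unsigned T; the masked shift amounts make the k == 0 case
// well-defined. For types narrower than int, integer promotion plus the final
// truncation back to T still yields the correct rotated value.
template <typename T>
T rotl(T x, int k) {
    const int mask = CHAR_BIT * sizeof(T) - 1;   // 63 for uint64_t, 31 for uint32_t, ...
    return (x << (k & mask)) | (x >> (-k & mask));
}
// ---- end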
Forgot to mention: this is when using clang-cl; the test file was compiled via "clang-cl -c -O2 -FA clang_cl_rotl.cpp".
Should be easy enough to fix the emission code in CGBuiltin.cpp. But I think we may need some more work to recognize rotl8/rotl16. This doesn't work:

extern "C" unsigned char g(unsigned char a, int b)
{
    return (a << (b & 7)) | (a >> (-b & 7));
}

I think we need to teach InstCombiner::narrowRotate about this form.
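For completeness, the 16-bit sibling of that example presumably hits the same limitation (this is an assumption, not something verified in the thread; the function name h is hypothetical):

// ---- begin (sketch)
extern "C" unsigned short h(unsigned short a, int b)
{
    // Same masked-shift rotate idiom at 16 bits; ideally this would also be
    // recognized (e.g. narrowed by InstCombiner::narrowRotate) and lowered to
    // a 16-bit rotate on targets that have one.
    return (unsigned short)((a << (b & 15)) | (a >> (-b & 15)));
}
// ---- end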
IMO, rotates are up there with min/max for: "Is it too late to just add an IR intrinsic for this?" :) If we can solve the original problem in clang, that would be great.
Patch for the intrinsic issue is here: https://reviews.llvm.org/D46656

We still have problems in the middle end with short and char rotates written directly in C. I may file a separate bug for that.
CGBuiltin.cpp changed in r331943
At this point, I'd favor an intrinsic for the rotates. The likelihood that someone actually cares whether SimplifyDemandedBits works on a rotate is close to zero. It's far more likely that they'll open more bugs when our "optimizations" break backend pattern matching.
Speaking as a somewhat frustrated user: yes, an intrinsic would help.

Specifically, the problem with pattern-matching these relatively intricate trees deep in the back-end is that a lot of stuff happens between the code the user writes and said pattern-matching, and there are myriad ways for things to go awry in the meantime. For a recent example derived from the exact same code that led to this bug report, see https://bugs.llvm.org/show_bug.cgi?id=35487.

I just looked at the generated code with the replacement _rotl64 (function "g" in the above snippet), and noticed that while g fixes my toy repro, I *still* get poor code in the actual use case. Here's a slightly more complicated repro that illustrates the issue I'm seeing in my real use case:

// ---- begin
#include <stdint.h>
#include <string.h>

static uint64_t rotl64(uint64_t a, int b)
{
    return (a << (b & 63)) | (a >> (-b & 63));
}

uint64_t hash(const void *p, int rot_amount)
{
    static const uint64_t mult = 0x6C7F3EDD8298DF37; // random 64-bit prime
    uint32_t t;
    memcpy(&t, p, sizeof(t));
    uint64_t prod = (uint64_t)t * mult;
    return prod + rotl64(prod, rot_amount);
}

void bulk_insert(uint32_t *hash_table, uint32_t mask, int rot, const uint8_t *p, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        hash_table[hash(p + i, rot) & mask] = (uint32_t)i;
}
// ---- end

This is all for a simple multiplicative hash: grab 4 bytes (in this case), multiply by some random constant to mix the bits, then build the hash from both the top and bottom bits of the product. "rot_amount" here is chosen outside to suit a (variable) pow2 hash table size.

Again with clang-cl 6.0 FINAL, "clang-cl -c -O2 -FA hash_problem.cpp", I get (just the relevant bits):

# ---- begin
"?hash@@YA_KPEBXH@Z":                   # @"\01?hash@@YA_KPEBXH@Z"
# %bb.0:
        movl    (%rcx), %eax
        movabsq $7818036599238221623, %r8 # imm = 0x6C7F3EDD8298DF37
        imulq   %rax, %r8
        movq    %r8, %rax
        movl    %edx, %ecx
        rolq    %cl, %rax
        addq    %r8, %rax
        retq
# ---- end

So far, so good. But then if I look at bulk_insert (SEH annotations elided to reduce noise):

# ---- begin
"?bulk_insert@@YAXPEAIIHPEBE_K@Z":      # @"\01?bulk_insert@@YAXPEAIIHPEBE_K@Z"
# %bb.0:
        pushq   %r15
        pushq   %r14
        pushq   %r12
        pushq   %rsi
        pushq   %rdi
        pushq   %rbx
        movl    %r8d, %r10d
        movq    %rcx, %r8
        movq    88(%rsp), %r15
        testq   %r15, %r15
        je      .LBB1_5
# %bb.1:
        movabsq $7818036599238221623, %r11 # imm = 0x6C7F3EDD8298DF37
        movl    %r10d, %eax
        andl    $63, %eax
        negl    %r10d
        andl    $63, %r10d
        movl    %edx, %r12d
        movl    %r15d, %r14d
        andl    $1, %r14d
        cmpq    $1, %r15
        jne     .LBB1_6
# %bb.2:
        xorl    %esi, %esi
        testq   %r14, %r14
        jne     .LBB1_4
        jmp     .LBB1_5
.LBB1_6:
        subq    %r14, %r15
        xorl    %esi, %esi
        .p2align        4, 0x90
.LBB1_7:                                # =>This Inner Loop Header: Depth=1
        movl    (%r9,%rsi), %ebx
        imulq   %r11, %rbx
        movq    %rbx, %rdi
        movl    %eax, %ecx
        shlq    %cl, %rdi
        movq    %rbx, %rdx
        movl    %r10d, %ecx
        shrq    %cl, %rdx
        orl     %edx, %edi
        addl    %edi, %ebx
        andq    %r12, %rbx
        movl    %esi, (%r8,%rbx,4)
        movl    1(%r9,%rsi), %edx
        imulq   %r11, %rdx
        movq    %rdx, %rdi
        movl    %eax, %ecx
        shlq    %cl, %rdi
        movq    %rdx, %rbx
        movl    %r10d, %ecx
        shrq    %cl, %rbx
        orl     %ebx, %edi
        addl    %edi, %edx
        andq    %r12, %rdx
        leal    1(%rsi), %ecx
        movl    %ecx, (%r8,%rdx,4)
        addq    $2, %rsi
        cmpq    %rsi, %r15
        jne     .LBB1_7
# %bb.3:
        testq   %r14, %r14
        je      .LBB1_5
.LBB1_4:
        movl    (%r9,%rsi), %ebx
        imulq   %r11, %rbx
        movq    %rbx, %rdi
        movl    %eax, %ecx
        shlq    %cl, %rdi
        movq    %rbx, %rax
        movl    %r10d, %ecx
        shrq    %cl, %rax
        orl     %eax, %edi
        addl    %edi, %ebx
        andq    %r12, %rbx
        movl    %esi, (%r8,%rbx,4)
.LBB1_5:
        popq    %rbx
        popq    %rdi
        popq    %rsi
        popq    %r12
        popq    %r14
        popq    %r15
        retq
# ---- end

The computations of "rot_amount & 63" and "-rot_amount & 63" are identified as loop-invariant and hoisted, and apparently this makes the back-end pattern matching fail to recognize the rotates again, so now I get

        movq    %rbx, %rdi
        movl    %eax, %ecx
        shlq    %cl, %rdi
        movq    %rbx, %rax
        movl    %r10d, %ecx
        shrq    %cl, %rax
        orl     %eax, %edi

instead of the expected

        movq    %rbx, %rdi
        movl    %eax, %ecx
        rolq    %cl, %rdi

And this is still attempting to be a small test case; the actual code I care about is more complicated, and presumably contains even more latent ways for this to go wrong. (A source-level sketch of what the hoisted loop effectively looks like follows this comment.)

As far as I see it, there are basically two options here:
1. keep tweaking the frontend sequences and backend DAG pattern matching for every newly-discovered instance of this, until eventually we run out of known failure modes and declare success? :)
2. add rotate intrinsics, and then there's no risk of other optimizations reshaping the DAG into something the pattern rules don't recognize.

Furthermore, if you're going to rely on pattern matching to recognize rotates, I think the earlier in compilation (and the closer to the original source code) it happens, the better. If I know that the expression for "g" in my original report reliably turns into a rotate-left (when available), that's fine and I can work with that. But there's no way for me to write code in a source language such that I know the resulting DAG will have the rotate patterns I want.
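To make the failure mode above concrete, here is a source-level picture (an illustration only, not compiler output; the function name bulk_insert_after_licm is hypothetical) of what the loop effectively looks like after the two shift amounts are hoisted:

// ---- begin (illustration only)
#include <stdint.h>
#include <string.h>

void bulk_insert_after_licm(uint32_t *hash_table, uint32_t mask, int rot,
                            const uint8_t *p, size_t count)
{
    const uint64_t mult = 0x6C7F3EDD8298DF37;
    const int sh_l = rot & 63;    // hoisted out of the loop
    const int sh_r = -rot & 63;   // hoisted too, now unrelated to sh_l as far as the backend can tell
    for (size_t i = 0; i < count; ++i) {
        uint32_t t;
        memcpy(&t, p + i, sizeof(t));
        uint64_t prod = (uint64_t)t * mult;
        // Inside the loop body the backend sees a shl and an lshr by two
        // seemingly independent amounts plus an or, which its block-local
        // matcher no longer recognizes as a rotate.
        uint64_t h = prod + ((prod << sh_l) | (prod >> sh_r));
        hash_table[h & mask] = (uint32_t)i;
    }
}
// ---- end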
The current argument is that there's no compelling case for rotate IR intrinsics because the backend can always match the (or (shl), (shr (sub))) sequence in its various forms. Here's the first search hit I got (2012) on this topic:
https://groups.google.com/forum/#!topic/llvm-dev/9ZreOEi2znI
(cc'ing Eli and Chris for thoughts because they replied in that thread)

But the example given in comment 7 - where we hoist the loop-invariant opposite shift amount out of a loop - seems very compelling to me. Given the block-at-a-time limit of the current DAG, there is no way to preserve that pattern besides moving/duplicating instructions in IR (CodeGenPrepare).

Let me minimize the example to reduce distractions:

unsigned rotateRight(unsigned a, int b) {
  return (a >> b) | (a << (32 - b)); // unsafe, but minimal logic
}

void rotateInLoop(unsigned *x, unsigned N, unsigned *a, int b) {
  for (unsigned i = 0; i < N; ++i)
    x[ (a[i] >> b) | (a[i] << (32 - b)) ] = i; // shift amt is loop invariant
}

----------------------------------------------------------------------------

rotateRight is what we expect in IR and x86 asm:

$ ./clang -O2 ror.c -S -o - -fomit-frame-pointer -fno-unroll-loops -emit-llvm
...
  %shr = lshr i32 %a, %b
  %sub = sub nsw i32 32, %b
  %shl = shl i32 %a, %sub
  %or = or i32 %shl, %shr

$ ./clang -O2 ror.c -S -o - -fomit-frame-pointer -fno-unroll-loops | grep ror
        rorl    %cl, %edi

-----------------------------------------------------------------------------

rotateInLoop can't be matched by DAGCombiner::MatchRotate():

for.loop.preheader:
  %sub = sub nsw i32 32, %b              <--- can't see this in the DAG
  %wide.trip.count = zext i32 %N to i64
  br label %for.body

for.loop:
  ...
  %shr = lshr i32 %a, %b
  %shl = shl i32 %a, %sub
  %or = or i32 %shr, %shl                <--- that's not a rotate w/o seeing the sub

$ ./clang -O2 ror.c -S -o - -fomit-frame-pointer -fno-unroll-loops | grep sh
        shrq    %cl, %r11
        shll    %cl, %esi
I have not been involved with core LLVM development for some time, but here are 2c:

1) GlobalISel will eventually fix the "not being able to see across blocks" issue that you're referring to. In the meantime, if this is an important case, you can enhance CodeGenPrepare to handle it.

2) Adding intrinsics for things like rotates and mins and maxes has the benefit of allowing instcombine (and frontends) to form them aggressively and pattern match a bunch of things in a target-independent way. It has the downside that lots of other optimizations have to know about them (e.g. SimplifyDemandedBits) to avoid losing performance in other cases.

On balance, it seems like LLVM is doing the right thing here; we need to fix isel for LOTS of reasons, and this is underway. Doing the wrong thing here in the interim doesn't seem warranted.
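To make the SimplifyDemandedBits concern in point 2 concrete, here is a hypothetical example (mine, not from the thread; it assumes 32-bit unsigned and uses a constant rotate amount). With the rotate spelled out as shifts and an or, a demanded-bits pass can see that only the top byte of a feeds the result; an opaque rotate intrinsic would hide that unless the pass is taught its semantics:

// ---- begin (sketch)
// Only the low byte of the rotate result is used. In the expanded form, a
// demanded-bits analysis can prove that just bits 24..31 of 'a' matter here.
unsigned char low_byte_of_rotl(unsigned a) {
    return (unsigned char)((a << 8) | (a >> 24));   // rotl32(a, 8), constant amount
}
// ---- end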
(In reply to Chris Lattner from comment #9)
> I have not been involved with core LLVM development for some time, but here
> are 2c:
>
> 1) GlobalISel will eventually fix the "not being able to see across blocks"
> issue that you're referring to. In the meantime, if this is an important
> case, you can enhance CodeGenPrepare to handle it.
>
> 2) Adding intrinsics for things like rotates and mins and maxes has the
> benefit of allowing instcombine (and frontends) to form them aggressively
> and pattern match a bunch of things in a target-independent way. It has the
> downside that lots of other optimizations have to know about them (e.g.
> SimplifyDemandedBits) to avoid losing performance in other cases.
>
> On balance, it seems like LLVM is doing the right thing here; we need to fix
> isel for LOTS of reasons, and this is underway. Doing the wrong thing here
> in the interim doesn't seem warranted.

Thanks, Chris! I haven't been following the GlobalISel progress, so I don't know how far away that is.

So if no intrinsics, we need at least the following to improve rotate matching:
1. A CGP transform to un-hoist the sub in comment 8 (sketched below).
2. More logic in instcombine for the example in comment 2.

I think it's best if we open new bugs for those and link to other existing rotate bugs, so we have a better idea of how many variations of the patterns are out there.
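As a purely conceptual picture of item 1 (hypothetical function name; this is about where the sub lives in the IR, not something you would write in source, since LICM would immediately hoist it right back out), the idea is to duplicate the loop-invariant "32 - b" back into the same block as the shifts so DAGCombiner::MatchRotate can see the whole pattern:

// ---- begin (illustration only)
void rotateInLoop_after_cgp(unsigned *x, unsigned N, unsigned *a, int b) {
  for (unsigned i = 0; i < N; ++i) {
    int sub = 32 - b;                     // duplicated into the loop block by CGP
    x[(a[i] >> b) | (a[i] << sub)] = i;   // shr/shl/or and the sub are now in one block
  }
}
// ---- end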
At the risk of beating a dead horse...

1. There is a significant distinction between rotate-by-literal-constant and rotate-by-variable-amount here. Rotate-by-literal is easily turned into shifts and ORs, and less tricky to detect, because when the shift amounts are integer literals, there's no risk of them ending up in a different BB (or getting turned into a form the backend doesn't recognize by some other transform). Rotate-by-variable-amount is _not_ easy to express correctly using shifts, because the "obvious" (x << k) | (x >> (width - k)) runs into UB when k == 0. The non-constant expressions for the shift amounts run a higher risk of getting transformed in some way that breaks the pattern the DAG rules look for.

2. There are not that many use cases for variable-amount rotates, but they're widely supported in CPUs and generally efficient, so I'd like to use them as a primitive when they're applicable. Right now, if I want to reliably get variable-amount rotates in Clang, I have to use inline ASM wrapped in a bunch of platform #ifdefs (a sketch of such a wrapper follows this comment). Inline ASM doesn't participate in other optimizations either, to put it mildly. :)

3. Talking _specifically_ about variable-amount shifts, how many optimizations actually touch these to begin with? SimplifyDemandedBits was mentioned, but the only transform I could find in there that does not require a constant shift amount was that "the MSB of the result of an arithmetic shift is the same as the MSB of the original input".

It seems to me that the variable-amount form would be better served by an intrinsic, while constant-amount rotates benefit from the work done to optimize regular shifts and bitwise logic operations, and it's undesirable to have another class of shift operation to handle in a bunch of optimization passes. Maybe the best of both worlds is to use an intrinsic for variable rotates, but turn rotates-by-constant into shifts and ORs, which other passes know how to work on?
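For illustration, a sketch of the kind of wrapper alluded to in point 2 (the helper name rotl64_var and the particular set of #ifdefs are assumptions, not taken from the author's real code):

// ---- begin (sketch)
#include <stdint.h>
#if defined(_MSC_VER)
  #include <intrin.h>
#endif

static inline uint64_t rotl64_var(uint64_t x, int k) {
#if defined(_MSC_VER)
    return _rotl64(x, k);                           // MSVC/clang-cl intrinsic
#elif defined(__GNUC__) && defined(__x86_64__)
    __asm__("rolq %%cl, %0" : "+r"(x) : "c"(k));    // pin the rotate to a rol
    return x;
#else
    return (x << (k & 63)) | (x >> (-k & 63));      // hope the backend matches it
#endif
}
// ---- end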
Fair enough; I haven't thought much about the variable-amount case. I agree that it is significantly different. If you'd like to push for it, please bring it up on llvm-dev, thanks!
Since the original issue has been fixed, should this be closed or turned into an "implement intrinsic" feature request?
We already have added the intrinsic. It's in clang as:
__builtin_rotateleft8
__builtin_rotateleft16
__builtin_rotateleft32
__builtin_rotateleft64
__builtin_rotateright8
__builtin_rotateright16
__builtin_rotateright32
__builtin_rotateright64
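For example, the original g from this report could be written with the new builtin like this (a usage sketch, assuming clang's documented signature where both arguments are 64-bit unsigned; the explicit "& 63" simply mirrors the original code):

// ---- begin (sketch)
extern "C" unsigned long long g(unsigned long long a, int b) {
    return __builtin_rotateleft64(a, (unsigned long long)(b & 63));
}
// ---- end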
It seems that Clang still codegens the old sequence for `_rot{l,r}` builtins rather than these new LLVM intrinsics, as can be seen in lib/CodeGen/CGBuiltin.cpp at lines 1822-1860.
(In reply to simonas+llvm.org from comment #15)
> It seems that Clang still codegens the old sequence for `_rot{l,r}` builtins
> rather than these new LLVM intrinsics, as can be seen in
> lib/CodeGen/CGBuiltin.cpp at lines 1822-1860.

Changing those was in the approved patch:
https://reviews.llvm.org/D50924
...but it caused a compiler crash on a Linux bot using gcc as the host compiler. I asked for help/policy on working around that, but I got no replies.
(In reply to simonas+llvm.org from comment #15)
> It seems that Clang still codegens the old sequence for `_rot{l,r}` builtins
> rather than these new LLVM intrinsics, as can be seen in
> lib/CodeGen/CGBuiltin.cpp at lines 1822-1860.

I don't see any bot failures caused by the clang change this time, so this should be fixed with:
https://reviews.llvm.org/rL347527

Note that there is still ongoing work to optimize/canonicalize funnel-shift and associated patterns. See related bugs to track those.