LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filed before November 26, 2021. Use GitHub to submit new LLVM bugs.

Bug 37387 - MSVC rotate intrinsics don't (just) generate rotates on x86-64
Summary: MSVC rotate intrinsics don't (just) generate rotates on x86-64
Status: RESOLVED FIXED
Alias: None
Product: clang
Classification: Unclassified
Component: LLVM Codegen
Version: 6.0
Hardware: PC Windows NT
Importance: P normal
Assignee: Unassigned Clang Bugs
 
Reported: 2018-05-08 21:09 PDT by Fabian Giesen
Modified: 2018-11-26 06:35 PST
CC List: 10 users

Description Fabian Giesen 2018-05-08 21:09:07 PDT
This simple test:

// ---- begin
#include <intrin.h>

extern "C" unsigned long long f(unsigned long long a, int b)
{
    return _rotl64(a, b);
}

extern "C" unsigned long long g(unsigned long long a, int b)
{
    return (a << (b & 63)) | (a >> (-b & 63));
}
// ---- end

produces (on x86-64 using clang 6.0 release; only quoting the relevant bits):

# ---- begin
f:                                      # @f
# %bb.0:
	movq	%rcx, %r8
	andl	$63, %edx
	movq	%r8, %rax
	movl	%edx, %ecx
	rolq	%cl, %rax
	testl	%edx, %edx
	cmoveq	%r8, %rax
	retq

g:                                      # @g
# %bb.0:
	movq	%rcx, %rax
	movl	%edx, %ecx
	rolq	%cl, %rax
	retq
# ---- end

The corresponding IR is:

; ---- begin
; Function Attrs: norecurse nounwind readnone sspstrong uwtable
define i64 @f(i64, i32) local_unnamed_addr #0 {
  %3 = and i32 %1, 63
  %4 = zext i32 %3 to i64
  %5 = sub nsw i64 64, %4
  %6 = shl i64 %0, %4
  %7 = lshr i64 %0, %5
  %8 = or i64 %7, %6
  %9 = icmp eq i32 %3, 0
  %10 = select i1 %9, i64 %0, i64 %8
  ret i64 %10
}

; Function Attrs: norecurse nounwind readnone sspstrong uwtable
define i64 @g(i64, i32) local_unnamed_addr #0 {
  %3 = and i32 %1, 63
  %4 = zext i32 %3 to i64
  %5 = shl i64 %0, %4
  %6 = sub nsw i32 0, %1
  %7 = and i32 %6, 63
  %8 = zext i32 %7 to i64
  %9 = lshr i64 %0, %8
  %10 = or i64 %5, %9
  ret i64 %10
}

; ---- end

The problem is the expansion chosen for the rotr/rotl intrinsics in CGBuiltin.cpp's CodeGenFunction::EmitBuiltinExpr, presumably picked to avoid the undefined behavior of the right shift by 64-b when b is 0.

Note that the alternative expansion for rotate-left given in the code for g avoids the problematic select, is well-defined, and already gets matched to ROL (in the x86-64 backend anyway), so it seems like a good alternative.
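
For illustration, here is a minimal hand-written sketch of the two expansions (the function names are made up, and this approximates the IR above rather than quoting the actual CGBuiltin.cpp code):

// ---- begin
#include <stdint.h>

// Current expansion: the right shift by 64-k would be a shift by the full
// width when k == 0, so a select guarding that case is needed to stay
// well-defined -- that's the test/cmove pair in f above.
static uint64_t rotl64_with_select(uint64_t a, int b)
{
    unsigned k = (unsigned)b & 63;
    return k ? (a << k) | (a >> (64 - k)) : a;
}

// Alternative expansion (function g above): both shift amounts are masked
// into 0..63, so no guard is needed and the backend already matches it to ROL.
static uint64_t rotl64_masked(uint64_t a, int b)
{
    return (a << (b & 63)) | (a >> (-b & 63));
}
// ---- end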
Comment 1 Fabian Giesen 2018-05-08 21:12:49 PDT
Forgot to mention: this is when using clang-cl; the test file was compiled via "clang-cl -c -O2 -FA clang_cl_rotl.cpp".
Comment 2 Craig Topper 2018-05-08 23:30:20 PDT
Should be easy enough to fix the emission code in CGBuiltin.cpp.

But I think we may need some more work to recognize rotl8/rotl16.

This doesn't work:

extern "C" unsigned char g(unsigned char a, int b)
{
    return (a << (b & 7)) | (a >> (-b & 7));
}

I think we need to teach InstCombiner::narrowRotate about this form.
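
To illustrate why the narrow case is harder (a hand-written sketch, not actual compiler output): integer promotion means the shifts and the OR happen at 32 bits, so the would-be 8-bit rotate reaches the optimizer wrapped in a widen/truncate pair, roughly:

static unsigned char rotl8_promoted(unsigned char a, int b)
{
    unsigned int wide = a;                  // integer promotion to int width
    unsigned int hi   = wide << (b & 7);
    unsigned int lo   = wide >> (-b & 7);
    return (unsigned char)(hi | lo);        // truncate back to 8 bits
}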
Comment 3 Sanjay Patel 2018-05-09 07:14:02 PDT
IMO, rotates are up there with min/max for: "Is it too late to just add an IR intrinsic for this?" :)

If we can solve the original problem in clang, that would be great.
Comment 4 Craig Topper 2018-05-09 12:01:54 PDT
Patch for the intrinsic issue is here: https://reviews.llvm.org/D46656

We still have problems in the middle end with short and char rotates written directly in C. I may file a separate bug for that.
Comment 5 Craig Topper 2018-05-09 22:23:59 PDT
CGBuiltin.cpp changed in r331943
Comment 6 Reid Kleckner 2018-05-10 14:46:14 PDT
At this point, I'd favor an intrinsic for the rotates. The likelihood that someone actually cares whether SimplifyDemandedBits works on rotates is close to zero. It's far more likely that they'll open more bugs when our "optimizations" break backend pattern matching.
Comment 7 Fabian Giesen 2018-05-10 15:58:32 PDT
Speaking as a somewhat frustrated user: yes, an intrinsic would help.

Specifically, the problem with pattern-matching these relatively intricate trees deep in the back-end is that a lot happens between the code the user writes and said pattern-matching, and there are myriad ways for things to go awry in the meantime.

For a recent example derived from the exact same code that led to this bug report, see https://bugs.llvm.org/show_bug.cgi?id=35487.

I just looked at the generated code with the replacement _rotl64 (function "g" in the above snippet), and noticed that while g fixes my toy repro, I *still* get poor code in the actual use case.

Here's a slightly more complicated repro that illustrates the issue I'm seeing in my real use case:

// ---- begin
#include <stdint.h>
#include <string.h>

static uint64_t rotl64(uint64_t a, int b)
{
    return (a << (b & 63)) | (a >> (-b & 63));
}

uint64_t hash(const void *p, int rot_amount)
{
    static const uint64_t mult = 0x6C7F3EDD8298DF37; // random 64-bit prime
    uint32_t t;

    memcpy(&t, p, sizeof(t));
    uint64_t prod = (uint64_t)t * mult;
    return prod + rotl64(prod, rot_amount);
}

void bulk_insert(uint32_t *hash_table, uint32_t mask, int rot, const uint8_t *p, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        hash_table[hash(p + i, rot) & mask] = (uint32_t)i;
}
// ---- end

This is all for a simple multiplicative hash: grab 4 bytes (in this case), multiply by some random constant to mix the bits, then build the hash from both the top and bottom bits of the product. "rot_amount" here is chosen outside to suit a (variable) pow2 hash table size.
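
For context, a hypothetical caller (my assumption about the intended setup, not code from the real project): with a hash table of 1 << k slots, rot_amount would be k, so the rotate folds the product's top k bits down into the low k bits that survive the mask.

// ---- begin (continuation of the repro above; k, table, insert_all are invented)
enum { k = 16 };
static uint32_t table[1u << k];

void insert_all(const uint8_t *data, size_t n)
{
    bulk_insert(table, (1u << k) - 1, k, data, n);
}
// ---- end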

Again with clang-cl 6.0 FINAL, "clang-cl -c -O2 -FA hash_problem.cpp", I get: (just the relevant bits)

# ---- begin
"?hash@@YA_KPEBXH@Z":                   # @"\01?hash@@YA_KPEBXH@Z"
# %bb.0:
	movl	(%rcx), %eax
	movabsq	$7818036599238221623, %r8 # imm = 0x6C7F3EDD8298DF37
	imulq	%rax, %r8
	movq	%r8, %rax
	movl	%edx, %ecx
	rolq	%cl, %rax
	addq	%r8, %rax
	retq
# ---- end

So far, so good. But then if I look at bulk_insert (SEH annotations elided to reduce noise):

# ---- begin
"?bulk_insert@@YAXPEAIIHPEBE_K@Z":      # @"\01?bulk_insert@@YAXPEAIIHPEBE_K@Z"
# %bb.0:
	pushq	%r15
	pushq	%r14
	pushq	%r12
	pushq	%rsi
	pushq	%rdi
	pushq	%rbx
	movl	%r8d, %r10d
	movq	%rcx, %r8
	movq	88(%rsp), %r15
	testq	%r15, %r15
	je	.LBB1_5
# %bb.1:
	movabsq	$7818036599238221623, %r11 # imm = 0x6C7F3EDD8298DF37
	movl	%r10d, %eax
	andl	$63, %eax
	negl	%r10d
	andl	$63, %r10d
	movl	%edx, %r12d
	movl	%r15d, %r14d
	andl	$1, %r14d
	cmpq	$1, %r15
	jne	.LBB1_6
# %bb.2:
	xorl	%esi, %esi
	testq	%r14, %r14
	jne	.LBB1_4
	jmp	.LBB1_5
.LBB1_6:
	subq	%r14, %r15
	xorl	%esi, %esi
	.p2align	4, 0x90
.LBB1_7:                                # =>This Inner Loop Header: Depth=1
	movl	(%r9,%rsi), %ebx
	imulq	%r11, %rbx
	movq	%rbx, %rdi
	movl	%eax, %ecx
	shlq	%cl, %rdi
	movq	%rbx, %rdx
	movl	%r10d, %ecx
	shrq	%cl, %rdx
	orl	%edx, %edi
	addl	%edi, %ebx
	andq	%r12, %rbx
	movl	%esi, (%r8,%rbx,4)
	movl	1(%r9,%rsi), %edx
	imulq	%r11, %rdx
	movq	%rdx, %rdi
	movl	%eax, %ecx
	shlq	%cl, %rdi
	movq	%rdx, %rbx
	movl	%r10d, %ecx
	shrq	%cl, %rbx
	orl	%ebx, %edi
	addl	%edi, %edx
	andq	%r12, %rdx
	leal	1(%rsi), %ecx
	movl	%ecx, (%r8,%rdx,4)
	addq	$2, %rsi
	cmpq	%rsi, %r15
	jne	.LBB1_7
# %bb.3:
	testq	%r14, %r14
	je	.LBB1_5
.LBB1_4:
	movl	(%r9,%rsi), %ebx
	imulq	%r11, %rbx
	movq	%rbx, %rdi
	movl	%eax, %ecx
	shlq	%cl, %rdi
	movq	%rbx, %rax
	movl	%r10d, %ecx
	shrq	%cl, %rax
	orl	%eax, %edi
	addl	%edi, %ebx
	andq	%r12, %rbx
	movl	%esi, (%r8,%rbx,4)
.LBB1_5:
	popq	%rbx
	popq	%rdi
	popq	%rsi
	popq	%r12
	popq	%r14
	popq	%r15
	retq
# ---- end

The computations of "rot_amount & 63" and "-rot_amount & 63" are identified as loop-invariant and hoisted, and this apparently makes the back-end pattern matching fail to recognize the rotates again, so now I get

	movq	%rbx, %rdi
	movl	%eax, %ecx
	shlq	%cl, %rdi
	movq	%rbx, %rax
	movl	%r10d, %ecx
	shrq	%cl, %rax
	orl	%eax, %edi

instead of the expected

        movq    %rbx, %rdi
        movl    %eax, %ecx
        rolq    %cl, %rdi
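
For reference, here is a hand-written sketch (not compiler output, with the load-and-multiply folded into a pre-hashed input array to keep it short) of roughly what bulk_insert looks like after the loop-invariant code motion that breaks the match:

// ---- begin
#include <stdint.h>
#include <stddef.h>

// The masked shift amounts are computed once outside the loop, so the
// shl/shr/or left in the body is no longer a single-block pattern the DAG
// combiner can prove is a rotate.
void bulk_insert_after_licm(uint32_t *hash_table, uint32_t mask, int rot,
                            const uint64_t *prods, size_t count)
{
    int amt_l = rot & 63;     // hoisted: loop-invariant
    int amt_r = -rot & 63;    // hoisted: loop-invariant
    for (size_t i = 0; i < count; ++i) {
        uint64_t prod = prods[i];
        uint64_t h = prod + ((prod << amt_l) | (prod >> amt_r));
        hash_table[h & mask] = (uint32_t)i;
    }
}
// ---- end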

And this is still attempting to be a small test case; the actual code I care about is more complicated, and presumably contains even more latent ways for this to go wrong.

As far as I see it, there are basically two options here:
1. Keep tweaking the frontend sequences and backend DAG pattern matching for every newly-discovered instance of this, until eventually we run out of known failure modes and declare success? :)
2. Add rotate intrinsics, so there's no risk of other optimizations reshaping the DAG into something the pattern rules don't recognize.

Furthermore, if you're going to rely on pattern matching to recognize rotates, I think the earlier in compilation (and the closer to the original source code) it happens, the better. If I know that the expression for "g" in my original report reliably turns into a rotate-left (when available), that's fine and I can work with that. But there's no way for me to write code in a source language such that I know the resulting DAG will have the rotate patterns I want.
Comment 8 Sanjay Patel 2018-05-11 08:19:59 PDT
The current argument is that there's no compelling case for rotate IR intrinsics because the backend can always match the (or (shl), (shr (sub))) sequence in its various forms.

Here's the first search hit I got (2012) on this topic:
https://groups.google.com/forum/#!topic/llvm-dev/9ZreOEi2znI
(cc'ing Eli and Chris for thoughts because they replied in that thread)

But the example given in comment 7 - where the loop-invariant opposite shift amount gets hoisted out of the loop - seems very compelling to me. Given the block-at-a-time limit of the current DAG, there is no way to preserve that pattern besides moving/duplicating instructions in IR (CodeGenPrepare).

Let me minimize the example to reduce distractions:

unsigned rotateRight(unsigned a, int b) {
  return (a >> b) | (a << (32 - b));  // unsafe, but minimal logic
}

void rotateInLoop(unsigned *x, unsigned N, unsigned *a, int b) {
  for (unsigned i = 0; i < N; ++i)
    x[ (a[i] >> b) | (a[i] << (32 - b)) ] = i; // shift amt is loop invariant
}

----------------------------------------------------------------------------

rotateRight is what we expect in IR and x86 asm:

$ ./clang -O2 ror.c -S -o - -fomit-frame-pointer -fno-unroll-loops -emit-llvm
...
  %shr = lshr i32 %a, %b
  %sub = sub nsw i32 32, %b
  %shl = shl i32 %a, %sub
  %or = or i32 %shl, %shr

$ ./clang -O2 ror.c -S -o - -fomit-frame-pointer -fno-unroll-loops |grep ror
	rorl	%cl, %edi

-----------------------------------------------------------------------------

rotateInLoop can't be matched by DAGCombiner::MatchRotate():

for.loop.preheader:
  %sub = sub nsw i32 32, %b <--- can't see this in the DAG
  %wide.trip.count = zext i32 %N to i64
  br label %for.loop

for.loop:
...
  %shr = lshr i32 %a, %b
  %shl = shl i32 %a, %sub
  %or = or i32 %shr, %shl   <--- that's not a rotate w/o seeing the sub

$ ./clang -O2 ror.c -S -o - -fomit-frame-pointer -fno-unroll-loops | grep sh
	shrq	%cl, %r11
	shll	%cl, %esi
Comment 9 Chris Lattner 2018-05-11 08:34:05 PDT
I have not been involved with core LLVM development for some time, but here are 2c:

1) GlobalIsel will eventually fix the "not being able to see across blocks" issue that you're referring to.  In the meantime, if this is an important case, you can enhance codegen prepare to handle it.

2) Adding intrinsics for things like rotates and mins and maxes has the benefit of allowing instcombine (and frontends) to form them aggressively and pattern match a bunch of things in a target-independent way. It has the downside that lots of other optimizations then have to know about them (e.g. simplifydemandedbits) to avoid losing performance in other cases.

On balance, it seems like LLVM is doing the right thing here; we need to fix isel for LOTS of reasons, and that work is underway. Doing the wrong thing here in the interim doesn't seem warranted.
Comment 10 Sanjay Patel 2018-05-11 08:46:36 PDT
(In reply to Chris Lattner from comment #9)
> I have not been involved with core LLVM development for some time, but here
> are 2c:
> 
> 1) GlobalIsel will eventually fix the "not being able to see across blocks"
> issue that you're referring to.  In the meantime, if this is an important
> case, you can enhance codegen prepare to handle it.
> 
> 2) Adding intrinsics for things like rotates and mins and maxes has the
> benefit of allowing instcombine (and frontends) to form them aggressively
> and pattern match a bunch of things in a target-independent way. It has the
> downside that lots of other optimizations then have to know about them (e.g.
> simplifydemandedbits) to avoid losing performance in other cases.
> 
> On balance, it seems like LLVM is doing the right thing here; we need to fix
> isel for LOTS of reasons, and that work is underway. Doing the wrong thing
> here in the interim doesn't seem warranted.

Thanks, Chris! I haven't been following the GlobalIsel progress, so I don't know how far away that is.

So if no intrinsics, we need at least the following to improve rotate matching:
1. CGP transform to un-hoist the sub in comment 8.
2. More logic in instcombine for the example in comment 2.

I think it's best if we open new bugs for those and link to other existing rotate bugs, so we have a better idea of how many variations of the patterns are out there.
Comment 11 Fabian Giesen 2018-05-11 12:07:20 PDT
At the risk of beating a dead horse...

1. There is a significant distinction between rotate-by-literal-constant and rotate-by-variable-amount here. Rotate-by-literal is easily turned into shifting and ORs, and less tricky to detect, because when the shift amounts are integer literals, there's no risk of them ending up in a different BB (or getting turned into a form the backend doesn't recognize by some other transform).

Rotate-by-variable-amount is _not_ easy to express correctly using shifts because the "obvious" (x << k) | (x >> (width - k)) runs into UB when k==0. The non-constant expressions for the shift amounts run a higher risk of getting transformed in some way that breaks the pattern the DAG rules look for.

2. There are not that many use cases for variable-amount rotates, but they're widely supported in CPUs and generally efficient, so I'd like to use them as a primitive when they're applicable.

Right now, if I want to actually get variable-amount rotates reliably in Clang, I have to use inline ASM wrapped in a bunch of platform #ifdefs. Inline ASM doesn't participate in other optimizations either, to put it mildly. :)
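
(For concreteness, the kind of wrapper I mean looks roughly like this; the #ifdef conditions and the inline-asm fallback are illustrative sketches, not code quoted from anywhere.)

// ---- begin
#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

static inline uint64_t rotl64_reliable(uint64_t a, int b)
{
#if defined(_MSC_VER)
    return _rotl64(a, b);                          // MSVC / clang-cl intrinsic
#elif defined(__GNUC__) && defined(__x86_64__)
    // Force the rotate instruction; opaque to every other optimization.
    __asm__("rolq %%cl, %0" : "+r"(a) : "c"((unsigned char)b) : "cc");
    return a;
#else
    return (a << (b & 63)) | (a >> (-b & 63));     // portable fallback
#endif
}
// ---- end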

3. Talking _specifically_ about variable-amount shifts, how many optimizations actually touch these to begin with? SimplifyDemandedBits was mentioned, but the only transform I could find in there that does not require a constant shift amount was that "the MSB of the result of an arithmetic shift is the same as the MSB of the original input".


It seems to me that the variable-amount form would be better served by an intrinsic, while constant-amount rotates benefit from the work already done to optimize regular shifts and bitwise logic operations, and it's undesirable to add another class of shift operation that a bunch of optimization passes would have to handle.

Maybe the best of both worlds is to use an intrinsic for variable rotates, but turn rotates-by-constant into shifts and ORs, which other passes already know how to work on?
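
In user-level terms, that proposal would look something like the sketch below (using the __builtin_rotateleft64 builtin that comes up later in this thread; this is an illustration of the idea, not a proposed implementation):

// ---- begin
#include <stdint.h>

// Sketch only: constant amounts stay as shifts and ORs that existing passes
// already understand; variable amounts go through a rotate builtin/intrinsic
// so later transforms can't break the pattern.
static inline uint64_t rotl64_split(uint64_t a, int b)
{
    if (__builtin_constant_p(b))
        return (a << (b & 63)) | (a >> (-b & 63));
    return __builtin_rotateleft64(a, (unsigned)b);
}
// ---- end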
Comment 12 Chris Lattner 2018-05-11 18:44:30 PDT
Fair enough, I haven't thought much about the variable size case.  I agree that it is significantly different.  If you'd like to push for it, please bring it up on llvm-dev, thanks!
Comment 13 Trass3r 2018-11-10 21:12:48 PST
Since the original issue has been fixed, should this be closed or turned into an "implement intrinsic" feature request?
Comment 14 Craig Topper 2018-11-10 22:17:05 PST
We've already added the intrinsic. It's exposed in clang as:

__builtin_rotateleft8
__builtin_rotateleft16
__builtin_rotateleft32
__builtin_rotateleft64
__builtin_rotateright8
__builtin_rotateright16
__builtin_rotateright32
__builtin_rotateright64
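
A minimal usage sketch (assuming a clang recent enough to provide these builtins; the rotate amount can be a variable and should map straight to rol/ror on x86-64):

#include <stdint.h>

uint64_t rot_left64(uint64_t a, unsigned b)
{
    return __builtin_rotateleft64(a, b);
}

uint8_t rot_right8(uint8_t a, unsigned b)
{
    return __builtin_rotateright8(a, (uint8_t)b);
}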
Comment 15 simonas+llvm.org 2018-11-10 23:57:21 PST
It seems that Clang still codegens the old sequence for `_rot{l,r}` builtins rather than these new LLVM intrinsics, as can be seen in lib/CodeGen/CGBuiltin.cpp at lines 1822-1860.
Comment 16 Sanjay Patel 2018-11-11 06:47:17 PST
(In reply to simonas+llvm.org from comment #15)
> It seems that Clang still codegens the old sequence for `_rot{l,r}` builtins
> rather than these new LLVM intrinsics, as can be seen in
> lib/CodeGen/CGBuiltin.cpp at lines 1822-1860.

Changing those was in the approved patch:
https://reviews.llvm.org/D50924

...but it caused a compiler crash on a Linux bot using gcc as the host compiler. I asked for help/policy on working around that, but I got no replies.
Comment 17 Sanjay Patel 2018-11-26 06:35:44 PST
(In reply to simonas+llvm.org from comment #15)
> It seems that Clang still codegens the old sequence for `_rot{l,r}` builtins
> rather than these new LLVM intrinsics, as can be seen in
> lib/CodeGen/CGBuiltin.cpp at lines 1822-1860.

I don't see any bot failures caused by the clang change this time, so this should be fixed with:
https://reviews.llvm.org/rL347527

Note that there is still ongoing work to optimize/canonicalize funnel-shift and associated patterns. See related bugs to track those.