LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filed before November 26, 2021. Use GitHub to submit new LLVM bugs.

Bug 27582 - ARM64: Backend should know about implicit masking of variable shift distances
Summary: ARM64: Backend should know about implicit masking of variable shift distances
Status: RESOLVED FIXED
Alias: None
Product: libraries
Classification: Unclassified
Component: Backend: AArch64
Version: 4.0
Hardware: PC Windows NT
Importance: P normal
Assignee: Unassigned LLVM Bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-04-29 20:10 PDT by Fabian Giesen
Modified: 2018-05-24 11:33 PDT
CC List: 5 users

See Also:
Fixed By Commit(s):


Attachments
Repro C file (167 bytes, text/plain)
2016-04-29 20:10 PDT, Fabian Giesen

Description Fabian Giesen 2016-04-29 20:10:03 PDT
Created attachment 16281
Repro C file

ARM64 ASRV, LSLV, LSRV, RORV are specified in the ARMv8 ARM as interpreting the shift distance "MOD datasize", that is, modulo 32 for 32-bit shifts and modulo 64 for 64-bit shifts.

Since C/C++ shifts by amounts greater than or equal to the operand width are undefined behavior, it is desirable that the (well-defined, portable) alternatives "x >> (y & 31)" (32-bit x) and "x >> (y & 63)" (64-bit x) not generate an extra "and", but as of Clang-3.8 they do.

Trivial reproducer is attached: "clang -O2 -c -S -target arm64-apple-darwin test.c".

The expected result is for f to generate "lsr w0, w0, w1; ret" and g "lsr x0, x0, x1; ret", but with Clang-3.8 explicit ANDs are present.
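
The attached test file is not reproduced in this archive; a minimal reconstruction consistent with the description above (the parameter types and exact layout are assumptions) would be:

#include <stdint.h>

/* 32-bit case: the mask should fold into ARM64 LSRV, which already
   interprets the shift amount MOD 32. */
uint32_t f(uint32_t x, uint32_t y) {
    return x >> (y & 31);
}

/* 64-bit case: LSRV on X registers interprets the amount MOD 64. */
uint64_t g(uint64_t x, uint64_t y) {
    return x >> (y & 63);
}

Compiling this with the command line above shows the extra "and" before each "lsr" instead of the single "lsr".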
Comment 1 Sanjay Patel 2016-05-01 12:41:14 PDT
The x86 backend handles this via pattern matching in isel - see "MaskedShiftAmountPats".

But I think this could be a target-independent InstCombine because the LangRef specifies undefined behavior for an oversized shift:
http://llvm.org/docs/LangRef.html#lshr-instruction

So:
define i32 @f(i32 %a, i32 %b) {
  %and = and i32 %b, 31
  %shr = lshr i32 %a, %and
  ret i32 %shr
}

Should become:
define i32 @f(i32 %a, i32 %b) {
  %shr = lshr i32 %a, %and
  ret i32 %shr
}
Comment 2 Sanjay Patel 2016-05-01 12:46:05 PDT
(In reply to comment #1)
> define i32 @f(i32 %a, i32 %b) {
>   %shr = lshr i32 %a, %and
>   ret i32 %shr
> }

Er:
define i32 @f(i32 %a, i32 %b) #0 {
  %shr = lshr i32 %a, %b   <--- if %b is over 31, it's UB in LLVM
  ret i32 %shr
}
Comment 3 Sanjay Patel 2016-05-02 11:15:38 PDT
Here's an InstCombine patch for demanded bits that should do what we want, but I think this exposes a logic bug somewhere else in InstCombine, so I need to fix that before proceeding:

Index: lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp
===================================================================
--- lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp	(revision 268229)
+++ lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp	(working copy)
@@ -234,6 +234,24 @@
   if (Depth == 0 && !V->hasOneUse())
     DemandedMask = APInt::getAllOnesValue(BitWidth);
 
+  // For any shift instruction, try to simplify the shifted amount operand based
+  // on the demanded bits subset that are valid. For all shifts, it is undefined
+  // behavior if the shift amount is greater than or equal to the number of bits
+  // in the shifted operand.
+  auto simplifyShiftAmt = [this](Instruction *I, APInt DemandedMask,
+                                 APInt &KnownZero, APInt &KnownOne,
+                                 unsigned Depth) {
+    assert(I->isShift() && "Shift instructions only");
+
+    // computeKnownBits() handles vector operands by taking the intersection
+    // of the elements, so we don't need to deal with that here.
+    uint32_t BitWidth = I->getType()->getScalarSizeInBits();
+    unsigned NumValidShiftBits = llvm::Log2_32_Ceil(BitWidth);
+    DemandedMask &= APInt::getLowBitsSet(BitWidth, NumValidShiftBits);
+    return SimplifyDemandedBits(I->getOperandUse(1), DemandedMask, KnownZero,
+                                KnownOne, Depth + 1);
+  };
+
   switch (I->getOpcode()) {
   default:
     computeKnownBits(I, KnownZero, KnownOne, Depth, CxtI);
@@ -583,6 +601,9 @@
       // low bits known zero.
       if (ShiftAmt)
         KnownZero |= APInt::getLowBitsSet(BitWidth, ShiftAmt);
+    } else {
+      if (simplifyShiftAmt(I, DemandedMask, RHSKnownZero, RHSKnownOne, Depth))
+        return I;
     }
     break;
   case Instruction::LShr:
@@ -609,6 +630,9 @@
         APInt HighBits(APInt::getHighBitsSet(BitWidth, ShiftAmt));
         KnownZero |= HighBits;  // high bits known zero.
       }
+    } else {
+      if (simplifyShiftAmt(I, DemandedMask, RHSKnownZero, RHSKnownOne, Depth))
+        return I;
     }
     break;
   case Instruction::AShr:
@@ -669,6 +693,9 @@
       } else if ((KnownOne & SignBit) != 0) { // New bits are known one.
         KnownOne |= HighBits;
       }
+    } else {
+      if (simplifyShiftAmt(I, DemandedMask, RHSKnownZero, RHSKnownOne, Depth))
+        return I;
     }
     break;
   case Instruction::SRem:
Comment 4 Sanjay Patel 2016-05-02 14:17:30 PDT
The patch in comment 3 has at least 2 bugs of its own, but I still think there's an external bug to chase down.
Comment 5 escha 2016-05-02 14:30:59 PDT
That is definitely not a valid instcombine because it changes program behavior. You're creating UB where there wasn't UB before.

You can only remove AND masking in target-specific SelectionDAG with target-specific nodes that are defined to have target-specific behavior on shift sizes >= bit width.
Comment 6 Sanjay Patel 2016-05-02 15:41:52 PDT
(In reply to comment #5)
> That is definitely not a valid instcombine because it changes program
> behavior. You're creating UB where there wasn't UB before.
> 
> You can only remove AND masking in target-specific SelectionDAG with
> target-specific nodes that are defined to have target-specific behavior on
> shift sizes >= bit width.

Yikes, yes. 

It does seem like a common enough case that there should be a DAGCombine that uses a target-hook for this. PPC, ARM, and x86 all have the same masking behavior in HW from what I see.
Comment 7 Fabian Giesen 2016-05-02 15:49:33 PDT
PPC has different behavior. See e.g.
https://www.ibm.com/support/knowledgecenter/ssw_aix_72/com.ibm.aix.alangref/idalangref_sraw_instrs.htm
or
https://www.ibm.com/support/knowledgecenter/ssw_aix_72/com.ibm.aix.alangref/idalangref_srd_instrs.htm

The official descriptions are specified in a fairly roundabout way, but it boils down to this:

uint32_t PPC_srw(uint32_t a, uint32_t b)
{
    b &= 63;
    return (b < 32) ? (a >> b) : 0;
}

uint64_t PPC_srd(uint64_t a, uint64_t b)
{
    b &= 127;
    return (b < 64) ? (a >> b) : 0;
}

which is quite different from what x86 and ARM64 do. I believe 32-bit ARM does something different still.
Comment 8 escha 2016-05-02 16:00:17 PDT
I believe the rules are:
x86: modulo bitsize
ARM32: modulo 256
PPC: modulo bitsize*2 (like Fabian pointed out)
x86 SSE: modulo 2^64 (I'm not kidding)

there is *definitely* no consistent behavior here.
Comment 9 Fabian Giesen 2016-05-02 16:07:30 PDT
(In reply to comment #8)
> I believe the rules are:
> x86: modulo bitsize
> ARM32: modulo 256
> PPC: modulo bitsize*2 (like Fabian pointed out)
> x86 SSE: modulo 2^64 (I'm not kidding)
> 
> there is *definitely* no consistent behavior here.

ARM NEON (both AArch64 and A32): shift by (int8_t) (dist % 256) - negative values turn left shifts into right shifts (there's no variable right shift instr)
PPC VMX: modulo bitsize

so not only is there a plethora of behaviors, everyone seems to have different rules between the scalar and vector parts of their ISA.
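
To make the scalar differences concrete, here is a rough C model of the effective 32-bit variable right-shift semantics summarized in comments 7 and 8 (a simplified sketch, not authoritative ISA pseudocode):

#include <stdint.h>

/* x86 SHR r32, cl and ARM64 LSRV (W registers): amount taken modulo 32. */
uint32_t shr32_x86_arm64(uint32_t a, uint32_t b) {
    return a >> (b & 31);
}

/* 32-bit ARM register-specified LSR: amount taken modulo 256;
   amounts of 32..255 shift everything out and yield 0. */
uint32_t shr32_arm32(uint32_t a, uint32_t b) {
    b &= 255;
    return (b < 32) ? (a >> b) : 0;
}

/* PPC srw: amount taken modulo 64 (2 * bitsize); 32..63 yield 0. */
uint32_t shr32_ppc(uint32_t a, uint32_t b) {
    b &= 63;
    return (b < 32) ? (a >> b) : 0;
}

Any fold of the masking therefore has to go through a target hook or target-specific nodes, as comment 5 argues.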
Comment 10 Sanjay Patel 2016-05-02 16:21:58 PDT
Unless I'm still out to lunch, there is something missing in the IR's known bits calcs:

define i32 @shift_amount_is_either_zero_or_bogus(i32 %a, i32 %b) {
  %and = and i32 %b, 1024
  %shl = shl i32 %a, %and
  ret i32 %shl
}

define i32 @shift_amount_is_known_bogus(i32 %a, i32 %b) {
  %or = or i32 %b, 1024
  %shl = shl i32 %a, %or
  ret i32 %shl
}

There's no IR optimization happening in those cases.
Comment 11 escha 2016-05-02 19:47:22 PDT
SimplifyDemandedUseBits doesn't seem to have any cases for the RHS of shifts, which might be related.
Comment 12 Fabian Giesen 2017-02-02 17:58:57 PST
Ping! Related issue: "a >> (64 - b)"

define i64 @f(i64 %a, i64 %b) {
  %t0 = sub i64 64, %b
  %t1 = lshr i64 %a, %t0
  ret i64 %t1
}

produces

    orr     w8, wzr, #0x40
    sub     x8, x8, x1
    lsr     x0, x0, x8

when it could instead be

    neg     x8, x1
    lsr     x0, x0, x8

In this case, it saves an instruction. If there are multiple such shifts in a row, the literal 64 is only materialized once, but it still (unnecessarily) takes up a register.
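
The neg form is legal for the same reason as in the original report: LSRV only reads the low 6 bits of the amount, and (64 - b) and (0 - b) are congruent modulo 64. A small stand-alone C check of that identity (the helper name lsrv64 is purely illustrative):

#include <assert.h>
#include <stdint.h>

/* Model of ARM64 LSRV on X registers: the amount is used MOD 64. */
static uint64_t lsrv64(uint64_t a, uint64_t b) {
    return a >> (b & 63);
}

int main(void) {
    uint64_t a = 0x0123456789abcdefULL;
    /* (64 - b) and (0 - b) agree modulo 64, so the hardware shift sees
       the same amount either way and materializing the constant 64
       (the "orr w8, wzr, #0x40") is unnecessary. */
    for (uint64_t b = 0; b < 64; ++b)
        assert(lsrv64(a, 64 - b) == lsrv64(a, 0 - b));
    return 0;
}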
Comment 13 Sanjay Patel 2017-02-02 18:45:15 PST
(In reply to comment #10)
> Unless I'm still out to lunch, there is something missing in the IR's known
> bits calcs:
> 
> define i32 @shift_amount_is_either_zero_or_bogus(i32 %a, i32 %b) {
>   %and = and i32 %b, 1024
>   %shl = shl i32 %a, %and
>   ret i32 %shl
> }
> 
> define i32 @shift_amount_is_known_bogus(i32 %a, i32 %b) {
>   %or = or i32 %b, 1024
>   %shl = shl i32 %a, %or
>   ret i32 %shl
> }
> 
> There's no IR optimization happening in those cases.

I don't remember exactly when, but we get those cases now:

$ ./opt -instcombine 27582.ll -S
...
define i32 @shift_amount_is_either_zero_or_bogus(i32 %a, i32 %b) {
  ret i32 %a
}

define i32 @shift_amount_is_known_bogus(i32 %a, i32 %b) {
  ret i32 undef
}
Comment 14 Geoff Berry 2018-05-14 13:10:39 PDT
The change below should address the original issue on aarch64:

D46844: [AArch64] Take advantage of variable shift/rotate amount implicit mod operation.
https://reviews.llvm.org/D46844
Comment 15 Geoff Berry 2018-05-24 11:33:47 PDT
Original issue resolved by commit https://reviews.llvm.org/rL333214