Created attachment 16281 [details]
Repro C file

The ARM64 ASRV, LSLV, LSRV, and RORV instructions are specified in the ARMv8 ARM as interpreting the shift distance "MOD datasize", that is, modulo 32 for 32-bit shifts and modulo 64 for 64-bit shifts. Since C/C++ shifts of the register width or more are UB, it is desirable that the (well-defined, portable) alternatives "x >> (y & 31)" (32-bit x) and "x >> (y & 63)" (64-bit x) not generate an extra "and", but as of Clang-3.8 they do.

A trivial reproducer is attached: "clang -O2 -c -S -target arm64-apple-darwin test.c". The expected result is for f to generate "lsr w0, w0, w1; ret" and for g "lsr x0, x0, x1; ret", but with Clang-3.8 explicit ANDs are present.
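[Editor's note: the attached test.c is not inlined in the report; a minimal sketch of what f and g presumably look like, inferred from the description above (names and exact form assumed):]

#include <stdint.h>

/* 32-bit case: the mask keeps the shift amount in 0..31, so this is always defined. */
uint32_t f(uint32_t x, uint32_t y) {
  return x >> (y & 31);
}

/* 64-bit case: the mask keeps the shift amount in 0..63. */
uint64_t g(uint64_t x, uint64_t y) {
  return x >> (y & 63);
}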
The x86 backend handles this via pattern matching in isel - see "MaskedShiftAmountPats". But I think this could be a target-independent InstCombine because the LangRef specifies undefined behavior for an oversized shift:
http://llvm.org/docs/LangRef.html#lshr-instruction

So:

define i32 @f(i32 %a, i32 %b) {
  %and = and i32 %b, 31
  %shr = lshr i32 %a, %and
  ret i32 %shr
}

Should become:

define i32 @f(i32 %a, i32 %b) {
  %shr = lshr i32 %a, %and
  ret i32 %shr
}
(In reply to comment #1)
> define i32 @f(i32 %a, i32 %b) {
>   %shr = lshr i32 %a, %and
>   ret i32 %shr
> }

Er:

define i32 @f(i32 %a, i32 %b) #0 {
  %shr = lshr i32 %a, %b    <--- if %b is over 31, it's UB in LLVM
  ret i32 %shr
}
Here's an InstCombine patch for demanded bits that should do what we want, but I think this exposes a logic bug somewhere else in InstCombine, so I need to fix that before proceeding:

Index: lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp
===================================================================
--- lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp (revision 268229)
+++ lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp (working copy)
@@ -234,6 +234,24 @@
   if (Depth == 0 && !V->hasOneUse())
     DemandedMask = APInt::getAllOnesValue(BitWidth);
 
+  // For any shift instruction, try to simplify the shifted amount operand based
+  // on the demanded bits subset that are valid. For all shifts, it is undefined
+  // behavior if the shift amount is greater than or equal to the number of bits
+  // in the shifted operand.
+  auto simplifyShiftAmt = [this](Instruction *I, APInt DemandedMask,
+                                 APInt &KnownZero, APInt &KnownOne,
+                                 unsigned Depth) {
+    assert(I->isShift() && "Shift instructions only");
+
+    // computeKnownBits() handles vector operands by taking the intersection
+    // of the elements, so we don't need to deal with that here.
+    uint32_t BitWidth = I->getType()->getScalarSizeInBits();
+    unsigned NumValidShiftBits = llvm::Log2_32_Ceil(BitWidth);
+    DemandedMask &= APInt::getLowBitsSet(BitWidth, NumValidShiftBits);
+    return SimplifyDemandedBits(I->getOperandUse(1), DemandedMask, KnownZero,
+                                KnownOne, Depth + 1);
+  };
+
   switch (I->getOpcode()) {
   default:
     computeKnownBits(I, KnownZero, KnownOne, Depth, CxtI);
@@ -583,6 +601,9 @@
       // low bits known zero.
       if (ShiftAmt)
         KnownZero |= APInt::getLowBitsSet(BitWidth, ShiftAmt);
+    } else {
+      if (simplifyShiftAmt(I, DemandedMask, RHSKnownZero, RHSKnownOne, Depth))
+        return I;
     }
     break;
   case Instruction::LShr:
@@ -609,6 +630,9 @@
         APInt HighBits(APInt::getHighBitsSet(BitWidth, ShiftAmt));
         KnownZero |= HighBits;  // high bits known zero.
       }
+    } else {
+      if (simplifyShiftAmt(I, DemandedMask, RHSKnownZero, RHSKnownOne, Depth))
+        return I;
     }
     break;
   case Instruction::AShr:
@@ -669,6 +693,9 @@
     } else if ((KnownOne & SignBit) != 0) { // New bits are known one.
       KnownOne |= HighBits;
     }
+    } else {
+      if (simplifyShiftAmt(I, DemandedMask, RHSKnownZero, RHSKnownOne, Depth))
+        return I;
     }
     break;
   case Instruction::SRem:
The patch in comment 3 has at least 2 bugs of its own, but I still think there's an external bug to chase down.
That is definitely not a valid instcombine because it changes program behavior. You're creating UB where there wasn't UB before.

You can only remove AND masking in target-specific SelectionDAG with target-specific nodes that are defined to have target-specific behavior on shift sizes >= bit width.
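[Editor's note: to make the point above concrete, a minimal C sketch (not from the attached repro; function names are invented):]

#include <stdint.h>

uint32_t masked(uint32_t a, uint32_t b)   { return a >> (b & 31); } /* defined for every b */
uint32_t unmasked(uint32_t a, uint32_t b) { return a >> b; }        /* UB in C (and undefined at the IR level) for b >= 32 */

/* For b == 40, masked(a, 40) is a >> 8 on every target, but rewriting masked()
 * into unmasked() in IR would turn that call into undefined behavior. The mask
 * can only disappear late, once the shift has become a target-specific node
 * whose out-of-range behavior is actually known. */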
(In reply to comment #5)
> That is definitely not a valid instcombine because it changes program
> behavior. You're creating UB where there wasn't UB before.
>
> You can only remove AND masking in target-specific SelectionDAG with
> target-specific nodes that are defined to have target-specific behavior on
> shift sizes >= bit width.

Yikes, yes. It does seem like a common enough case that there should be a DAGCombine that uses a target hook for this. PPC, ARM, and x86 all have the same masking behavior in HW from what I see.
PPC has different behavior. See e.g.
https://www.ibm.com/support/knowledgecenter/ssw_aix_72/com.ibm.aix.alangref/idalangref_sraw_instrs.htm
or
https://www.ibm.com/support/knowledgecenter/ssw_aix_72/com.ibm.aix.alangref/idalangref_srd_instrs.htm

The official descriptions are specified in a fairly roundabout way, but it boils down to this:

uint32_t PPC_srw(uint32_t a, uint32_t b) {
  b &= 63;
  return (b < 32) ? (a >> b) : 0;
}

uint64_t PPC_srd(uint64_t a, uint64_t b) {
  b &= 127;
  return (b < 64) ? (a >> b) : 0;
}

which is quite different from what x86 and ARM64 do. I believe 32-bit ARM does something different still.
I believe the rules are:

x86: modulo bitsize
ARM32: modulo 256
PPC: modulo bitsize*2 (like Fabian pointed out)
x86 SSE: modulo 2^64 (I'm not kidding)

there is *definitely* no consistent behavior here.
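[Editor's note: for concreteness, a rough C model of three of the scalar rules above (helper names are invented; only unsigned right shifts are modeled):]

#include <stdint.h>

/* x86 SHR r32: the count is taken modulo 32. */
uint32_t x86_shr32(uint32_t a, uint32_t b) {
  return a >> (b & 31);
}

/* ARM32 LSR by register: only the bottom byte of the count register is used;
 * counts of 32..255 produce 0. */
uint32_t arm32_lsr32(uint32_t a, uint32_t b) {
  uint32_t s = b & 255;
  return (s < 32) ? (a >> s) : 0;
}

/* PPC srw, matching the PPC_srw definition in comment 7: count modulo 64,
 * counts of 32..63 produce 0. */
uint32_t ppc_srw32(uint32_t a, uint32_t b) {
  uint32_t s = b & 63;
  return (s < 32) ? (a >> s) : 0;
}

/* For b == 40: x86 gives a >> 8, while ARM32 and PPC give 0. */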
(In reply to comment #8)
> I believe the rules are:
>
> x86: modulo bitsize
> ARM32: modulo 256
> PPC: modulo bitsize*2 (like Fabian pointed out)
> x86 SSE: modulo 2^64 (I'm not kidding)
>
> there is *definitely* no consistent behavior here.

ARM NEON (both AArch64 and A32): shift by (int8_t)(dist % 256) - negative values turn left shifts into right shifts (there's no variable right shift instr)
PPC VMX: modulo bitsize

so not only is there a plethora of behaviors, everyone seems to have different rules between the scalar and vector parts of their ISA.
Unless I'm still out to lunch, there is something missing in the IR's known bits calcs:

define i32 @shift_amount_is_either_zero_or_bogus(i32 %a, i32 %b) {
  %and = and i32 %b, 1024
  %shl = shl i32 %a, %and
  ret i32 %shl
}

define i32 @shift_amount_is_known_bogus(i32 %a, i32 %b) {
  %or = or i32 %b, 1024
  %shl = shl i32 %a, %or
  ret i32 %shl
}

There's no IR optimization happening in those cases.
SimplifyDemandedUseBits doesn't seem to have any cases for the RHS of shifts, which might be related.
Ping!

Related issue: "a >> (64 - b)"

define i64 @f(i64 %a, i64 %b) {
  %t0 = sub i64 64, %b
  %t1 = lshr i64 %a, %t0
  ret i64 %t1
}

produces

  orr w8, wzr, #0x40
  sub x8, x8, x1
  lsr x0, x0, x8

when it could instead be

  neg x8, x1
  lsr x0, x0, x8

In this case, it saves an instruction. If there are multiple such shifts in a row, the literal 64 is only materialized once, but it still (unnecessarily) takes up a register.
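[Editor's note: as an aside, one common source of this "width minus shift amount" shape is the portable rotate idiom; a small C sketch for illustration only (not the reporter's actual code):]

#include <stdint.h>

/* UB-free rotate-left. The right-shift amount is (64 - b) masked into range;
 * since AArch64 interprets shift amounts modulo 64, "64 - b" and "-b" are
 * interchangeable there, which is why a single neg suffices and the literal 64
 * never needs to be materialized. */
uint64_t rotl64(uint64_t a, uint64_t b) {
  return (a << (b & 63)) | (a >> ((64 - b) & 63));
}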
(In reply to comment #10)
> Unless I'm still out to lunch, there is something missing in the IR's known
> bits calcs:
>
> define i32 @shift_amount_is_either_zero_or_bogus(i32 %a, i32 %b) {
>   %and = and i32 %b, 1024
>   %shl = shl i32 %a, %and
>   ret i32 %shl
> }
>
> define i32 @shift_amount_is_known_bogus(i32 %a, i32 %b) {
>   %or = or i32 %b, 1024
>   %shl = shl i32 %a, %or
>   ret i32 %shl
> }
>
> There's no IR optimization happening in those cases.

I don't remember exactly when, but we get those cases now:

$ ./opt -instcombine 27582.ll -S
...
define i32 @shift_amount_is_either_zero_or_bogus(i32 %a, i32 %b) {
  ret i32 %a
}

define i32 @shift_amount_is_known_bogus(i32 %a, i32 %b) {
  ret i32 undef
}
The change below should address the original issue on aarch64:

D46844: [AArch64] Take advantage of variable shift/rotate amount implicit mod operation.
https://reviews.llvm.org/D46844
Original issue resolved by commit https://reviews.llvm.org/rL333214