define <16 x i16> @test(<16 x i16> %a, <16 x i16> %b) {
  %shr = lshr <16 x i16> %a, %b
  %shuf = shufflevector <16 x i16> zeroinitializer, <16 x i16> %shr, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 30, i32 15>
  ret <16 x i16> %shuf
}

With -mcpu=haswell results in:

        vpxor %xmm2, %xmm2, %xmm2
        vpunpckhwd %ymm2, %ymm1, %ymm1 # ymm1 = ymm1[4],ymm2[4],ymm1[5],ymm2[5],ymm1[6],ymm2[6],ymm1[7],ymm2[7],ymm1[12],ymm2[12],ymm1[13],ymm2[13],ymm1[14],ymm2[14],ymm1[15],ymm2[15]
        vpunpckhwd %ymm0, %ymm2, %ymm0 # ymm0 = ymm2[4],ymm0[4],ymm2[5],ymm0[5],ymm2[6],ymm0[6],ymm2[7],ymm0[7],ymm2[12],ymm0[12],ymm2[13],ymm0[13],ymm2[14],ymm0[14],ymm2[15],ymm0[15]
        vpsrlvd %ymm1, %ymm0, %ymm0
        vpand .LCPI0_0(%rip), %ymm0, %ymm0

While in LLVM 7 it was:

        vpxor xmm2, xmm2, xmm2
        vpunpckhwd ymm1, ymm1, ymm2 # ymm1 = ymm1[4],ymm2[4],ymm1[5],ymm2[5],ymm1[6],ymm2[6],ymm1[7],ymm2[7],ymm1[12],ymm2[12],ymm1[13],ymm2[13],ymm1[14],ymm2[14],ymm1[15],ymm2[15]
        vpunpckhwd ymm0, ymm2, ymm0 # ymm0 = ymm2[4],ymm0[4],ymm2[5],ymm0[5],ymm2[6],ymm0[6],ymm2[7],ymm0[7],ymm2[12],ymm0[12],ymm2[13],ymm0[13],ymm2[14],ymm0[14],ymm2[15],ymm0[15]
        vpsrlvd ymm0, ymm0, ymm1
        vpand ymm0, ymm0, ymmword ptr [rip + .LCPI0_0]

Godbolt: https://godbolt.org/z/4SyEhQ

I *think* this transformation is not correct, though maybe my vector foo is too weak.
The debug log has:

With: t53: v32i8 = BUILD_VECTOR undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, Constant:i8<26>, Constant:i8<27>, undef:i8, undef:i8
Combining: t50: v32i8 = X86ISD::PSHUFB t48, t53
Creating new node: t54: v8i32 = undef
Creating new node: t55: v16i16 = bitcast t23
Creating constant: t56: i8 = Constant<-28>
Creating new node: t57: v16i16 = X86ISD::PSHUFLW t55, Constant:i8<-28>
Creating new node: t58: v32i8 = bitcast t57
... into: t58: v32i8 = bitcast t57

Which looks like a non-identity pshufb is replaced with an identity pshuflw. This happens via matchUnaryPermuteShuffle(), though I haven't looked further.
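To check that the PSHUFLW immediate in the log really is an identity permute, here is a small sketch that decodes the 8-bit immediate into its four 2-bit word selectors. The helper name decode_pshuflw_imm is made up for illustration and is not an LLVM function; the only input taken from the log is the constant -28.

```python
def decode_pshuflw_imm(imm8):
    """Split a PSHUFLW/PSHUFHW immediate into its four 2-bit selectors.

    Hypothetical helper for illustration only. Selector i names the source
    word (within the 4-word group) that is written to destination word i.
    """
    imm = imm8 & 0xFF  # the log prints the i8 as signed; view it as unsigned
    return [(imm >> (2 * i)) & 3 for i in range(4)]


# Constant:i8<-28> from the debug log is 0xE4 when viewed as unsigned.
selectors = decode_pshuflw_imm(-28)
is_identity = selectors == [0, 1, 2, 3]
```

Each destination word selects itself, so the replacement node does not move any data, while the PSHUFB mask it replaced copies bytes 26 and 27 into positions 28 and 29.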
Err sorry, I ended up copy&pasting the wrong outputs.

This is LLVM trunk:

        vpxor xmm2, xmm2, xmm2
        vpunpckhwd ymm1, ymm1, ymm2 # ymm1 = ymm1[4],ymm2[4],ymm1[5],ymm2[5],ymm1[6],ymm2[6],ymm1[7],ymm2[7],ymm1[12],ymm2[12],ymm1[13],ymm2[13],ymm1[14],ymm2[14],ymm1[15],ymm2[15]
        vpunpckhwd ymm0, ymm2, ymm0 # ymm0 = ymm2[4],ymm0[4],ymm2[5],ymm0[5],ymm2[6],ymm0[6],ymm2[7],ymm0[7],ymm2[12],ymm0[12],ymm2[13],ymm0[13],ymm2[14],ymm0[14],ymm2[15],ymm0[15]
        vpsrlvd ymm0, ymm0, ymm1
        vpand ymm0, ymm0, ymmword ptr [rip + .LCPI0_0]

This is LLVM 7:

        vpxor xmm2, xmm2, xmm2
        vpunpckhwd ymm1, ymm1, ymm2 # ymm1 = ymm1[4],ymm2[4],ymm1[5],ymm2[5],ymm1[6],ymm2[6],ymm1[7],ymm2[7],ymm1[12],ymm2[12],ymm1[13],ymm2[13],ymm1[14],ymm2[14],ymm1[15],ymm2[15]
        vpunpckhwd ymm0, ymm2, ymm0 # ymm0 = ymm2[4],ymm0[4],ymm2[5],ymm0[5],ymm2[6],ymm0[6],ymm2[7],ymm0[7],ymm2[12],ymm0[12],ymm2[13],ymm0[13],ymm2[14],ymm0[14],ymm2[15],ymm0[15]
        vpsrlvd ymm0, ymm0, ymm1
        vpshufb ymm0, ymm0, ymmword ptr [rip + .LCPI0_0] # ymm0 = zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,ymm0[26,27],zero,zero
I'll take a look
Looks like this is due to a typo... The PSHUFLW/PSHUFHW code is constructing LoMask and HiMask from Mask rather than RepeatedMask. In this case the low half of the original mask is all-undef so it ends up constructing an identity shuffle.
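To illustrate the mix-up, here is a simplified Python model of the matching logic. It is a sketch only, not the actual code in X86ISelLowering.cpp: all function names are hypothetical, the use_repeated_mask flag just switches between the two behaviours, and the v16i16 mask (word 14 taken from word 13, everything else undef) is my reading of the PSHUFB byte mask in the debug log above.

```python
UNDEF = -1


def repeated_lane_mask(mask, lane_elts=8):
    """Fold a mask into one 128-bit-lane pattern; None if lanes disagree."""
    repeated = [UNDEF] * lane_elts
    for i, m in enumerate(mask):
        if m == UNDEF:
            continue
        if m // lane_elts != i // lane_elts:
            return None  # element crosses a lane boundary
        slot, local = i % lane_elts, m % lane_elts
        if repeated[slot] not in (UNDEF, local):
            return None  # lanes request different permutes
        repeated[slot] = local
    return repeated


def undef_or_in_range(mask, lo, hi):
    return all(m == UNDEF or lo <= m < hi for m in mask)


def sequential_or_undef(mask, start):
    return all(m == UNDEF or m == start + i for i, m in enumerate(mask))


def v4_shuffle_imm(mask):
    """Pack four selectors into an imm8; undef slots default to identity."""
    imm = 0
    for i, m in enumerate(mask):
        imm |= (i if m == UNDEF else m) << (2 * i)
    return imm


def match_pshuflw_hw(mask, use_repeated_mask):
    repeated = repeated_lane_mask(mask)
    if repeated is None:
        return None
    # The typo: slicing the full-width mask instead of the per-lane one.
    src = repeated if use_repeated_mask else mask
    lo, hi = src[0:4], src[4:8]
    if undef_or_in_range(lo, 0, 4) and sequential_or_undef(hi, 4):
        return ("PSHUFLW", v4_shuffle_imm(lo))
    if undef_or_in_range(hi, 4, 8) and sequential_or_undef(lo, 0):
        hi_local = [m if m == UNDEF else m - 4 for m in hi]
        return ("PSHUFHW", v4_shuffle_imm(hi_local))
    return None


mask = [UNDEF] * 16
mask[14] = 13  # assumed v16i16 view of the PSHUFB mask in the log

buggy = match_pshuflw_hw(mask, use_repeated_mask=False)
fixed = match_pshuflw_hw(mask, use_repeated_mask=True)
```

In this model the buggy variant only looks at elements 0..7 of the full mask, which are all undef, so it matches PSHUFLW with immediate 0xE4 (printed as -28 in the log). The variant that slices the repeated mask sees the non-sequential element in the high half and rejects the PSHUFLW match.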
https://reviews.llvm.org/D67314
Test in https://reviews.llvm.org/rL371305 and fix in https://reviews.llvm.org/rL371307. If possible, it would be good to have this in LLVM 9. This bug has the dubious honor of miscompiling an AES implementation :(
Merged to release_90 in r371378.