Reproducer:

#include <emmintrin.h>

__m128i f(__m128i a, __m128i b) {
    __m128i a_shifted = _mm_slli_epi16(a, 2);
    __m128i b_shifted = _mm_slli_epi16(b, 2);
    return _mm_avg_epu16(a_shifted, b_shifted);
}

With clang 5.0.0 or older, I get the expected:

f(long long __vector(2), long long __vector(2)):
        psllw   xmm0, 2
        psllw   xmm1, 2
        pavgw   xmm0, xmm1
        ret

but it looks like something broke between 5.0.0 and 6.0.0, and the same code now results in:

.LCPI0_0:
        .short  1                       # 0x1
        .short  1                       # 0x1
        .short  1                       # 0x1
        .short  1                       # 0x1
        .short  1                       # 0x1
        .short  1                       # 0x1
        .short  1                       # 0x1
        .short  1                       # 0x1
f(long long __vector(2), long long __vector(2)):
        psllw   xmm0, 2
        psllw   xmm1, 2
        pxor    xmm2, xmm2
        movdqa  xmm3, xmm1
        punpckhwd       xmm3, xmm2
        punpcklwd       xmm1, xmm2
        por     xmm0, xmmword ptr [rip + .LCPI0_0]
        movdqa  xmm4, xmm0
        punpckhwd       xmm4, xmm2
        paddd   xmm4, xmm3
        punpcklwd       xmm0, xmm2
        paddd   xmm0, xmm1
        pslld   xmm4, 15
        psrad   xmm4, 16
        pslld   xmm0, 15
        psrad   xmm0, 16
        packssdw        xmm0, xmm4
        ret

Everything after the second "psllw xmm1, 2" appears to be a replacement expansion for pavgw. This particular example was compiled with 8.0.0, but 6.x and 7.x are similar. If I shift only one of the inputs, I do get a pavgw, so the left shifts appear to be important in some way. (Reduced from a more complex example.)
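For comparison, a minimal sketch of the one-shift variant (a hypothetical reconstruction; the function name f2 is mine, not the exact code I tested), which still lowers to a single pavgw:

#include <emmintrin.h>

// Only one input is shifted; as noted above, clang still emits pavgw here.
__m128i f2(__m128i a, __m128i b) {
    __m128i a_shifted = _mm_slli_epi16(a, 2);
    return _mm_avg_epu16(a_shifted, b);
}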
https://gcc.godbolt.org/z/t-kLDs

define <2 x i64> @_Z1fDv2_xS_(<2 x i64>, <2 x i64>) {
  %3 = bitcast <2 x i64> %0 to <8 x i16>
  %4 = shl <8 x i16> %3, <i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2>
  %5 = bitcast <2 x i64> %1 to <8 x i16>
  %6 = shl <8 x i16> %5, <i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2>
  %7 = zext <8 x i16> %6 to <8 x i32>
  %8 = or <8 x i16> %4, <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
  %9 = zext <8 x i16> %8 to <8 x i32>
  %10 = add nuw nsw <8 x i32> %9, %7
  %11 = lshr <8 x i32> %10, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %12 = trunc <8 x i32> %11 to <8 x i16>
  %13 = bitcast <8 x i16> %12 to <2 x i64>
  ret <2 x i64> %13
}

We expand avg to trunc(lshr(add(add(zext(a),1),zext(b)),1)) internally. Because the shl leaves the low bits known zero, the add with 1 gets turned into an or with 1 (the "or" at %8 above), and the avg pattern match then fails in the backend.
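To make the failure mode concrete, here is a scalar sketch (my illustration, not compiler code) of the same rewrite: once the low bit of an operand is known zero, add-with-1 and or-with-1 produce identical values, which is why the middle-end canonicalizes one into the other, and why the backend's add-based avg pattern no longer matches:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* The add-based form the backend pattern-matches into pavgw:
   trunc(lshr(add(add(zext(a), 1), zext(b)), 1)) */
static uint16_t avg_add_form(uint16_t a, uint16_t b) {
    return (uint16_t)(((uint32_t)a + 1u + (uint32_t)b) >> 1);
}

/* The canonicalized form: when the low bit of `a` is known zero
   (e.g. after a left shift), `a + 1` is rewritten to `a | 1`.
   Same value, but it no longer matches the pavgw pattern. */
static uint16_t avg_or_form(uint16_t a, uint16_t b) {
    return (uint16_t)(((uint32_t)(a | 1u) + (uint32_t)b) >> 1);
}

int main(void) {
    for (uint32_t i = 0; i < 0x10000; ++i) {
        uint16_t a = (uint16_t)(i << 2); /* low bits known zero, as in the reproducer */
        uint16_t b = (uint16_t)(i * 31);
        assert(avg_add_form(a, b) == avg_or_form(a, b));
    }
    puts("add-with-1 == or-with-1 whenever the low bit is known zero");
    return 0;
}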
Fixed in trunk rL357351. Given that it's an older regression, it's unlikely we'll pick this up in the 8.0 branch.