LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filed before November 26, 2021. Use GitHub to submit LLVM bugs.

Bug 41316 - _mm_avg_epu16 preceded by a left shift results in poor code
Summary: _mm_avg_epu16 preceded by a left shift results in poor code
Status: RESOLVED FIXED
Alias: None
Product: libraries
Classification: Unclassified
Component: Backend: X86
Version: 8.0
Hardware: PC Linux
Importance: P enhancement
Assignee: Simon Pilgrim
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-03-30 02:23 PDT by Fabian Giesen
Modified: 2019-03-30 11:10 PDT
CC List: 4 users

See Also:
Fixed By Commit(s): r357351


Attachments

Description Fabian Giesen 2019-03-30 02:23:01 PDT
Reproducer:

-

#include <xmmintrin.h>

__m128i f(__m128i a, __m128i b)
{
  __m128i a_shifted = _mm_slli_epi16(a, 2);
  __m128i b_shifted = _mm_slli_epi16(b, 2);
  return _mm_avg_epu16(a_shifted, b_shifted);
}

-

with clang 5.0.0 or older, I get the expected:

-

f(long long __vector(2), long long __vector(2)):
        psllw   xmm0, 2
        psllw   xmm1, 2
        pavgw   xmm0, xmm1
        ret

-

but it looks like somewhere between 5.0.0 and 6.0.0 things broke, and now the same code results in:

-

.LCPI0_0:
        .short  1                       # 0x1
        .short  1                       # 0x1
        .short  1                       # 0x1
        .short  1                       # 0x1
        .short  1                       # 0x1
        .short  1                       # 0x1
        .short  1                       # 0x1
        .short  1                       # 0x1
f(long long __vector(2), long long __vector(2)):
        psllw   xmm0, 2
        psllw   xmm1, 2
        pxor    xmm2, xmm2
        movdqa  xmm3, xmm1
        punpckhwd       xmm3, xmm2
        punpcklwd       xmm1, xmm2
        por     xmm0, xmmword ptr [rip + .LCPI0_0]
        movdqa  xmm4, xmm0
        punpckhwd       xmm4, xmm2
        paddd   xmm4, xmm3
        punpcklwd       xmm0, xmm2
        paddd   xmm0, xmm1
        pslld   xmm4, 15
        psrad   xmm4, 16
        pslld   xmm0, 15
        psrad   xmm0, 16
        packssdw        xmm0, xmm4
        ret

-

Everything after the "psllw xmm1, 2" appears to be a replacement expansion for pavgw. This particular example was compiled with 8.0.0, but 6.x and 7.x are similar.

If I just shift one of the inputs, I do get a pavgw; so the left shifts appear to be important in some way. (Reduced from a more complex example.)
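For illustration, a single-shift variant along these lines (a sketch of the case described above, not part of the original report; the function name g and the emmintrin.h include are assumptions) still compiles to pavgw:

-

#include <emmintrin.h>

/* Only one input is shifted; per the description above, the backend still
   recognizes the average here and emits pavgw (sketch only). */
__m128i g(__m128i a, __m128i b)
{
  __m128i a_shifted = _mm_slli_epi16(a, 2);
  return _mm_avg_epu16(a_shifted, b);
}

-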
Comment 1 Simon Pilgrim 2019-03-30 03:44:33 PDT
https://gcc.godbolt.org/z/t-kLDs

define <2 x i64> @_Z1fDv2_xS_(<2 x i64>, <2 x i64>) {
  %3 = bitcast <2 x i64> %0 to <8 x i16>
  %4 = shl <8 x i16> %3, <i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2>
  %5 = bitcast <2 x i64> %1 to <8 x i16>
  %6 = shl <8 x i16> %5, <i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2>
  %7 = zext <8 x i16> %6 to <8 x i32>
  %8 = or <8 x i16> %4, <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
  %9 = zext <8 x i16> %8 to <8 x i32>
  %10 = add nuw nsw <8 x i32> %9, %7
  %11 = lshr <8 x i32> %10, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %12 = trunc <8 x i32> %11 to <8 x i16>
  %13 = bitcast <8 x i16> %12 to <2 x i64>
  ret <2 x i64> %13
}

We expand avg to trunc(lshr(add(add(zext(a),1),zext(b)),1)) internally; the shifts allow the add with 1 to be turned into an or with 1, which causes the avg lowering to fail in the backend.
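A scalar sketch of that interaction (illustrative C, not taken from the LLVM sources): once a lane has been shifted left, its low bit is known to be zero, so the rounding "add 1" inside the avg expansion is equivalent to "or 1", and after that canonicalization the backend no longer recognizes the average pattern.

#include <assert.h>
#include <stdint.h>

/* Scalar model of the pattern the backend looks for: per-lane unsigned
   average with rounding, avg(a, b) = (a + b + 1) >> 1 in a wider type. */
static uint16_t avg_u16(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a + (uint32_t)b + 1u) >> 1);
}

int main(void)
{
    for (uint32_t v = 0; v <= 0xFFFF; ++v) {
        uint16_t a = (uint16_t)(v << 2);   /* low bit known zero after shl */
        /* Because the low bit of 'a' is zero, a + 1 == (a | 1), so the IR's
           "add %a, 1" is canonicalized to "or %a, 1", hiding the pattern
           from the pavgw matcher (until the fix mentioned below). */
        assert((uint16_t)(a + 1u) == (uint16_t)(a | 1u));
        /* The "+1" being rewritten is the rounding term inside avg: */
        assert(avg_u16(a, 0) == (uint16_t)((a + 1u) >> 1));
    }
    return 0;
}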
Comment 2 Simon Pilgrim 2019-03-30 11:10:02 PDT
Fixed in trunk rL357351.

Given that it's an older regression, it's unlikely we'll pick this up in the 8.0 branch.