Created attachment 11871
Small repro with generated code

Both of the functions in this file should be equivalent; the former ("good_...") does generate the expected code, but the latter ("bad_...") ends up "de-SIMDifying" the function almost completely. The only difference between the two is that the former declares the magic value at global scope whereas the latter uses the _mm_setr_epi16 intrinsic.

This was tested using the official Clang 3.4 release binaries, not trunk, but the "Version" field doesn't have 3.4 yet.
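For readers without access to the attachment, here is a rough sketch of what the two variants might look like. It is a reconstruction guessed from the function name and the generated code quoted in this thread, not the attachment itself: the names kShiftMultipliers, unpack_with, sketch_global_constant, sketch_setr_constant, variants_agree and check_all_inputs are all hypothetical, and loading the global constant through an aligned array is an assumption about how the "good" version declares it.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical reconstruction; the real attachment may differ.
// Lane i is multiplied by 2^(14 - 2*i), which moves bits [2i, 2i+1] of the
// input into the top two bits of that 16-bit lane.
alignas(16) static const uint16_t kShiftMultipliers[8] = {
    1 << 14, 1 << 12, 1 << 10, 1 << 8, 1 << 6, 1 << 4, 1 << 2, 1 << 0};

static __m128i unpack_with(__m128i multipliers, uint16_t x) {
    __m128i v = _mm_set1_epi16(static_cast<short>(x));  // broadcast x to all 8 lanes
    v = _mm_mullo_epi16(v, multipliers);                // keep low 16 bits of each product
    return _mm_srli_epi16(v, 14);                       // logical shift: top 2 bits -> bottom
}

// Variant 1 (assumed shape of "good_..."): constant declared at global scope.
__m128i sketch_global_constant(uint16_t x) {
    const __m128i m =
        _mm_load_si128(reinterpret_cast<const __m128i*>(kShiftMultipliers));
    return unpack_with(m, x);
}

// Variant 2 (assumed shape of "bad_..."): constant built with _mm_setr_epi16.
__m128i sketch_setr_constant(uint16_t x) {
    const __m128i m = _mm_setr_epi16(1 << 14, 1 << 12, 1 << 10, 1 << 8,
                                     1 << 6, 1 << 4, 1 << 2, 1 << 0);
    return unpack_with(m, x);
}

// True if both variants agree and lane i holds bits [2i, 2i+1] of x.
bool variants_agree(uint16_t x) {
    alignas(16) uint16_t a[8], b[8];
    _mm_store_si128(reinterpret_cast<__m128i*>(a), sketch_global_constant(x));
    _mm_store_si128(reinterpret_cast<__m128i*>(b), sketch_setr_constant(x));
    for (int i = 0; i < 8; ++i) {
        if (a[i] != b[i] || a[i] != ((x >> (2 * i)) & 3)) return false;
    }
    return true;
}

// Exhaustive check over every 16-bit input.
void check_all_inputs() {
    for (uint32_t x = 0; x <= 0xFFFF; ++x) {
        assert(variants_agree(static_cast<uint16_t>(x)));
    }
}
```

Both variants should compute the same result; the report is only about the quality of the code generated for the second one.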
Hi Fabian,

Trunk revision 201271 fixed this issue. Now the compiler produces the following SSE code for function 'bad_unpack_2bits_to_16' from your reproducible:

###
.LCPI1_0:
        .short  16384                   # 0x4000
        .short  4096                    # 0x1000
        .short  1024                    # 0x400
        .short  256                     # 0x100
        .short  64                      # 0x40
        .short  16                      # 0x10
        .short  4                       # 0x4
        .short  1                       # 0x1
        .text
        .globl  _Z22bad_unpack_2bits_to_16t
        .align  16, 0x90
        .type   _Z22bad_unpack_2bits_to_16t,@function
_Z22bad_unpack_2bits_to_16t:            # @_Z22bad_unpack_2bits_to_16t
        .cfi_startproc
# BB#0:                                 # %entry
        movd    %edi, %xmm0
        punpcklwd       %xmm0, %xmm0    # xmm0 = xmm0[0,0,1,1,2,2,3,3]
        pshufd  $0, %xmm0, %xmm0        # xmm0 = xmm0[0,0,0,0]
        pmullw  .LCPI1_0(%rip), %xmm0
        psrlw   $14, %xmm0
        retq
###
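As a sanity check on the sequence above, here is a small scalar model of what the movd/punpcklwd/pshufd broadcast followed by pmullw and psrlw computes per 16-bit lane. This is only an illustrative sketch: the function name model_unpack and the array name kMul are made up for this comment, and the multiplier table is copied from .LCPI1_0.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Scalar model of the quoted SSE sequence (illustrative, hypothetical names).
std::array<uint16_t, 8> model_unpack(uint16_t x) {
    // Same values as the .LCPI1_0 constant pool entry.
    static const uint16_t kMul[8] = {16384, 4096, 1024, 256, 64, 16, 4, 1};
    std::array<uint16_t, 8> lanes{};
    for (int i = 0; i < 8; ++i) {
        // After the broadcast every lane holds x; pmullw keeps only the
        // low 16 bits of each 16x16 product.
        uint16_t product =
            static_cast<uint16_t>(static_cast<uint32_t>(x) * kMul[i]);
        // psrlw $14 is a logical right shift, leaving the top two bits.
        lanes[i] = static_cast<uint16_t>(product >> 14);
    }
    return lanes;
}
```

Under this model lane i ends up holding bits [2i, 2i+1] of the input, i.e. (x >> 2*i) & 3, which matches what the function name suggests: eight 2-bit fields unpacked into eight 16-bit lanes.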
Resolving this bug as FIXED. Trunk revision 201271 fixed this issue and I don't think there is more work to be done on this. Fabian, could you please verify that this now works for you as well? Thanks.
Yep, bug is fixed. Thanks!