Bug 35869 - _mm_setzero_si64 violates the spec by generating xorps instead of pxor
Summary: _mm_setzero_si64 violates the spec by generating xorps instead of pxor
Status: RESOLVED INVALID
Alias: None
Product: clang
Classification: Unclassified
Component: C++
Version: trunk
Hardware: PC All
Importance: P enhancement
Assignee: Unassigned Clang Bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-01-09 02:22 PST by Gonzalo BG
Modified: 2018-01-28 11:01 PST
CC List: 4 users

See Also:
Fixed By Commit(s): 322525


Description Gonzalo BG 2018-01-09 02:22:35 PST
In the `<mmintrin.h>` header, the intrinsic `_mm_setzero_si64` [0] is implemented as follows:

static __inline__ __m64 __DEFAULT_FN_ATTRS
_mm_setzero_si64(void)
{
    return (__m64){ 0LL };
}

generates `xorps` [1], which violates the spec [2]: it says the intrinsic should generate the instruction sequence `pxor mm, mm`.

[0]: https://github.com/llvm-mirror/clang/blob/master/lib/Headers/mmintrin.h#L1296
[1]: https://godbolt.org/g/FnQYkf
[2]: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_setzero_si64&expand=4727,4727
Comment 1 Gonzalo BG 2018-01-09 02:54:14 PST
AFAIK `pxor` generally has much better throughput than `xorps`: while CPUs typically have only one floating-point unit available for logic operations, they have several more for integer operations [0].

Since I am not passing any target/cpu flags while compiling, I would expect this to emit a `pxor`. However, if I were targeting Sandy Bridge or better (a CPU that detects `xor $reg, $reg` and turns it into a no-op via register renaming), then arguably it doesn't matter (IIRC both instruction sequences have the same length).

[0]: https://software.intel.com/en-us/forums/intel-isa-extensions/topic/279292
Comment 2 Eli Friedman 2018-01-09 08:32:21 PST
Please include a testcase. You'll get different instruction sequences depending on how you use the result of _mm_setzero_si64 (and that's expected, because the most efficient way to produce zero depends on how the result is used).
Comment 3 Gonzalo BG 2018-01-09 09:23:46 PST
The test case was provided in the godbolt link; here is the code for completeness:

#include <mmintrin.h>
auto foo() { return _mm_setzero_si64(); }

produces:

foo(): 
  xorps xmm0, xmm0
  ret

It just creates a library with a single exported function `foo` that calls _mm_setzero_si64 (how the result will be used cannot be known inside this library).
Comment 4 Eli Friedman 2018-01-09 11:52:39 PST
Okay... so that function requests a zero value in an SSE register.

It makes absolutely no sense for that to generate MMX "pxor mm, mm", since that would require moving the result from an MMX register to an SSE register (which is more instructions, and slow).

In terms of "xorps xmm, xmm" vs "pxor xmm, xmm", we generally prefer the former because it has a shorter encoding. There's a domain-fixing pass which changes it to pxor if you use it as the operand of an integer instruction (in the same function).
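
For example, something along these lines (just a sketch with the 128-bit intrinsics for illustration; the exact output will of course depend on the compiler version and flags):

#include <emmintrin.h>

// The zero only flows out to the caller, so I'd expect it to be
// materialized with xorps xmm0, xmm0.
__m128i just_zero(void) {
    return _mm_setzero_si128();
}

// Here the zero feeds an integer instruction (pcmpeqd), so the
// domain-fixing pass should pick pxor for it instead.
__m128i is_zero(__m128i a) {
    return _mm_cmpeq_epi32(a, _mm_setzero_si128());
}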

I think the current state is fine, but if you have a benchmark showing a problem, we could reconsider.
Comment 5 Gonzalo BG 2018-01-09 11:57:00 PST
> Okay... so that function requests a zero value in an SSE register.

Oh. I screwed up, I thought that `__m64` was always going to be an MMX register. What is happening in this case? Is the calling convention requiring it to be turned into an SSE register?
Comment 6 Eli Friedman 2018-01-09 12:04:24 PST
(In reply to Gonzalo BG from comment #5)
> > Okay... so that function requests a zero value in an SSE register.
> 
> Oh. I screwed up, I thought that `__m64` was always going to be an MMX
> register. What is happening in this case? Is the calling convention
> requiring it to be turned into an SSE register?

Yes, exactly; the x86-64 ABI rules say that __m64 is passed/returned in SSE registers.
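
As a trivial illustration (just a sketch; I'd expect roughly the following, though the exact output depends on the compiler version):

#include <mmintrin.h>

// Because of the x86-64 ABI, `v` arrives in xmm0 and the __m64 return
// value also leaves in xmm0, so this should compile to just a `ret`;
// no MMX register is involved at all.
__m64 passthrough(__m64 v) {
    return v;
}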
Comment 7 Gonzalo BG 2018-01-09 12:05:18 PST
Thanks @Eli, you just saved me a lot of time!
Comment 8 Craig Topper 2018-01-09 12:08:03 PST
I'm still not sure we'll generate a "pxor mm, mm" if it's being passed to an MMX instruction and not being passed in or out of a function.
Comment 9 Gonzalo BG 2018-01-09 12:17:40 PST
@Craig https://godbolt.org/g/c6xdDj 

Here:

#include <xmmintrin.h>
auto foo() {
    __m64 a{64};
    return _mm_cvtpu8_ps(a);
}

_mm_cvtpu8_ps is implemented on top of _mm_setzero_si64, and when the __m64 value is created within the function body it generates the following code, which uses pxor:

foo():                                # @foo()
        mov     eax, 255
        movd    xmm0, eax
        mov     eax, 64
        movd    xmm1, eax
        pand    xmm1, xmm0
        movq    qword ptr [rsp - 24], xmm1
        pxor    xmm0, xmm0
        movq    qword ptr [rsp - 16], xmm0
        movq    mm0, qword ptr [rsp - 24]
        punpcklbw       mm0, qword ptr [rsp - 16] # mm0 = mm0[0],mem[0],mm0[1],mem[1],mm0[2],mem[2],mm0[3],mem[3]
        movq    qword ptr [rsp - 8], xmm0
        movq    mm1, qword ptr [rsp - 8]
        pcmpgtw mm1, mm0
        movq    mm2, mm0
        punpckhwd       mm2, mm1        # mm2 = mm2[2],mm1[2],mm2[3],mm1[3]
        cvtpi2ps        xmm0, mm2
        movlhps xmm0, xmm0              # xmm0 = xmm0[0,0]
        punpcklwd       mm0, mm1        # mm0 = mm0[0],mm1[0],mm0[1],mm1[1]
        cvtpi2ps        xmm0, mm0
        ret

However, if I move `__m64 a` to a function argument, then the disassembly uses xorps because of what Eli Friedman mentioned above: an __m64 function argument is required to be in an SSE register by the calling convention, and moving it to an MMX register and then doing a pxor would be slower than just doing an xorps.
Comment 10 Craig Topper 2018-01-09 12:25:14 PST
But it generated a pxor to an xmm register not an mmx register. And then it spilled to stack and reloaded into an mmx register. So it’s still not ideal or to spec.
Comment 11 Simon Pilgrim 2018-01-10 09:18:51 PST
(In reply to Craig Topper from comment #10)
> But it generated a pxor to an xmm register not an mmx register. And then it
> spilled to stack and reloaded into an mmx register. So it’s still not ideal
> or to spec.

https://reviews.llvm.org/D41908 should help with this
Comment 12 Simon Pilgrim 2018-01-16 03:46:25 PST
(In reply to Simon Pilgrim from comment #11)
> (In reply to Craig Topper from comment #10)
> > But it generated a pxor to an xmm register not an mmx register. And then it
> > spilled to stack and reloaded into an mmx register. So it’s still not ideal
> > or to spec.
> 
> https://reviews.llvm.org/D41908 should help with this

Committed at rL322525
Comment 13 Gonzalo BG 2018-01-16 04:02:07 PST
Thanks!
Comment 14 Gonzalo BG 2018-01-28 10:14:36 PST
A similar issue exists with _mm256_xor_si256:

This code (see it live: https://godbolt.org/g/jJHBMi):

#include <immintrin.h>

__attribute__((__always_inline__, __nodebug__, __target__("avx2")))
__m256i foo(__m256i a, __m256i b) {
    auto c = _mm256_add_epi64(a, b);
    return _mm256_xor_si256(a, c);
}

__attribute__((__always_inline__, __nodebug__, __target__("avx2")))
__m256i bar(__m256i a, __m256i b) {
    return _mm256_xor_si256(a, b);
}

generates this assembly: 

foo(long long __vector(4), long long __vector(4)):                          # @foo(long long __vector(4), long long __vector(4))
        push    rbp
        mov     rbp, rsp
        and     rsp, -32
        sub     rsp, 32
        vmovdqa ymm0, ymmword ptr [rbp + 16]
        vpaddq  ymm1, ymm0, ymmword ptr [rbp + 48]
        vpxor   ymm0, ymm1, ymm0
        mov     rsp, rbp
        pop     rbp
        ret
bar(long long __vector(4), long long __vector(4)):                          # @bar(long long __vector(4), long long __vector(4))
        push    rbp
        mov     rbp, rsp
        and     rsp, -32
        sub     rsp, 32
        vmovaps ymm0, ymmword ptr [rbp + 48]
        vxorps  ymm0, ymm0, ymmword ptr [rbp + 16]
        mov     rsp, rbp
        pop     rbp
        ret

So it looks to me like LLVM/clang chooses vpxor if the operations happen in the integer domain and vxorps otherwise.

However, I still wonder why vxorps is preferred by default for integer vectors instead of vpxor. Shouldn't it be the other way around? Modern CPUs have more ALUs for vector integer operations than for floating point (~3 vs 1), so why is `vxorps` preferred over `vpxor` for integer vectors when no other integer operations happen?
Comment 15 Craig Topper 2018-01-28 10:20:47 PST
This is somewhat of a historical artifact of the fact that SSE xorps has a shorter encoding than SSE2 pxor.

Someone needs to finish driving this long-stale patch through: https://reviews.llvm.org/D7401
Comment 16 Gonzalo BG 2018-01-28 10:41:04 PST
I've pinged Chandler on IRC, but maybe someone with Phabricator access could ping him again on the review?
Comment 17 Simon Pilgrim 2018-01-28 10:50:38 PST
(In reply to Gonzalo BG from comment #16)
> I've pinged chandler on IRC but maybe someone with phabricator access could
> ping him again on the review?

Please raise a new bug for this if it's a problem - this bug was about MMX (and was closed as invalid....), although we did end up doing some improvements on the back of it.

The x86 domain code has seen a lot of work since Feb 2015 so D7401 is going to need a considerable rewrite.
Comment 18 Gonzalo BG 2018-01-28 11:01:08 PST
I've filed https://bugs.llvm.org/show_bug.cgi?id=36127