In the `<mmintrin.h>` header, the intrinsic `_mm_setzero_si64` [0], implemented as follows:

static __inline__ __m64 __DEFAULT_FN_ATTRS
_mm_setzero_si64(void)
{
    return (__m64){ 0LL };
}

generates `xorps` [1], violating the spec [2], which says that it should generate the instruction sequence `pxor mm, mm`.

[0]: https://github.com/llvm-mirror/clang/blob/master/lib/Headers/mmintrin.h#L1296
[1]: https://godbolt.org/g/FnQYkf
[2]: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_setzero_si64&expand=4727,4727
AFAIK, in general `pxor` has much better throughput than `xorps`: CPUs typically have only one floating-point unit for logic operations but several more units for integer operations [0]. Since I am not passing any target/CPU flags while compiling, I would expect this to emit a `pxor`. However, if I were targeting Sandy Bridge or better (a CPU that detects `xor $reg, $reg` and turns it into a no-op via register renaming), then arguably it doesn't matter (IIRC both instruction sequences have the same length).

[0]: https://software.intel.com/en-us/forums/intel-isa-extensions/topic/279292
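As a side note, one way to compare the zeroing sequence across CPU targets without changing the global compile flags is per-function target attributes. This is just a small sketch of my own (the function names and the attribute strings are illustrative, not taken from this report); inspect each function's assembly (e.g. with -S) to see whether the sequence differs:

#include <mmintrin.h>

// Baseline target vs. a newer microarchitecture, selected per function.
__attribute__((target("arch=core2")))
__m64 zero_baseline(void) { return _mm_setzero_si64(); }

__attribute__((target("arch=sandybridge")))
__m64 zero_snb(void) { return _mm_setzero_si64(); }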
Please include a testcase. You'll get different instruction sequences depending on how you use the result of _mm_setzero_si64 (and that's expected, because the most efficient way to produce zero depends on how the result is used).
The test case was provided in the godbolt link; here is the code for completeness:

#include <mmintrin.h>

auto foo() {
    return _mm_setzero_si64();
}

produces:

foo():
        xorps   xmm0, xmm0
        ret

It just creates a library with a single exported function `foo` that calls _mm_setzero_si64 (how the result will be used can't be known in this library).
Okay... so that function requests a zero value in an SSE register. It makes absolutely no sense for that to generate MMX "pxor mm, mm", since that would require moving the result from an MMX register to an SSE register (which is more instructions, and slow). In terms of "xorps xmm, xmm" vs "pxor xmm, xmm", we generally prefer the former because it has a shorter encoding. There's a domain-fixing pass which changes it to pxor if you use it as the operand of an integer instruction (in the same function). I think the current state is fine, but if you have a benchmark showing a problem, we could reconsider.
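To illustrate the domain-fixing behavior described above, here is a minimal sketch of my own (not from this report): when the zero vector is consumed by an integer-domain instruction in the same function, the zero idiom is expected to come out as pxor rather than xorps.

#include <emmintrin.h>

// The zero register feeds an integer-domain unpack (byte interleave), so the
// domain-fixing pass should materialize it with pxor xmm, xmm instead of
// xorps xmm, xmm.
__m128i zext_lo_bytes(__m128i a) {
    return _mm_unpacklo_epi8(a, _mm_setzero_si128());
}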
> Okay... so that function requests a zero value in an SSE register. Oh. I screwed up, I thought that `__m64` was always going to be an MMX register. What is happening in this case? Is the calling convention requiring it to be turned into an SSE register?
(In reply to Gonzalo BG from comment #5)
> > Okay... so that function requests a zero value in an SSE register.
>
> Oh. I screwed up, I thought that `__m64` was always going to be an MMX
> register. What is happening in this case? Is the calling convention
> requiring it to be turned into an SSE register?

Yes, exactly; the x86-64 ABI rules say that __m64 is passed/returned in SSE registers.
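A quick way to see this (a sketch of my own, not part of the original discussion): a function that merely passes a __m64 through shows the value living in an XMM register on x86-64, per the ABI.

#include <mmintrin.h>

// Under the x86-64 ABI, the __m64 argument arrives in xmm0 and the return
// value goes back out in xmm0, so no MMX register needs to be touched here.
__m64 pass_through(__m64 v) {
    return v;
}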
Thanks @Eli, you just saved me a lot of time!
I'm still not sure we'll generate a "pxor mm, mm" if it's being passed to an MMX instruction and not being passed in or out of a function.
@Craig https://godbolt.org/g/c6xdDj

Here:

#include <xmmintrin.h>

auto foo() {
    __m64 a{64};
    return _mm_cvtpu8_ps(a);
}

_mm_cvtpu8_ps is implemented on top of _mm_setzero_si64, and when the __m64 stays within the function body it generates the following code, which uses pxor:

foo():                                  # @foo()
        mov     eax, 255
        movd    xmm0, eax
        mov     eax, 64
        movd    xmm1, eax
        pand    xmm1, xmm0
        movq    qword ptr [rsp - 24], xmm1
        pxor    xmm0, xmm0
        movq    qword ptr [rsp - 16], xmm0
        movq    mm0, qword ptr [rsp - 24]
        punpcklbw mm0, qword ptr [rsp - 16] # mm0 = mm0[0],mem[0],mm0[1],mem[1],mm0[2],mem[2],mm0[3],mem[3]
        movq    qword ptr [rsp - 8], xmm0
        movq    mm1, qword ptr [rsp - 8]
        pcmpgtw mm1, mm0
        movq    mm2, mm0
        punpckhwd mm2, mm1              # mm2 = mm2[2],mm1[2],mm2[3],mm1[3]
        cvtpi2ps xmm0, mm2
        movlhps xmm0, xmm0              # xmm0 = xmm0[0,0]
        punpcklwd mm0, mm1              # mm0 = mm0[0],mm1[0],mm0[1],mm1[1]
        cvtpi2ps xmm0, mm0
        ret

However, if I move __m64 a to a function argument, then the disassembly uses xorps because of what Eli Friedman mentioned above: the __m64 function argument is required to be in an SSE register by the calling convention, and moving it to an MMX register and then doing pxor is slower than just doing a xorps.
But it generated a pxor to an xmm register not an mmx register. And then it spilled to stack and reloaded into an mmx register. So it’s still not ideal or to spec.
(In reply to Craig Topper from comment #10)
> But it generated a pxor to an xmm register not an mmx register. And then it
> spilled to stack and reloaded into an mmx register. So it's still not ideal
> or to spec.

https://reviews.llvm.org/D41908 should help with this
(In reply to Simon Pilgrim from comment #11)
> (In reply to Craig Topper from comment #10)
> > But it generated a pxor to an xmm register not an mmx register. And then it
> > spilled to stack and reloaded into an mmx register. So it's still not ideal
> > or to spec.
>
> https://reviews.llvm.org/D41908 should help with this

Committed at rL322525
Thanks!
A similar issue exists with _mm256_xor_si256. This code (see it live: https://godbolt.org/g/jJHBMi):

#include <immintrin.h>

__attribute__((__always_inline__, __nodebug__, __target__("avx2")))
__m256i foo(__m256i a, __m256i b) {
    auto c = _mm256_add_epi64(a, b);
    return _mm256_xor_si256(a, c);
}

__attribute__((__always_inline__, __nodebug__, __target__("avx2")))
__m256i bar(__m256i a, __m256i b) {
    return _mm256_xor_si256(a, b);
}

generates this assembly:

foo(long long __vector(4), long long __vector(4)): # @foo(long long __vector(4), long long __vector(4))
        push    rbp
        mov     rbp, rsp
        and     rsp, -32
        sub     rsp, 32
        vmovdqa ymm0, ymmword ptr [rbp + 16]
        vpaddq  ymm1, ymm0, ymmword ptr [rbp + 48]
        vpxor   ymm0, ymm1, ymm0
        mov     rsp, rbp
        pop     rbp
        ret
bar(long long __vector(4), long long __vector(4)): # @bar(long long __vector(4), long long __vector(4))
        push    rbp
        mov     rbp, rsp
        and     rsp, -32
        sub     rsp, 32
        vmovaps ymm0, ymmword ptr [rbp + 48]
        vxorps  ymm0, ymm0, ymmword ptr [rbp + 16]
        mov     rsp, rbp
        pop     rbp
        ret

So it looks to me like LLVM/clang choose vpxor if the operation happens in the integer domain and vxorps otherwise. However, I still wonder why vxorps is preferred by default on integer vectors instead of vpxor. Shouldn't it be the other way around? Modern CPUs have more ALUs for vector integer operations than for floating point (~3 vs 1). Why is `vxorps` preferred over `vpxor` for integer vectors when no other integer operations happen in the function?
This is somewhat of a historical artifact of the fact that SSE xorps has a shorter encoding than SSE2 pxor. Someone needs to finish driving this long-stale patch through: https://reviews.llvm.org/D7401
I've pinged chandler on IRC but maybe someone with phabricator access could ping him again on the review?
(In reply to Gonzalo BG from comment #16)
> I've pinged chandler on IRC but maybe someone with phabricator access could
> ping him again on the review?

Please raise a new bug for this if it's a problem - this bug was about MMX (and was closed as invalid....), although we did end up doing some improvements on the back of it. The x86 domain code has seen a lot of work since Feb 2015, so D7401 is going to need a considerable rewrite.
I've filed https://bugs.llvm.org/show_bug.cgi?id=36127