Created attachment 13854 [details] C++ source file that demonstrates the issue NOTE: This was originally posted on Stack Overflow[1]. After getting some comcurrence that this is likely a clang/LLVM bug, I posted it here. I'm encountering what appears to be a bug causing incorrect code generation with clang 3.4, 3.5, and 3.6. The source that actually triggered the problem is quite complicated, but I've been able to reduce it to a self-contained example that is attached to this report. A summary of the code: I have a simple type called `simd_pack` that contains one member, an array of one `__m256i` value. In my application, there are operators and functions that take these types, but the problem can be illustrated by the above example. Specifically, `test_broken()` should read from the `in1` array and then just copy its value over to the `out` array. Therefore, the call to `memcmp()` in `main()` should return zero. I compile the above using the following: clang++-3.6 bug_test.cc -o bug_test -mavx -O3 I find that on optimization levels `-O0` and `-O1`, the test passes, while on levels `-O2` and `-O3`, the test fails. I've tried compiling the same file with gcc 4.4, 4.6, 4.7, and 4.8, as well as Intel C++ 13.0, and the test passes on all optimization levels. Taking a closer look at the generated code, here's the assembly generated on optimization level `-O3`: 0000000000400a40 <test_broken(signed char*, signed char*, unsigned long)>: 400a40: 55 push %rbp 400a41: 48 89 e5 mov %rsp,%rbp 400a44: 48 81 e4 e0 ff ff ff and $0xffffffffffffffe0,%rsp 400a4b: 48 83 ec 40 sub $0x40,%rsp 400a4f: 48 83 fa 20 cmp $0x20,%rdx 400a53: 72 2f jb 400a84 <test_broken(signed char*, signed char*, unsigned long)+0x44> 400a55: 31 c0 xor %eax,%eax 400a57: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1) 400a5e: 00 00 400a60: c5 fc 10 04 06 vmovups (%rsi,%rax,1),%ymm0 400a65: c5 f8 29 04 24 vmovaps %xmm0,(%rsp) 400a6a: c5 fc 28 04 24 vmovaps (%rsp),%ymm0 400a6f: c5 fc 11 04 07 vmovups %ymm0,(%rdi,%rax,1) 400a74: 48 8d 48 20 lea 0x20(%rax),%rcx 400a78: 48 83 c0 3f add $0x3f,%rax 400a7c: 48 39 d0 cmp %rdx,%rax 400a7f: 48 89 c8 mov %rcx,%rax 400a82: 72 dc jb 400a60 <test_broken(signed char*, signed char*, unsigned long)+0x20> 400a84: 48 89 ec mov %rbp,%rsp 400a87: 5d pop %rbp 400a88: c5 f8 77 vzeroupper 400a8b: c3 retq 400a8c: 0f 1f 40 00 nopl 0x0(%rax) I'll reproduce the key part for emphasis: 400a60: c5 fc 10 04 06 vmovups (%rsi,%rax,1),%ymm0 400a65: c5 f8 29 04 24 vmovaps %xmm0,(%rsp) 400a6a: c5 fc 28 04 24 vmovaps (%rsp),%ymm0 400a6f: c5 fc 11 04 07 vmovups %ymm0,(%rdi,%rax,1) The generated code is strange. It first loads 256 bits into `ymm0` using the unaligned move that I asked for, then it stores `xmm0` (which only contains the lower 128 bits of the data that was read) to the stack, then immediately reads 256 bits into `ymm0` from the stack location that was just written to. The effect is that `ymm0`'s upper 128 bits (which get written to the output buffer) are garbage, causing the test to fail. Are there any particular optimization steps that could be disabled to work around this issue, or a different way to express my intent in code that might not trigger it? I apologize for the lack of a reduced bitcode test case as explained here[2], but I'm not familiar enough with the toolchain to drive the tools properly. [1]: http://stackoverflow.com/questions/28462707/is-this-incorrect-code-generation-with-arrays-of-m256-values-a-clang-bug [2]: http://llvm.org/docs/HowToSubmitABug.html
I forgot to note in my original message, but if it's relevant, I have observed this on Ubuntu Linux 14.04 and Mac OS. On Linux, I've tried LLVM 3.4-3.6 from the binary packages on llvm.org, and on Mac OS, I've used Apple's clang 3.5-based build that is distributed with XCode.
#include <immintrin.h> #include <string.h> #include <stdio.h> struct v { __m256 val[1]; }; struct v load(int8_t *p) { struct v pack; pack.val[0] = _mm256_loadu_ps((float *)(p)); return pack; } int main() { int8_t in_buf[32]; for (int i=0; i<32; i++) in_buf[i] = i; struct v my_32_bytes; memset(&my_32_bytes.val[0], 0, 32); my_32_bytes = load(in_buf); int8_t *p = (int8_t *)my_32_bytes.val; for(int i=0; i<32; i++) printf("%d ", p[i]); printf("\n"); return 0; } $ ./clang -mavx -O0 orig.c $ ./a.out 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 104 -80 -73 12 1 0 0 0 -112 33 -23 105 -1 127 0 0
Looks like something terrible is happening in the front-end: define <2 x double> @load(i8* %p) #0 {
Reducing further: #include <immintrin.h> struct v { __m256 val[1]; }; struct v load(float *p) { struct v my_32_bytes; my_32_bytes.val[0] = _mm256_loadu_ps(p); return my_32_bytes; } ------------------------------------------------------------------- It seems necessary for 'val' to be an array to trigger the bug. $ ./clang bogus.c -mavx -S -emit-llvm -o - ... %coerce.dive = getelementptr %struct.v* %retval, i32 0, i32 0 %6 = bitcast [1 x <8 x float>]* %coerce.dive to <2 x double>* %7 = load <2 x double>* %6, align 1 ret <2 x double> %7
Minimized, I hope: #include <immintrin.h> struct v { __m256 val[1]; } vecarray; struct v load() { return vecarray; } This does a copy of the global and returns a <2 x double> instead of a <4 x double>. To trigger the bug, it seems necessary to have: 1. A __m256 member type 2. An array of size 1 3. A struct Probably need some front-end / ABI help here...I've never looked at how this stuff works.
This certainly doesn't help to minimize the case, but I've done some experimentation to see if a workaround is possible by wrapping the `__m256` array in some way. I found that if I change `val` to be a `boost::array<__m256i, 1>`, then the generated code looks correct (at least the disassembly; I'm not fluent in LLVM intermediate code), at optimization levels -O1 through -O3. Surprisingly, -O0 does not work. However, libc++'s `std::array<__m256i, 1>` does not work at any optimization level. I hope this is helpful in some way.
(In reply to comment #6) > I hope this is helpful in some way. Thanks! I think the fact that it appears to work at higher optimization levels is a fluke. It's pretty clear that we have a front-end bug here (there might be optimizer/backend bugs too, but we'll have to see). I try to avoid the front-end as much as possible, but the "coerce.dive" in the IR is a nice clue where the bug is. Unfortunately, I can't get Xcode to stop on a breakpoint...so I'm debugging via printf. :)
Hans - I think this bug should be considered a 3.6 blocker. It's probably a 1-line fix somewhere in the front-end, but I'm not familiar with that code and probably won't have this solved quick enough. Can we raise some awareness of this bug with the cfe folks?
(In reply to comment #8) > Hans - I think this bug should be considered a 3.6 blocker. It's probably a > 1-line fix somewhere in the front-end, but I'm not familiar with that code > and probably won't have this solved quick enough. Can we raise some > awareness of this bug with the cfe folks? I would consider merging a fix for this if we had one and it was straight-forward, but I will not consider it a blocker since according to the report it's not a regression from 3.5.
I finally found my way to and through the x86-64 ABI implementation, and I think this clang patch will solve the bug. We aren't handling structs of arrays correctly: Index: lib/CodeGen/TargetInfo.cpp =================================================================== --- lib/CodeGen/TargetInfo.cpp (revision 228840) +++ lib/CodeGen/TargetInfo.cpp (working copy) @@ -2179,20 +2179,16 @@ return ABIArgInfo::getIndirect(Align); } -/// GetByteVectorType - The ABI specifies that a value should be passed in an -/// full vector XMM/YMM register. Pick an LLVM IR type that will be passed as a -/// vector register. +/// The ABI specifies that a value should be passed in a full vector XMM/YMM +/// register. Pick an LLVM IR type that will be passed as a vector register. llvm::Type *X86_64ABIInfo::GetByteVectorType(QualType Ty) const { + // Wrapper structs/arrays that only contain vectors are passed just like + // vectors, strip them off if present. + if (const Type *InnerTy = isSingleElementStruct(Ty, getContext())) + Ty = QualType(InnerTy, 0); + llvm::Type *IRType = CGT.ConvertType(Ty); - // Wrapper structs that just contain vectors are passed just like vectors, - // strip them off if present. - llvm::StructType *STy = dyn_cast<llvm::StructType>(IRType); - while (STy && STy->getNumElements() == 1) { - IRType = STy->getElementType(0); - STy = dyn_cast<llvm::StructType>(IRType); - } - // If the preferred type is a 16-byte vector, prefer to pass it. if (llvm::VectorType *VT = dyn_cast<llvm::VectorType>(IRType)){ llvm::Type *EltTy = VT->getElementType();
Thanks for digging through to find a patch! I pulled the latest LLVM/clang source from SVN and applied the patch. I can confirm that it fixed the issue in my example code and in the larger application that I took it from. Hopefully this is simple enough to be considered for the 3.6 release. Maybe this is asking too much, but based on your read of the section that you patched, do you see any way that I could massage the example code to work around the issue? I'm wondering if there's a way I can make my code work on previous versions as well. Again, thank you very much for your work on this.
Thanks for testing that out. I was just about to ask if the actual program matched the test file that you provided...it seems odd to have an array of 1. :) The trigger for the bug is a struct that contains an array of size 1 of a 256-bit vector: struct simd_pack { enum { num_vectors = 1 }; __m256i _val[num_vectors]; }; If I've understood the ABI and the clang implementation correctly, if you change num_vectors to anything higher than 1, you shouldn't hit the bug.
In my actual code, `num_vectors` is calculated based on some C++ template parameters to the `simd_pack` type. In many cases, that comes out to be 1, but it also is often greater than 1. Your observation gives me an idea, though; I could try to introduce a template specialization that catches the case where `num_vectors == 1`. It could instead just use a single `__m256` member instead of an array of size 1. I'll have to check to see how feasible that is. Thanks!
(In reply to comment #10) > I finally found my way to and through the x86-64 ABI implementation, and I > think this clang patch will solve the bug. We aren't handling structs of > arrays correctly: > > Index: lib/CodeGen/TargetInfo.cpp > =================================================================== > --- lib/CodeGen/TargetInfo.cpp (revision 228840) > +++ lib/CodeGen/TargetInfo.cpp (working copy) > @@ -2179,20 +2179,16 @@ > return ABIArgInfo::getIndirect(Align); > } > > -/// GetByteVectorType - The ABI specifies that a value should be passed in > an > -/// full vector XMM/YMM register. Pick an LLVM IR type that will be passed > as a > -/// vector register. > +/// The ABI specifies that a value should be passed in a full vector XMM/YMM > +/// register. Pick an LLVM IR type that will be passed as a vector register. > llvm::Type *X86_64ABIInfo::GetByteVectorType(QualType Ty) const { > + // Wrapper structs/arrays that only contain vectors are passed just like > + // vectors, strip them off if present. > + if (const Type *InnerTy = isSingleElementStruct(Ty, getContext())) > + Ty = QualType(InnerTy, 0); > + This certainly looks similar to the code that PPC uses. I recommend you post it for review.
Thanks, Hal. Patch with test case posted here: http://reviews.llvm.org/D7614
The patch was checked into trunk here: http://reviews.llvm.org/rL229408 If it's not too late and a maintainer approves, I think it would be worthwhile to merge into 3.6. Thanks Hal for the quick review. And also for raising further questions about the x86-64 ABI spec and implementation. Possibly more patches to come based on the outcome of this: https://groups.google.com/forum/?hl=en#!topic/x86-64-abi/k9lgQogMZqw
(In reply to comment #16) > The patch was checked into trunk here: > http://reviews.llvm.org/rL229408 > > If it's not too late and a maintainer approves, I think it would be > worthwhile to merge into 3.6. Merged in r229546.
(In reply to comment #17) > Merged in r229546. Thanks!