New user self-registration is disabled due to spam. For an account please email bugs-admin@lists.llvm.org with your e-mail address and full name.

Bug 22563 - Incorrect code generation with arrays of __m256 variables
Summary: Incorrect code generation with arrays of __m256 variables
Status: RESOLVED FIXED
Alias: None
Product: clang
Classification: Unclassified
Component: LLVM Codegen (show other bugs)
Version: 3.6
Hardware: PC Linux
: P normal
Assignee: Unassigned Clang Bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-02-12 06:32 PST by jasonr
Modified: 2015-02-17 16:01 PST (History)
8 users (show)

See Also:
Fixed By Commit(s):


Attachments
C++ source file that demonstrates the issue (1.02 KB, text/x-c++src)
2015-02-12 06:32 PST, jasonr
Details

Note You need to log in before you can comment on or make changes to this bug.
Description jasonr 2015-02-12 06:32:07 PST
Created attachment 13854 [details]
C++ source file that demonstrates the issue

NOTE: This was originally posted on Stack Overflow[1]. After getting some comcurrence that this is likely a clang/LLVM bug, I posted it here.

I'm encountering what appears to be a bug causing incorrect code generation with clang 3.4, 3.5, and 3.6. The source that actually triggered the problem is quite complicated, but I've been able to reduce it to a self-contained example that is attached to this report.

A summary of the code: I have a simple type called `simd_pack` that contains one member, an array of one `__m256i` value. In my application, there are operators and functions that take these types, but the problem can be illustrated by the above example. Specifically, `test_broken()` should read from the `in1` array and then just copy its value over to the `out` array. Therefore, the call to `memcmp()` in `main()` should return zero. I compile the above using the following:

    clang++-3.6 bug_test.cc -o bug_test -mavx -O3

I find that on optimization levels `-O0` and `-O1`, the test passes, while on levels `-O2` and `-O3`, the test fails. I've tried compiling the same file with gcc 4.4, 4.6, 4.7, and 4.8, as well as Intel C++ 13.0, and the test passes on all optimization levels.

Taking a closer look at the generated code, here's the assembly generated on optimization level `-O3`:

    0000000000400a40 <test_broken(signed char*, signed char*, unsigned long)>:
      400a40:       55                      push   %rbp
      400a41:       48 89 e5                mov    %rsp,%rbp
      400a44:       48 81 e4 e0 ff ff ff    and    $0xffffffffffffffe0,%rsp
      400a4b:       48 83 ec 40             sub    $0x40,%rsp
      400a4f:       48 83 fa 20             cmp    $0x20,%rdx
      400a53:       72 2f                   jb     400a84 <test_broken(signed char*, signed char*, unsigned long)+0x44>
      400a55:       31 c0                   xor    %eax,%eax
      400a57:       66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
      400a5e:       00 00 
      400a60:       c5 fc 10 04 06          vmovups (%rsi,%rax,1),%ymm0
      400a65:       c5 f8 29 04 24          vmovaps %xmm0,(%rsp)
      400a6a:       c5 fc 28 04 24          vmovaps (%rsp),%ymm0
      400a6f:       c5 fc 11 04 07          vmovups %ymm0,(%rdi,%rax,1)
      400a74:       48 8d 48 20             lea    0x20(%rax),%rcx
      400a78:       48 83 c0 3f             add    $0x3f,%rax
      400a7c:       48 39 d0                cmp    %rdx,%rax
      400a7f:       48 89 c8                mov    %rcx,%rax
      400a82:       72 dc                   jb     400a60 <test_broken(signed char*, signed char*, unsigned long)+0x20>
      400a84:       48 89 ec                mov    %rbp,%rsp
      400a87:       5d                      pop    %rbp
      400a88:       c5 f8 77                vzeroupper 
      400a8b:       c3                      retq   
      400a8c:       0f 1f 40 00             nopl   0x0(%rax)

I'll reproduce the key part for emphasis:

      400a60:       c5 fc 10 04 06          vmovups (%rsi,%rax,1),%ymm0
      400a65:       c5 f8 29 04 24          vmovaps %xmm0,(%rsp)
      400a6a:       c5 fc 28 04 24          vmovaps (%rsp),%ymm0
      400a6f:       c5 fc 11 04 07          vmovups %ymm0,(%rdi,%rax,1)

The generated code is strange. It first loads 256 bits into `ymm0` using the unaligned move that I asked for, then it stores `xmm0` (which only contains the lower 128 bits of the data that was read) to the stack, then immediately reads 256 bits into `ymm0` from the stack location that was just written to. The effect is that `ymm0`'s upper 128 bits (which get written to the output buffer) are garbage, causing the test to fail.

Are there any particular optimization steps that could be disabled to work around this issue, or a different way to express my intent in code that might not trigger it? I apologize for the lack of a reduced bitcode test case as explained here[2], but I'm not familiar enough with the toolchain to drive the tools properly.

[1]: http://stackoverflow.com/questions/28462707/is-this-incorrect-code-generation-with-arrays-of-m256-values-a-clang-bug
[2]: http://llvm.org/docs/HowToSubmitABug.html
Comment 1 jasonr 2015-02-12 06:34:51 PST
I forgot to note in my original message, but if it's relevant, I have observed this on Ubuntu Linux 14.04 and Mac OS. On Linux, I've tried LLVM 3.4-3.6 from the binary packages on llvm.org, and on Mac OS, I've used Apple's clang 3.5-based build that is distributed with XCode.
Comment 2 Sanjay Patel 2015-02-12 11:13:06 PST
#include <immintrin.h>
#include <string.h>
#include <stdio.h>

struct v
{
    __m256 val[1];
};

struct v load(int8_t *p)
{
    struct v pack;
    pack.val[0] = _mm256_loadu_ps((float *)(p));
    return pack;
}

int main()
{
    int8_t in_buf[32];
    for (int i=0; i<32; i++)
       in_buf[i] = i;

    struct v my_32_bytes;

    memset(&my_32_bytes.val[0], 0, 32);

    my_32_bytes = load(in_buf);

    int8_t *p = (int8_t *)my_32_bytes.val;
    for(int i=0; i<32; i++)
       printf("%d ", p[i]);
    printf("\n");
    return 0;
}

$ ./clang -mavx -O0 orig.c
$ ./a.out 
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 104 -80 -73 12 1 0 0 0 -112 33 -23 105 -1 127 0 0
Comment 3 Sanjay Patel 2015-02-12 11:14:11 PST
Looks like something terrible is happening in the front-end:

define <2 x double> @load(i8* %p) #0 {
Comment 4 Sanjay Patel 2015-02-12 11:40:04 PST
Reducing further:

#include <immintrin.h>

struct v
{
    __m256 val[1];
};

struct v load(float *p)
{
    struct v my_32_bytes;
    my_32_bytes.val[0] = _mm256_loadu_ps(p);
    return my_32_bytes;
}

-------------------------------------------------------------------

It seems necessary for 'val' to be an array to trigger the bug.

$ ./clang bogus.c -mavx  -S -emit-llvm -o -
...
  %coerce.dive = getelementptr %struct.v* %retval, i32 0, i32 0
  %6 = bitcast [1 x <8 x float>]* %coerce.dive to <2 x double>*
  %7 = load <2 x double>* %6, align 1
  ret <2 x double> %7
Comment 5 Sanjay Patel 2015-02-12 11:57:27 PST
Minimized, I hope:

#include <immintrin.h>
struct v {
   __m256 val[1];
} vecarray;

struct v load() {
    return vecarray;
}

This does a copy of the global and returns a <2 x double> instead of a <4 x double>. 

To trigger the bug, it seems necessary to have:

1. A __m256 member type
2. An array of size 1
3. A struct

Probably need some front-end / ABI help here...I've never looked at how this stuff works.
Comment 6 jasonr 2015-02-12 12:21:18 PST
This certainly doesn't help to minimize the case, but I've done some experimentation to see if a workaround is possible by wrapping the `__m256` array in some way. I found that if I change `val` to be a `boost::array<__m256i, 1>`, then the generated code looks correct (at least the disassembly; I'm not fluent in LLVM intermediate code), at optimization levels -O1 through -O3. Surprisingly, -O0 does not work.

However, libc++'s `std::array<__m256i, 1>` does not work at any optimization level.

I hope this is helpful in some way.
Comment 7 Sanjay Patel 2015-02-12 12:38:49 PST
(In reply to comment #6) 
> I hope this is helpful in some way.

Thanks! I think the fact that it appears to work at higher optimization levels is a fluke. It's pretty clear that we have a front-end bug here (there might be optimizer/backend bugs too, but we'll have to see).

I try to avoid the front-end as much as possible, but the "coerce.dive" in the IR is a nice clue where the bug is. Unfortunately, I can't get Xcode to stop on a breakpoint...so I'm debugging via printf. :)
Comment 8 Sanjay Patel 2015-02-12 17:24:31 PST
Hans - I think this bug should be considered a 3.6 blocker. It's probably a 1-line fix somewhere in the front-end, but I'm not familiar with that code and probably won't have this solved quick enough. Can we raise some awareness of this bug with the cfe folks?
Comment 9 Hans Wennborg 2015-02-12 17:34:24 PST
(In reply to comment #8)
> Hans - I think this bug should be considered a 3.6 blocker. It's probably a
> 1-line fix somewhere in the front-end, but I'm not familiar with that code
> and probably won't have this solved quick enough. Can we raise some
> awareness of this bug with the cfe folks?

I would consider merging a fix for this if we had one and it was straight-forward, but I will not consider it a blocker since according to the report it's not a regression from 3.5.
Comment 10 Sanjay Patel 2015-02-12 22:41:30 PST
I finally found my way to and through the x86-64 ABI implementation, and I think this clang patch will solve the bug. We aren't handling structs of arrays correctly:

Index: lib/CodeGen/TargetInfo.cpp
===================================================================
--- lib/CodeGen/TargetInfo.cpp	(revision 228840)
+++ lib/CodeGen/TargetInfo.cpp	(working copy)
@@ -2179,20 +2179,16 @@
   return ABIArgInfo::getIndirect(Align);
 }
 
-/// GetByteVectorType - The ABI specifies that a value should be passed in an
-/// full vector XMM/YMM register.  Pick an LLVM IR type that will be passed as a
-/// vector register.
+/// The ABI specifies that a value should be passed in a full vector XMM/YMM
+/// register. Pick an LLVM IR type that will be passed as a vector register.
 llvm::Type *X86_64ABIInfo::GetByteVectorType(QualType Ty) const {
+  // Wrapper structs/arrays that only contain vectors are passed just like
+  // vectors, strip them off if present.
+  if (const Type *InnerTy = isSingleElementStruct(Ty, getContext()))
+    Ty = QualType(InnerTy, 0);
+
   llvm::Type *IRType = CGT.ConvertType(Ty);
 
-  // Wrapper structs that just contain vectors are passed just like vectors,
-  // strip them off if present.
-  llvm::StructType *STy = dyn_cast<llvm::StructType>(IRType);
-  while (STy && STy->getNumElements() == 1) {
-    IRType = STy->getElementType(0);
-    STy = dyn_cast<llvm::StructType>(IRType);
-  }
-
   // If the preferred type is a 16-byte vector, prefer to pass it.
   if (llvm::VectorType *VT = dyn_cast<llvm::VectorType>(IRType)){
     llvm::Type *EltTy = VT->getElementType();
Comment 11 jasonr 2015-02-13 08:04:20 PST
Thanks for digging through to find a patch! I pulled the latest LLVM/clang source from SVN and applied the patch. I can confirm that it fixed the issue in my example code and in the larger application that I took it from. Hopefully this is simple enough to be considered for the 3.6 release.

Maybe this is asking too much, but based on your read of the section that you patched, do you see any way that I could massage the example code to work around the issue? I'm wondering if there's a way I can make my code work on previous versions as well.

Again, thank you very much for your work on this.
Comment 12 Sanjay Patel 2015-02-13 09:07:48 PST
Thanks for testing that out. I was just about to ask if the actual program matched the test file that you provided...it seems odd to have an array of 1. :)

The trigger for the bug is a struct that contains an array of size 1 of a 256-bit vector:

struct simd_pack
{
    enum { num_vectors = 1 };
    __m256i _val[num_vectors];
};

If I've understood the ABI and the clang implementation correctly, if you change num_vectors to anything higher than 1, you shouldn't hit the bug.
Comment 13 jasonr 2015-02-13 09:11:03 PST
In my actual code, `num_vectors` is calculated based on some C++ template parameters to the `simd_pack` type. In many cases, that comes out to be 1, but it also is often greater than 1. Your observation gives me an idea, though; I could try to introduce a template specialization that catches the case where `num_vectors == 1`. It could instead just use a single `__m256` member instead of an array of size 1. I'll have to check to see how feasible that is.

Thanks!
Comment 14 Hal Finkel 2015-02-13 09:28:42 PST
(In reply to comment #10)
> I finally found my way to and through the x86-64 ABI implementation, and I
> think this clang patch will solve the bug. We aren't handling structs of
> arrays correctly:
> 
> Index: lib/CodeGen/TargetInfo.cpp
> ===================================================================
> --- lib/CodeGen/TargetInfo.cpp	(revision 228840)
> +++ lib/CodeGen/TargetInfo.cpp	(working copy)
> @@ -2179,20 +2179,16 @@
>    return ABIArgInfo::getIndirect(Align);
>  }
>  
> -/// GetByteVectorType - The ABI specifies that a value should be passed in
> an
> -/// full vector XMM/YMM register.  Pick an LLVM IR type that will be passed
> as a
> -/// vector register.
> +/// The ABI specifies that a value should be passed in a full vector XMM/YMM
> +/// register. Pick an LLVM IR type that will be passed as a vector register.
>  llvm::Type *X86_64ABIInfo::GetByteVectorType(QualType Ty) const {
> +  // Wrapper structs/arrays that only contain vectors are passed just like
> +  // vectors, strip them off if present.
> +  if (const Type *InnerTy = isSingleElementStruct(Ty, getContext()))
> +    Ty = QualType(InnerTy, 0);
> +


This certainly looks similar to the code that PPC uses. I recommend you post it for review.
Comment 15 Sanjay Patel 2015-02-13 09:55:53 PST
Thanks, Hal. Patch with test case posted here:
http://reviews.llvm.org/D7614
Comment 16 Sanjay Patel 2015-02-16 12:03:29 PST
The patch was checked into trunk here:
http://reviews.llvm.org/rL229408

If it's not too late and a maintainer approves, I think it would be worthwhile to merge into 3.6.

Thanks Hal for the quick review. And also for raising further questions about the x86-64 ABI spec and implementation. Possibly more patches to come based on the outcome of this:
https://groups.google.com/forum/?hl=en#!topic/x86-64-abi/k9lgQogMZqw
Comment 17 Hans Wennborg 2015-02-17 15:32:29 PST
(In reply to comment #16)
> The patch was checked into trunk here:
> http://reviews.llvm.org/rL229408
> 
> If it's not too late and a maintainer approves, I think it would be
> worthwhile to merge into 3.6.

Merged in r229546.
Comment 18 Sanjay Patel 2015-02-17 16:01:44 PST
(In reply to comment #17)
> Merged in r229546.

Thanks!