Perhaps this is already a known issue, but Clang/LLVM trunk does not vectorize the inner matmult loop unless the "#pragma unroll" is enabled:

void f(int * __restrict__ a, int * __restrict__ b, int * __restrict__ r)
{
  for (int m = 0; m < 64; m++) {
    int c = 0;
    // #pragma unroll
    for (int i = 0; i < 32; i++) {
      c += a[i] * b[m * 32 + i];
    }
    r[m] = c;
  }
}

It looks like the loop unroller fully unrolls the inner loop before the Loop vectorizer sees it, and the SLP vectorizer then fails to vectorize the resulting straight-line code as well as the Loop vectorizer would have vectorized the loop had it not been unrolled.
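To compare the two cases side by side, a sketch along the following lines may help. It puts the kernel from the report into one file twice, once with the pragma and once without, so the vectorizer remarks for both can be read from a single compile. The names f_plain, f_unroll_pragma, count_mismatches and the file name repro.c are made up for illustration and are not part of the original report; the remark flags in the comment are standard Clang options, but what they print will depend on the compiler version and target.

```c
/* Assumed file name: repro.c (hypothetical).
 * One way to see what each vectorizer did, assuming a recent Clang:
 *   clang -O3 -c repro.c -Rpass=loop-vectorize \
 *         -Rpass-missed=loop-vectorize -Rpass=slp-vectorizer
 */
enum { ROWS = 64, COLS = 32 };

/* Kernel from the report, pragma left out. */
void f_plain(const int *restrict a, const int *restrict b, int *restrict r)
{
    for (int m = 0; m < ROWS; m++) {
        int c = 0;
        for (int i = 0; i < COLS; i++) {
            c += a[i] * b[m * COLS + i];
        }
        r[m] = c;
    }
}

/* Same kernel with the pragma switched on. */
void f_unroll_pragma(const int *restrict a, const int *restrict b,
                     int *restrict r)
{
    for (int m = 0; m < ROWS; m++) {
        int c = 0;
#pragma unroll
        for (int i = 0; i < COLS; i++) {
            c += a[i] * b[m * COLS + i];
        }
        r[m] = c;
    }
}

/* Hypothetical sanity check: both variants must compute the same rows.
 * Returns the number of rows that differ (0 means they agree). */
int count_mismatches(void)
{
    static int a[COLS], b[ROWS * COLS], r1[ROWS], r2[ROWS];
    int bad = 0;

    /* Small values keep every dot product far away from int overflow. */
    for (int i = 0; i < COLS; i++)
        a[i] = i - 7;
    for (int j = 0; j < ROWS * COLS; j++)
        b[j] = (j % 13) - 6;

    f_plain(a, b, r1);
    f_unroll_pragma(a, b, r2);

    for (int m = 0; m < ROWS; m++)
        if (r1[m] != r2[m])
            bad++;
    return bad;
}
```

The check only confirms that the pragma does not change the result; whether each variant is vectorized has to be read from the remarks or from the generated assembly.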
Created attachment 19495: D source code that demonstrates the problem in LLVM 5.0.0

Unrolling the loop breaks vectorization in some cases. It depends on the size of the inner loop: for 16 and 32 items the vectorization is broken. If we accumulate the results of the inner loop, the vectorization works fine.
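The attachment itself is D code and is not reproduced here. As a rough C sketch of the same experiment, the inner trip count can be made a compile-time knob and the per-row store can be replaced by a single running total. This is only one possible reading of "accumulate the results of the inner loop"; the macro INNER and the functions dot_rows, dot_total and totals_agree are assumptions introduced for illustration.

```c
/* INNER is an assumed knob for the inner trip count; try for example
 * -DINNER=8, -DINNER=16, -DINNER=32, -DINNER=64 when compiling. */
#ifndef INNER
#define INNER 16
#endif

enum { OUTER = 64 };

/* Per-row variant: each inner reduction is stored separately. */
void dot_rows(const int *restrict a, const int *restrict b, int *restrict r)
{
    for (int m = 0; m < OUTER; m++) {
        int c = 0;
        for (int i = 0; i < INNER; i++) {
            c += a[i] * b[m * INNER + i];
        }
        r[m] = c;
    }
}

/* Accumulating variant (assumed interpretation): one total carried
 * across all outer iterations instead of a store per row. */
int dot_total(const int *restrict a, const int *restrict b)
{
    int total = 0;
    for (int m = 0; m < OUTER; m++) {
        for (int i = 0; i < INNER; i++) {
            total += a[i] * b[m * INNER + i];
        }
    }
    return total;
}

/* Hypothetical check: the sum of the per-row results must equal the
 * accumulated total. Returns 1 when they agree, 0 otherwise. */
int totals_agree(void)
{
    static int a[INNER], b[OUTER * INNER], r[OUTER];
    int sum = 0;

    /* Small inputs so that no intermediate sum overflows int. */
    for (int i = 0; i < INNER; i++)
        a[i] = (i % 5) - 2;
    for (int j = 0; j < OUTER * INNER; j++)
        b[j] = (j % 7) - 3;

    dot_rows(a, b, r);
    for (int m = 0; m < OUTER; m++)
        sum += r[m];

    return sum == dot_total(a, b);
}
```

Compiling this file once per INNER value and comparing the generated code for dot_rows and dot_total is one way to check which sizes lose vectorization on a given compiler version.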