The docs of the llvm.experimental.vector.reduce.{fadd,fmul} intrinsics state:

> If the intrinsic call has fast-math flags, then the reduction will not preserve the associativity of an equivalent scalarized counterpart. If it does not have fast-math flags, then the reduction will be ordered, implying that the operation respects the associativity of a scalarized reduction.
>
> Arguments:
> The first argument to this intrinsic is a scalar accumulator value, which is only used when there are no fast-math flags attached. This argument may be undef when fast-math flags are used.

While undef + fast-math flags works fine, I haven't been able to get the non-fast-math + accumulator version to work. They fail to select. I'm using LLVM 6 with assertions enabled.

For example:

declare float @llvm.experimental.vector.reduce.fadd.f32.f32.v4f32(float, <4 x float>)

define internal float @_ZN32simd_intrinsic_generic_reduction3foo17ha7e2b586cf5567bdE(<4 x float>* noalias nocapture dereferenceable(16)) unnamed_addr #0 {
  %2 = alloca float, align 4
  %3 = load <4 x float>, <4 x float>* %0, align 16
  %4 = call float @llvm.experimental.vector.reduce.fadd.f32.f32.v4f32(float 1.000000e+00, <4 x float> %3)
  store float %4, float* %2, align 4
  %5 = load float, float* %2, align 4
  br label %6

; <label>:6:                                      ; preds = %1
  ret float %5
}

produces

LLVM ERROR: Cannot select: 0x3f478b0: f32 = vecreduce_strict_fadd 0x3f479e8, 0x3f477e0
  0x3f479e8: f32,ch = load<LD4[ConstantPool]> 0x3ea4c28, 0x3f47b88, undef:i64
    0x3f47b88: i64 = X86ISD::Wrapper TargetConstantPool:i64<float 1.000000e+00> 0
      0x3f47848: i64 = TargetConstantPool<float 1.000000e+00> 0
    0x3f47778: i64 = undef
  0x3f477e0: v4f32,ch = load<LD16[%0](dereferenceable)> 0x3ea4c28, 0x3f476a8, undef:i64
    0x3f476a8: i64,ch = CopyFromReg 0x3ea4c28, Register:i64 %1
      0x3f47640: i64 = Register %1
    0x3f47778: i64 = undef
In function: _ZN32simd_intrinsic_generic_reduction3foo17ha7e2b586cf5567bdE
Compiler returned: 1

and

declare float @llvm.experimental.vector.reduce.fmul.f32.f32.v4f32(float, <4 x float>)

define internal float @_ZN32simd_intrinsic_generic_reduction3foo17ha7e2b586cf5567bdE(<4 x float>* noalias nocapture dereferenceable(16)) unnamed_addr #0 {
  %2 = alloca float, align 4
  %3 = load <4 x float>, <4 x float>* %0, align 16
  %4 = call float @llvm.experimental.vector.reduce.fmul.f32.f32.v4f32(float 1.000000e+00, <4 x float> %3)
  store float %4, float* %2, align 4
  %5 = load float, float* %2, align 4
  br label %6

; <label>:6:                                      ; preds = %1
  ret float %5
}

produces this

LLVM ERROR: Cannot select: 0x4da88a0: f32 = vecreduce_strict_fmul 0x4da89d8, 0x4da87d0
  0x4da89d8: f32,ch = load<LD4[ConstantPool]> 0x4d05c28, 0x4da8b78, undef:i64
    0x4da8b78: i64 = X86ISD::Wrapper TargetConstantPool:i64<float 1.000000e+00> 0
      0x4da8838: i64 = TargetConstantPool<float 1.000000e+00> 0
    0x4da8768: i64 = undef
  0x4da87d0: v4f32,ch = load<LD16[%0](dereferenceable)> 0x4d05c28, 0x4da8698, undef:i64
    0x4da8698: i64,ch = CopyFromReg 0x4d05c28, Register:i64 %1
      0x4da8630: i64 = Register %1
    0x4da8768: i64 = undef
In function: _ZN32simd_intrinsic_generic_reduction3foo17ha7e2b586cf5567bdE
Compiler returned: 1
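
To make the distinction in the quoted docs concrete, here is a rough Python model of the two behaviours (not LLVM code). The names `ordered_fadd`, `pairwise_fadd` and `lanes` are hypothetical, and the pairwise tree is only one possible reassociation a fast-math lowering might pick. Python floats are f64 rather than f32, but the effect is the same: the ordered form folds the accumulator through the lanes left to right, while a reassociated form can give a different result.

```python
def ordered_fadd(acc, vec):
    """Model of the ordered (no fast-math) reduction:
    (((acc + v0) + v1) + v2) + v3, strictly left to right."""
    for lane in vec:
        acc = acc + lane
    return acc


def pairwise_fadd(vec):
    """Model of ONE possible reassociated reduction: a pairwise tree.
    The accumulator is ignored, as the docs describe for fast-math calls.
    Assumes the lane count is a power of two (true for <4 x float>)."""
    vals = list(vec)
    assert len(vals) > 0 and len(vals) & (len(vals) - 1) == 0
    while len(vals) > 1:
        # Add neighbouring lanes: (v0 + v1), (v2 + v3), ...
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0]


# Lanes chosen so that rounding makes the association order visible.
lanes = [1e17, 1.0, -1e17, 1.0]

ordered = ordered_fadd(0.0, lanes)   # left-to-right, with accumulator
reassociated = pairwise_fadd(lanes)  # tree-shaped, no accumulator
print(ordered, reassociated)
```

With these lanes the two functions disagree, which is why the non-fast-math form cannot simply be lowered the same way as the fast-math one.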
FWIW I haven't been able to find a test for the combination "non-fast-math" + "non-undef accumulator" in the existing tests:
https://github.com/llvm-mirror/llvm/blob/4604874612fa292ab4c49f96aedefdf8be1ff27e/test/CodeGen/Generic/expand-experimental-reductions.ll
and there is a TODO here:
https://github.com/llvm-mirror/llvm/blob/4604874612fa292ab4c49f96aedefdf8be1ff27e/lib/CodeGen/ExpandReductions.cpp#L96
so maybe this is a feature request rather than a bug report. In any case, the docs should say what works and what doesn't. The current wording seems to suggest that a call with fast-math flags should only be used when passing undef, and that otherwise it does nothing.
This Rust bug now depends on this being fixed, or on a workaround for it being added: https://github.com/rust-lang-nursery/stdsimd/issues/409
https://reviews.llvm.org/D45366
rL329585