
Bug 36732 - llvm.experimental.vector.reduce.{fadd,fmul} select fails when passed a non-undef accumulator
Status: RESOLVED FIXED
Alias: None
Product: new-bugs
Classification: Unclassified
Component: new bugs
Version: unspecified
Hardware: PC All
Importance: P enhancement
Assignee: Simon Pilgrim
Reported: 2018-03-14 08:13 PDT by Gonzalo BG
Modified: 2018-04-09 09:03 PDT
Fixed By Commit(s): 329585


Description Gonzalo BG 2018-03-14 08:13:20 PDT
The docs of the llvm.experimental.vector.reduce.{fadd,fmul} intrinsics state:

> If the intrinsic call has fast-math flags, then the reduction will not preserve the associativity of an equivalent scalarized counterpart. If it does not have fast-math flags, then the reduction will be ordered, implying that the operation respects the associativity of a scalarized reduction.
> 
> Arguments:
> The first argument to this intrinsic is a scalar accumulator value, which is only used when there are no fast-math flags attached. This argument may be undef when fast-math flags are used.
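
To make the quoted semantics concrete, here is a rough Python model of the two behaviours. It is only a sketch, not LLVM code: the names `reduce_ordered` and `reduce_fast` are made up for illustration, Python floats are doubles rather than `float`, and the pairwise order in `reduce_fast` is just one order a fast-math reduction might pick (the docs promise no particular order). The vector length is assumed to be a power of two.

```python
import operator


def reduce_ordered(op, acc, vec):
    """Model of the non-fast-math form: fold left to right from the accumulator.

    Computes (((acc op v0) op v1) op v2) ..., i.e. the same association
    as a scalarized loop.
    """
    result = acc
    for x in vec:
        result = op(result, x)
    return result


def reduce_fast(op, vec):
    """Model of one possible fast-math form: pairwise (tree) reduction.

    The accumulator is not an input here, mirroring the documented rule
    that it is only used when no fast-math flags are attached.
    Assumes len(vec) is a power of two.
    """
    vals = list(vec)
    while len(vals) > 1:
        half = len(vals) // 2
        # Combine the lower half with the upper half, element by element.
        vals = [op(vals[i], vals[i + half]) for i in range(half)]
    return vals[0]
```

With `operator.add` and the input `[1e16, 1.0, -1e16, 1.0]`, the ordered form starting from `0.0` loses both small terms' partner and returns `1.0`, while the pairwise form cancels the large terms first and returns `2.0`. That difference is what "respects the associativity of a scalarized reduction" is guarding against.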

While undef + fast-math flags works fine, I haven't been able to get the non-fast-math + accumulator version to work: both intrinsics fail during instruction selection. I'm using LLVM 6 with assertions enabled.

For example:

declare float @llvm.experimental.vector.reduce.fadd.f32.f32.v4f32(float, <4 x float>)
define internal float @_ZN32simd_intrinsic_generic_reduction3foo17ha7e2b586cf5567bdE(<4 x float>* noalias nocapture dereferenceable(16)) unnamed_addr #0 {
  %2 = alloca float, align 4
  %3 = load <4 x float>, <4 x float>* %0, align 16
  %4 = call float @llvm.experimental.vector.reduce.fadd.f32.f32.v4f32(float 1.000000e+00, <4 x float> %3)
  store float %4, float* %2, align 4
  %5 = load float, float* %2, align 4
  br label %6

; <label>:6:                                      ; preds = %1
  ret float %5
}

produces:

LLVM ERROR: Cannot select: 0x3f478b0: f32 = vecreduce_strict_fadd 0x3f479e8, 0x3f477e0
  0x3f479e8: f32,ch = load<LD4[ConstantPool]> 0x3ea4c28, 0x3f47b88, undef:i64
    0x3f47b88: i64 = X86ISD::Wrapper TargetConstantPool:i64<float 1.000000e+00> 0
      0x3f47848: i64 = TargetConstantPool<float 1.000000e+00> 0
    0x3f47778: i64 = undef
  0x3f477e0: v4f32,ch = load<LD16[%0](dereferenceable)> 0x3ea4c28, 0x3f476a8, undef:i64
    0x3f476a8: i64,ch = CopyFromReg 0x3ea4c28, Register:i64 %1
      0x3f47640: i64 = Register %1
    0x3f47778: i64 = undef
In function: _ZN32simd_intrinsic_generic_reduction3foo17ha7e2b586cf5567bdE
Compiler returned: 1

and

declare float @llvm.experimental.vector.reduce.fmul.f32.f32.v4f32(float, <4 x float>)
define internal float @_ZN32simd_intrinsic_generic_reduction3foo17ha7e2b586cf5567bdE(<4 x float>* noalias nocapture dereferenceable(16)) unnamed_addr #0 {
  %2 = alloca float, align 4
  %3 = load <4 x float>, <4 x float>* %0, align 16
  %4 = call float @llvm.experimental.vector.reduce.fmul.f32.f32.v4f32(float 1.000000e+00, <4 x float> %3)
  store float %4, float* %2, align 4
  %5 = load float, float* %2, align 4
  br label %6

; <label>:6:                                      ; preds = %1
  ret float %5
}

produces:

LLVM ERROR: Cannot select: 0x4da88a0: f32 = vecreduce_strict_fmul 0x4da89d8, 0x4da87d0
  0x4da89d8: f32,ch = load<LD4[ConstantPool]> 0x4d05c28, 0x4da8b78, undef:i64
    0x4da8b78: i64 = X86ISD::Wrapper TargetConstantPool:i64<float 1.000000e+00> 0
      0x4da8838: i64 = TargetConstantPool<float 1.000000e+00> 0
    0x4da8768: i64 = undef
  0x4da87d0: v4f32,ch = load<LD16[%0](dereferenceable)> 0x4d05c28, 0x4da8698, undef:i64
    0x4da8698: i64,ch = CopyFromReg 0x4d05c28, Register:i64 %1
      0x4da8630: i64 = Register %1
    0x4da8768: i64 = undef
In function: _ZN32simd_intrinsic_generic_reduction3foo17ha7e2b586cf5567bdE
Compiler returned: 1
Comment 1 Gonzalo BG 2018-03-14 08:21:47 PDT
FWIW I haven't been able to find a test for the combination "non-fast-math" + "non-undef-accumulator" in either of these files:

https://github.com/llvm-mirror/llvm/blob/4604874612fa292ab4c49f96aedefdf8be1ff27e/test/CodeGen/Generic/expand-experimental-reductions.ll

and there is a TODO here: https://github.com/llvm-mirror/llvm/blob/4604874612fa292ab4c49f96aedefdf8be1ff27e/lib/CodeGen/ExpandReductions.cpp#L96

So maybe this is a feature request rather than a bug report. In any case, the docs should say what works and what doesn't.

The current wording seems to suggest that a call with fast-math flags should only be used when passing undef, and that otherwise it does nothing.
Comment 2 Gonzalo BG 2018-04-03 03:26:24 PDT
This Rust bug now depends on this being fixed or on a workaround being added: https://github.com/rust-lang-nursery/stdsimd/issues/409
Comment 3 Simon Pilgrim 2018-04-06 06:53:09 PDT
https://reviews.llvm.org/D45366
Comment 4 Simon Pilgrim 2018-04-09 09:03:07 PDT
rL329585