LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filled before November 26, 2021. Use github to submit LLVM bugs

Bug 40206 - After r329339, loop unrolling with long doubles seems to be incorrect
Summary: After r329339, loop unrolling with long doubles seems to be incorrect
Status: RESOLVED FIXED
Alias: None
Product: libraries
Classification: Unclassified
Component: Backend: X86 (show other bugs)
Version: 7.0
Hardware: PC All
: P normal
Assignee: Unassigned LLVM Bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-01-02 11:15 PST by Dimitry Andric
Modified: 2019-02-08 15:31 PST (History)
8 users (show)

See Also:
Fixed By Commit(s): r353489


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Dimitry Andric 2019-01-02 11:15:24 PST
As reported in https://bugs.freebsd.org/234040, after the import of clang 7.0 a few math library tests started failing.

Bisection over llvm trunk showed that these failures were introduced via https://reviews.llvm.org/rL329339 ("[X86] Remove some InstRWs for plain store instructions on Sandy Bridge. We were forcing the latency of these instructions to 5 cycles, but every other scheduler model had them as 1 cycle. I'm sure I didn't get everything, but this gets a big portion.")

Test case, minimized from the loop here: https://github.com/freebsd/freebsd/blob/master/lib/msun/ld80/e_rem_pio2l.h#L131

// clang -O2 rempio-min.c -o rempio-min

long double __attribute__((noinline)) rem_pio2l_min(long double z)
{
  int i;
  double tx[2];

  for (i = 0; i < 2; ++i) {
    tx[i] = (double)((int)(z));
    z = (z - tx[i]) * 1.6777216e+07;
  }

  return z;
}

int main(void)
{
  const long double test1 = 0x1.b2f3ee96e7600326p+23L;
  const long double check1 = 0x1.93p+16;
  long double res;

  res = rem_pio2l_min(test1);

  return res == check1 ? 0 : 1;
}

Side-by-side diff of clang r329338 (left) and r329339 (right) assembly output, hoping that bugzilla won't mess it up too badly:

rem_pio2l_min:                            rem_pio2l_min:
.cfi_startproc                            .cfi_startproc
pushq   %rbp                              pushq   %rbp
.cfi_def_cfa_offset 16                    .cfi_def_cfa_offset 16
.cfi_offset %rbp, -16                     .cfi_offset %rbp, -16
movq    %rsp, %rbp                        movq    %rsp, %rbp
.cfi_def_cfa_register %rbp                .cfi_def_cfa_register %rbp
fnstcw  -4(%rbp)                          fnstcw  -4(%rbp)
fldt    16(%rbp)                <
movzwl  -4(%rbp), %eax                    movzwl  -4(%rbp), %eax
movw    $3199, -4(%rbp)                   movw    $3199, -4(%rbp)
fldcw   -4(%rbp)                          fldcw   -4(%rbp)
                                >         fldt    16(%rbp)
movw    %ax, -4(%rbp)                     movw    %ax, -4(%rbp)
fistl   -8(%rbp)                          fistl   -8(%rbp)
fldcw   -4(%rbp)                          fldcw   -4(%rbp)
cvtsi2sdl       -8(%rbp), %xmm0           cvtsi2sdl       -8(%rbp), %xmm0
movsd   %xmm0, -32(%rbp)                  movsd   %xmm0, -32(%rbp)
fsubl   -32(%rbp)                         fsubl   -32(%rbp)
flds    .LCPI0_0(%rip)          <
fnstcw  -2(%rbp)                          fnstcw  -2(%rbp)
fmul    %st(0), %st(1)          |         flds    .LCPI0_0(%rip)
movzwl  -2(%rbp), %eax                    movzwl  -2(%rbp), %eax
movw    $3199, -2(%rbp)                   movw    $3199, -2(%rbp)
fldcw   -2(%rbp)                          fldcw   -2(%rbp)
                                >         fmul    %st(0), %st(1)
movw    %ax, -2(%rbp)                     movw    %ax, -2(%rbp)
fxch    %st(1)                            fxch    %st(1)
fistl   -12(%rbp)                         fistl   -12(%rbp)
fldcw   -2(%rbp)                          fldcw   -2(%rbp)
xorps   %xmm0, %xmm0                      xorps   %xmm0, %xmm0
cvtsi2sdl       -12(%rbp), %xmm0          cvtsi2sdl       -12(%rbp), %xmm0
movsd   %xmm0, -24(%rbp)                  movsd   %xmm0, -24(%rbp)
fsubl   -24(%rbp)                         fsubl   -24(%rbp)
fmulp   %st(1)                            fmulp   %st(1)
popq    %rbp                              popq    %rbp
retq                                      retq
Comment 1 Dimitry Andric 2019-01-02 11:19:43 PST
Btw, there is a similar loop here:

https://github.com/freebsd/freebsd/blob/master/lib/msun/ld80/e_powl.c#L29

which is also unrolled incorrectly, in particular in the instance where it is unrolled 6 times:

https://github.com/freebsd/freebsd/blob/master/lib/msun/ld80/e_powl.c#L482

This is the reason that our powl() tests failed after r329339 too, but this is harder to reduce to a small test case.
Comment 2 Dimitry Andric 2019-01-02 14:26:42 PST
An even simpler representation, which gives the same effect:

long double __attribute__((noinline)) rem_pio2l_min(long double z)
{
  z = (z - (int)z) * 1.6777216e+07;
  z = (z - (int)z) * 1.6777216e+07;

  return z;
}
Comment 3 Patrick Wildt 2019-01-30 15:54:21 PST
We are seeing about the same issue on OpenBSD in our math library. Bug #40529 also sounds a bit similar, but I'm not sure if that is related as well.
Comment 4 Dimitry Andric 2019-01-31 04:01:24 PST
(In reply to Patrick Wildt from comment #3)
> We are seeing about the same issue on OpenBSD in our math library. Bug
> #40529 also sounds a bit similar, but I'm not sure if that is related as
> well.

I tried applying the change suggested there, but it leads to lots of 'Bad machine code: Using an undefined physical register' errors when compiling lib/msun.  It does seem to help for the specific, minimized test case I have posted in this bug.
Comment 5 Umesh Kalappa 2019-01-31 23:44:07 PST
Dimitry , with clang -O2 -mllvm -enable-misched=0  rempio-min.c -o rempio-min ,the testcase works and x86 scheduler has the issue and similar to #40529 (but those changes has no effect on your case),so we are revisiting the fix .
Comment 6 Craig Topper 2019-02-08 15:31:09 PST
Fixed in r353489