As reported in https://bugs.freebsd.org/234040, after the import of clang 7.0 a few math library tests started failing. Bisection over llvm trunk showed that these failures were introduced via https://reviews.llvm.org/rL329339 ("[X86] Remove some InstRWs for plain store instructions on Sandy Bridge. We were forcing the latency of these instructions to 5 cycles, but every other scheduler model had them as 1 cycle. I'm sure I didn't get everything, but this gets a big portion.") Test case, minimized from the loop here: https://github.com/freebsd/freebsd/blob/master/lib/msun/ld80/e_rem_pio2l.h#L131 // clang -O2 rempio-min.c -o rempio-min long double __attribute__((noinline)) rem_pio2l_min(long double z) { int i; double tx[2]; for (i = 0; i < 2; ++i) { tx[i] = (double)((int)(z)); z = (z - tx[i]) * 1.6777216e+07; } return z; } int main(void) { const long double test1 = 0x1.b2f3ee96e7600326p+23L; const long double check1 = 0x1.93p+16; long double res; res = rem_pio2l_min(test1); return res == check1 ? 0 : 1; } Side-by-side diff of clang r329338 (left) and r329339 (right) assembly output, hoping that bugzilla won't mess it up too badly: rem_pio2l_min: rem_pio2l_min: .cfi_startproc .cfi_startproc pushq %rbp pushq %rbp .cfi_def_cfa_offset 16 .cfi_def_cfa_offset 16 .cfi_offset %rbp, -16 .cfi_offset %rbp, -16 movq %rsp, %rbp movq %rsp, %rbp .cfi_def_cfa_register %rbp .cfi_def_cfa_register %rbp fnstcw -4(%rbp) fnstcw -4(%rbp) fldt 16(%rbp) < movzwl -4(%rbp), %eax movzwl -4(%rbp), %eax movw $3199, -4(%rbp) movw $3199, -4(%rbp) fldcw -4(%rbp) fldcw -4(%rbp) > fldt 16(%rbp) movw %ax, -4(%rbp) movw %ax, -4(%rbp) fistl -8(%rbp) fistl -8(%rbp) fldcw -4(%rbp) fldcw -4(%rbp) cvtsi2sdl -8(%rbp), %xmm0 cvtsi2sdl -8(%rbp), %xmm0 movsd %xmm0, -32(%rbp) movsd %xmm0, -32(%rbp) fsubl -32(%rbp) fsubl -32(%rbp) flds .LCPI0_0(%rip) < fnstcw -2(%rbp) fnstcw -2(%rbp) fmul %st(0), %st(1) | flds .LCPI0_0(%rip) movzwl -2(%rbp), %eax movzwl -2(%rbp), %eax movw $3199, -2(%rbp) movw $3199, -2(%rbp) fldcw -2(%rbp) fldcw -2(%rbp) > fmul %st(0), %st(1) movw %ax, -2(%rbp) movw %ax, -2(%rbp) fxch %st(1) fxch %st(1) fistl -12(%rbp) fistl -12(%rbp) fldcw -2(%rbp) fldcw -2(%rbp) xorps %xmm0, %xmm0 xorps %xmm0, %xmm0 cvtsi2sdl -12(%rbp), %xmm0 cvtsi2sdl -12(%rbp), %xmm0 movsd %xmm0, -24(%rbp) movsd %xmm0, -24(%rbp) fsubl -24(%rbp) fsubl -24(%rbp) fmulp %st(1) fmulp %st(1) popq %rbp popq %rbp retq retq
Btw, there is a similar loop here: https://github.com/freebsd/freebsd/blob/master/lib/msun/ld80/e_powl.c#L29 which is also unrolled incorrectly, in particular in the instance where it is unrolled 6 times: https://github.com/freebsd/freebsd/blob/master/lib/msun/ld80/e_powl.c#L482 This is the reason that our powl() tests failed after r329339 too, but this is harder to reduce to a small test case.
An even simpler representation, which gives the same effect: long double __attribute__((noinline)) rem_pio2l_min(long double z) { z = (z - (int)z) * 1.6777216e+07; z = (z - (int)z) * 1.6777216e+07; return z; }
We are seeing about the same issue on OpenBSD in our math library. Bug #40529 also sounds a bit similar, but I'm not sure if that is related as well.
(In reply to Patrick Wildt from comment #3) > We are seeing about the same issue on OpenBSD in our math library. Bug > #40529 also sounds a bit similar, but I'm not sure if that is related as > well. I tried applying the change suggested there, but it leads to lots of 'Bad machine code: Using an undefined physical register' errors when compiling lib/msun. It does seem to help for the specific, minimized test case I have posted in this bug.
Dimitry , with clang -O2 -mllvm -enable-misched=0 rempio-min.c -o rempio-min ,the testcase works and x86 scheduler has the issue and similar to #40529 (but those changes has no effect on your case),so we are revisiting the fix .
Fixed in r353489