
Bug 40483 - [X86] Failure to merge ISD::SUB(x,y) and X86ISD::SUB(x,y)
Summary: [X86] Failure to merge ISD::SUB(x,y) and X86ISD::SUB(x,y)
Status: RESOLVED FIXED
Alias: None
Product: libraries
Classification: Unclassified
Component: Backend: X86
Version: trunk
Hardware: PC Windows NT
Importance: P enhancement
Assignee: Unassigned LLVM Bugs
URL:
Keywords:
Depends on:
Blocks: release-9.0.0
Reported: 2019-01-26 13:38 PST by Simon Pilgrim
Modified: 2019-07-25 07:25 PDT
CC List: 5 users

See Also:
Fixed By Commit(s): r353044, r354771


Description Simon Pilgrim 2019-01-26 13:38:31 PST
We sometimes fail to recognise that ISD::SUB(x,y) and X86ISD::SUB(x,y) can be represented by a single X86ISD::SUB(x,y) node.

https://godbolt.org/z/-RFU9J

#include <x86intrin.h>

using u64 = unsigned long long;

u64 test_sub1(u64 *p, u64 x) {
  u64 r = *p - x;
  _subborrow_u64(0, *p, x, p);
  return *p - r; // zero
}

u64 test_sub2(u64 *p, u64 x) {
  u64 r = *p - x;
  _subborrow_u64(0, *p, x, p);
  return r - *p; // zero
}
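For reference, the arithmetic can be modelled without x86 intrinsics; in this sketch, model_subborrow_u64 is a hypothetical portable stand-in for _subborrow_u64 (assuming the intrinsic computes a - b - borrow_in and stores the difference through the pointer), written only to show why both functions algebraically return zero:

```c
#include <assert.h>

typedef unsigned long long u64;

/* Portable model of _subborrow_u64 (illustrative, not the real intrinsic):
   computes a - b - borrow_in, stores the difference, returns borrow_out. */
static unsigned char model_subborrow_u64(unsigned char b_in, u64 a, u64 b,
                                         u64 *out) {
    *out = a - b - b_in;
    /* A borrow occurs when a < b, or when a == b and we borrow in. */
    return (a < b) || (b_in && a == b);
}

u64 model_test_sub1(u64 *p, u64 x) {
    u64 r = *p - x;                   /* generic subtraction            */
    model_subborrow_u64(0, *p, x, p); /* same subtraction, SBB-style    */
    return *p - r;                    /* *p now equals r, so this is 0  */
}

u64 model_test_sub2(u64 *p, u64 x) {
    u64 r = *p - x;
    model_subborrow_u64(0, *p, x, p);
    return r - *p;                    /* likewise 0 */
}
```

Since both subtractions compute the same value, a single flag-producing X86ISD::SUB can serve both uses.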

_Z9test_sub1Pyy:
        movq    (%rdi), %rcx
        movq    %rcx, %rax
        subq    %rsi, %rax
        movq    %rax, (%rdi)
        subq    %rcx, %rsi
        addq    %rsi, %rax
        retq
_Z9test_sub2Pyy:
        subq    %rsi, (%rdi)
        xorl    %eax, %eax
        retq
Comment 1 Simon Pilgrim 2019-02-04 05:48:04 PST
The test_sub1 case is fixed by rL353044, but it relies on the SBB being simplified to a generic SUB.

There's still scope to fix the more general case where the overflow flag is used.
Comment 2 Simon Pilgrim 2019-02-04 06:01:51 PST
More general test cases: https://godbolt.org/z/V13JT_

define i64 @test_sub1a(i64*, i64) {
  %3 = load i64, i64* %0, align 8
  %4 = tail call { i8, i64 } @llvm.x86.subborrow.64(i8 0, i64 %3, i64 %1)
  %5 = extractvalue { i8, i64 } %4, 1
  store i64 %5, i64* %0, align 8
  %6 = extractvalue { i8, i64 } %4, 0
  %7 = icmp eq i8 %6, 0
  %8 = sub i64 %1, %3
  %9 = add i64 %8, %5
  %10 = select i1 %7, i64 %9, i64 0
  ret i64 %10
}
define i64 @test_sub2a(i64*, i64) {
  %3 = load i64, i64* %0, align 8
  %4 = tail call { i8, i64 } @llvm.x86.subborrow.64(i8 0, i64 %3, i64 %1)
  %5 = extractvalue { i8, i64 } %4, 1
  store i64 %5, i64* %0, align 8
  %6 = extractvalue { i8, i64 } %4, 0
  %7 = icmp eq i8 %6, 0
  %8 = sub i64 %3, %1
  %9 = add i64 %8, %5
  %10 = select i1 %7, i64 0, i64 %9
  ret i64 %10
}
declare { i8, i64 } @llvm.x86.subborrow.64(i8, i64, i64)

test_sub1a:
        movq    (%rdi), %rcx
        movq    %rcx, %rax
        subq    %rsi, %rax
        movq    %rsi, %rdx
        subq    %rcx, %rdx
        addq    %rax, %rdx
        xorl    %eax, %eax
        subq    %rsi, %rcx
        movq    %rcx, (%rdi)
        cmovaeq %rdx, %rax
        retq
test_sub2a:
        movq    (%rdi), %rcx
        movq    %rcx, %rdx
        subq    %rsi, %rdx
        addq    %rdx, %rdx
        xorl    %eax, %eax
        subq    %rsi, %rcx
        movq    %rcx, (%rdi)
        cmovbq  %rdx, %rax
        retq
Comment 3 Simon Pilgrim 2019-02-24 14:13:56 PST
Candidate: https://reviews.llvm.org/D58597
Comment 4 Simon Pilgrim 2019-05-12 06:07:54 PDT
Merge negated ISD::SUB nodes into X86ISD::SUB equivalent 
https://reviews.llvm.org/D58875
Comment 5 Sanjay Patel 2019-07-11 11:08:13 PDT
Anything left after https://reviews.llvm.org/rL365791?

The example from comment 2 isn't optimal, but I think that's a separate problem for cmov:

_test_sub1a:                            ## @test_sub1a
	.cfi_startproc
## %bb.0:
	xorl	%eax, %eax
	subq	%rsi, (%rdi)
	cmovaeq	%rax, %rax
	retq
Comment 6 Simon Pilgrim 2019-07-12 10:59:54 PDT
define i64 @test_sub2a(i64*, i64) {
  %3 = load i64, i64* %0, align 8
  %4 = tail call { i8, i64 } @llvm.x86.subborrow.64(i8 0, i64 %3, i64 %1)
  %5 = extractvalue { i8, i64 } %4, 1
  store i64 %5, i64* %0, align 8
  %6 = extractvalue { i8, i64 } %4, 0
  %7 = icmp eq i8 %6, 0
  %8 = sub i64 %3, %1
  %9 = add i64 %8, %5
  %10 = select i1 %7, i64 0, i64 %9
  ret i64 %10
}

declare { i8, i64 } @llvm.x86.subborrow.64(i8, i64, i64)

test_sub2a:                             # @test_sub2a
        movq    (%rdi), %rcx
        movq    %rcx, %rdx
        subq    %rsi, %rdx   <-- DUPLICATE?
        addq    %rdx, %rdx
        xorl    %eax, %eax
        subq    %rsi, %rcx   <-- DUPLICATE?
        movq    %rcx, (%rdi)
        cmovbq  %rdx, %rax
        retq

This still doesn't look great.
Comment 7 Simon Pilgrim 2019-07-12 11:00:34 PDT
(In reply to Sanjay Patel from comment #5)
> Anything left after https://reviews.llvm.org/rL365791?
> 
> The example from comment 2 isn't optimal, but I think that's a separate
> problem for cmov:
> 
> _test_sub1a:                            ## @test_sub1a
> 	.cfi_startproc
> ## %bb.0:
> 	xorl	%eax, %eax
> 	subq	%rsi, (%rdi)
> 	cmovaeq	%rax, %rax
> 	retq

Are we missing a CMOV(X,X) -> X combine to fix this?
Comment 8 Sanjay Patel 2019-07-12 11:11:45 PDT
(In reply to Simon Pilgrim from comment #7) 
> Are we missing a CMOV(X,X) -> X combine to fix this?

Yes - we have that for generic (v)select, but this gets converted to cmov, and there's no simplification for this case.
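In isolation the missing fold is trivial; a minimal stand-alone sketch (Node and foldCMov are illustrative names, not LLVM's real API):

```cpp
#include <cassert>

// Minimal stand-in for a selection-DAG node; illustrative only.
struct Node { int id; };

// CMOV(x, x, cc) yields the same value for either outcome of the
// condition, so it can fold to x without inspecting the condition code.
const Node *foldCMov(const Node *falseOp, const Node *trueOp) {
    if (falseOp == trueOp)
        return trueOp;   // condition irrelevant: CMOV(x, x) -> x
    return nullptr;      // operands differ: keep the cmov
}
```

The real combine would live alongside the other X86ISD::CMOV simplifications and also needs to preserve the chain/flag uses correctly.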
Comment 9 Craig Topper 2019-07-12 11:14:13 PDT
(In reply to Simon Pilgrim from comment #6)
> define i64 @test_sub2a(i64*, i64) {
>   %3 = load i64, i64* %0, align 8
>   %4 = tail call { i8, i64 } @llvm.x86.subborrow.64(i8 0, i64 %3, i64 %1)
>   %5 = extractvalue { i8, i64 } %4, 1
>   store i64 %5, i64* %0, align 8
>   %6 = extractvalue { i8, i64 } %4, 0
>   %7 = icmp eq i8 %6, 0
>   %8 = sub i64 %3, %1
>   %9 = add i64 %8, %5
>   %10 = select i1 %7, i64 0, i64 %9
>   ret i64 %10
> }
> 
> declare { i8, i64 } @llvm.x86.subborrow.64(i8, i64, i64)
> 
> test_sub2a:                             # @test_sub2a
>         movq    (%rdi), %rcx
>         movq    %rcx, %rdx
>         subq    %rsi, %rdx   <-- DUPLICATE?
>         addq    %rdx, %rdx
>         xorl    %eax, %eax
>         subq    %rsi, %rcx   <-- DUPLICATE?
>         movq    %rcx, (%rdi)
>         cmovbq  %rdx, %rax
>         retq
> 
> This still doesn't look great.

Looks like the sub was duplicated when the DAG was linearized since we can't pass the flags past the addq. And the addq is dependent on the sub.
Comment 10 Sanjay Patel 2019-07-13 05:06:58 PDT
(In reply to Sanjay Patel from comment #8)
> (In reply to Simon Pilgrim from comment #7) 
> > Are we missing a CMOV(X,X) -> X combine to fix this?
> 
> Yes - we have that for generic (v)select, but this gets converted to cmov,
> and there's no simplification for this case.

This seems unlikely to happen in the real world, but just in case:
https://reviews.llvm.org/rL365998
Comment 11 Sanjay Patel 2019-07-13 05:14:19 PDT
(In reply to Craig Topper from comment #9)
> (In reply to Simon Pilgrim from comment #6)
> > define i64 @test_sub2a(i64*, i64) {
> >   %3 = load i64, i64* %0, align 8
> >   %4 = tail call { i8, i64 } @llvm.x86.subborrow.64(i8 0, i64 %3, i64 %1)
> >   %5 = extractvalue { i8, i64 } %4, 1
> >   store i64 %5, i64* %0, align 8
> >   %6 = extractvalue { i8, i64 } %4, 0
> >   %7 = icmp eq i8 %6, 0
> >   %8 = sub i64 %3, %1
> >   %9 = add i64 %8, %5
> >   %10 = select i1 %7, i64 0, i64 %9
> >   ret i64 %10
> > }
> > 
> > declare { i8, i64 } @llvm.x86.subborrow.64(i8, i64, i64)
> > 
> > test_sub2a:                             # @test_sub2a
> >         movq    (%rdi), %rcx
> >         movq    %rcx, %rdx
> >         subq    %rsi, %rdx   <-- DUPLICATE?
> >         addq    %rdx, %rdx
> >         xorl    %eax, %eax
> >         subq    %rsi, %rcx   <-- DUPLICATE?
> >         movq    %rcx, (%rdi)
> >         cmovbq  %rdx, %rax
> >         retq
> > 
> > This still doesn't look great.
> 
> Looks like the sub was duplicated when the DAG was linearized since we can't
> pass the flags past the addq. And the addq is dependent on the sub.

Yeah, the DAG looks ideal based on what we set out to do in this report: merge sub and x86sub. Do we want to break up an x86sub if it has direct and indirect uses within some other instruction? The problem with that is that the indirect use could be arbitrarily indirect - we might need to walk the entire DAG to find it.
Comment 12 Simon Pilgrim 2019-07-13 12:48:52 PDT
(In reply to Sanjay Patel from comment #11)
> 
> Yeah, the DAG looks ideal based on what we set out to do in this report:
> merge sub and x86sub. Do we want to break up an x86sub if it has direct and
> indirect uses within some other instruction? The problem with that is that
> the indirect use could be arbitrarily indirect - we might need to walk the
> entire DAG to find it.

I think this could become: 

test_sub2a:                             # @test_sub2a
  movq    (%rdi), %rcx
  xorl    %eax, %eax
  subq    %rsi, %rcx
  leaq    (%rcx, %rcx), %rdx
  movq    %rcx, (%rdi)
  cmovbq  %rdx, %rax
  retq

Is it worth it? https://gcc.godbolt.org/z/iK-5ct
Comment 13 Sanjay Patel 2019-07-14 07:26:11 PDT
(In reply to Simon Pilgrim from comment #12)
> 
> test_sub2a:                             # @test_sub2a
>   movq    (%rdi), %rcx
>   xorl    %eax, %eax
>   subq    %rsi, %rcx
>   leaq    (%rcx, %rcx), %rdx
>   movq    %rcx, (%rdi)
>   cmovbq  %rdx, %rax
>   retq
> 
> Is it worth it? https://gcc.godbolt.org/z/iK-5ct

We're relying on the fact that 'lea' doesn't affect flags. So in its minimal form, I think we're asking whether we should more generally replace 'add' with 'lea':
https://gcc.godbolt.org/z/A4Hgdq

IIUC, we don't do that because 'lea' doesn't always have the same throughput as 'add' (as shown there for Haswell).

But we could add a heuristic to make it more likely to create 'lea' and improve this example (and a few other existing regression tests):
https://reviews.llvm.org/D64707
Comment 14 Sanjay Patel 2019-07-18 05:54:36 PDT
I think the last fix for this bug is:
https://reviews.llvm.org/rL366431

It just missed the cut-off for the clang 9.0 release (r366426). 

I'll let others decide if we can resolve this now and/or it's worth trying to merge to the branch.
Comment 15 Simon Pilgrim 2019-07-19 03:06:42 PDT
(In reply to Sanjay Patel from comment #14)
> I think the last fix for this bug is:
> https://reviews.llvm.org/rL366431
> 
> It just missed the cut-off for the clang 9.0 release (r366426). 
> 
> I'll let others decide if we can resolve this now and/or it's worth trying
> to merge to the branch.

If possible I'd like to get rL366431 merged into the 9.0.0 release branch - Craig, do you agree?
Comment 16 Craig Topper 2019-07-19 09:09:43 PDT
I agree with merging rL366431 to 9.0
Comment 17 Hans Wennborg 2019-07-22 10:33:07 PDT
Merged in r366704.