We sometimes fail to recognise that ISD::SUB(x,y) and X86ISD::SUB(x,y) can both be represented by a single X86ISD::SUB(x,y) node.

https://godbolt.org/z/-RFU9J

#include <x86intrin.h>

using u64 = unsigned long long;

u64 test_sub1(u64 *p, u64 x) {
  u64 r = *p - x;
  _subborrow_u64(0, *p, x, p);
  return *p - r; // zero
}

u64 test_sub2(u64 *p, u64 x) {
  u64 r = *p - x;
  _subborrow_u64(0, *p, x, p);
  return r - *p; // zero
}

_Z9test_sub1Pyy:
  movq (%rdi), %rcx
  movq %rcx, %rax
  subq %rsi, %rax
  movq %rax, (%rdi)
  subq %rcx, %rsi
  addq %rsi, %rax
  retq

_Z9test_sub2Pyy:
  subq %rsi, (%rdi)
  xorl %eax, %eax
  retq
The test_sub1 case is fixed by rL353044, but it relies on the SBB being simplified to a generic SUB. There's still scope to fix the more general case where the overflow flag is used.
More general test cases: https://godbolt.org/z/V13JT_

define i64 @test_sub1a(i64*, i64) {
  %3 = load i64, i64* %0, align 8
  %4 = tail call { i8, i64 } @llvm.x86.subborrow.64(i8 0, i64 %3, i64 %1)
  %5 = extractvalue { i8, i64 } %4, 1
  store i64 %5, i64* %0, align 8
  %6 = extractvalue { i8, i64 } %4, 0
  %7 = icmp eq i8 %6, 0
  %8 = sub i64 %1, %3
  %9 = add i64 %8, %5
  %10 = select i1 %7, i64 %9, i64 0
  ret i64 %10
}

define i64 @test_sub2a(i64*, i64) {
  %3 = load i64, i64* %0, align 8
  %4 = tail call { i8, i64 } @llvm.x86.subborrow.64(i8 0, i64 %3, i64 %1)
  %5 = extractvalue { i8, i64 } %4, 1
  store i64 %5, i64* %0, align 8
  %6 = extractvalue { i8, i64 } %4, 0
  %7 = icmp eq i8 %6, 0
  %8 = sub i64 %3, %1
  %9 = add i64 %8, %5
  %10 = select i1 %7, i64 0, i64 %9
  ret i64 %10
}

declare { i8, i64 } @llvm.x86.subborrow.64(i8, i64, i64)

test_sub1a:
  movq (%rdi), %rcx
  movq %rcx, %rax
  subq %rsi, %rax
  movq %rsi, %rdx
  subq %rcx, %rdx
  addq %rax, %rdx
  xorl %eax, %eax
  subq %rsi, %rcx
  movq %rcx, (%rdi)
  cmovaeq %rdx, %rax
  retq

test_sub2a:
  movq (%rdi), %rcx
  movq %rcx, %rdx
  subq %rsi, %rdx
  addq %rdx, %rdx
  xorl %eax, %eax
  subq %rsi, %rcx
  movq %rcx, (%rdi)
  cmovbq %rdx, %rax
  retq
Candidate: https://reviews.llvm.org/D58597
Merge negated ISD::SUB nodes into X86ISD::SUB equivalent https://reviews.llvm.org/D58875
Anything left after https://reviews.llvm.org/rL365791?

The example from comment 2 isn't optimal, but I think that's a separate problem for cmov:

_test_sub1a: ## @test_sub1a
  .cfi_startproc
## %bb.0:
  xorl %eax, %eax
  subq %rsi, (%rdi)
  cmovaeq %rax, %rax
  retq
define i64 @test_sub2a(i64*, i64) {
  %3 = load i64, i64* %0, align 8
  %4 = tail call { i8, i64 } @llvm.x86.subborrow.64(i8 0, i64 %3, i64 %1)
  %5 = extractvalue { i8, i64 } %4, 1
  store i64 %5, i64* %0, align 8
  %6 = extractvalue { i8, i64 } %4, 0
  %7 = icmp eq i8 %6, 0
  %8 = sub i64 %3, %1
  %9 = add i64 %8, %5
  %10 = select i1 %7, i64 0, i64 %9
  ret i64 %10
}

declare { i8, i64 } @llvm.x86.subborrow.64(i8, i64, i64)

test_sub2a: # @test_sub2a
  movq (%rdi), %rcx
  movq %rcx, %rdx
  subq %rsi, %rdx   <-- DUPLICATE?
  addq %rdx, %rdx
  xorl %eax, %eax
  subq %rsi, %rcx   <-- DUPLICATE?
  movq %rcx, (%rdi)
  cmovbq %rdx, %rax
  retq

This still doesn't look great.
(In reply to Sanjay Patel from comment #5)
> Anything left after https://reviews.llvm.org/rL365791?
>
> The example from comment 2 isn't optimal, but I think that's a separate
> problem for cmov:
>
> _test_sub1a: ## @test_sub1a
>   .cfi_startproc
> ## %bb.0:
>   xorl %eax, %eax
>   subq %rsi, (%rdi)
>   cmovaeq %rax, %rax
>   retq

Are we missing a CMOV(X,X) -> X combine to fix this?
(In reply to Simon Pilgrim from comment #7)
> Are we missing a CMOV(X,X) -> X combine to fix this?

Yes - we have that for generic (v)select, but this gets converted to cmov, and there's no simplification for this case.
(In reply to Simon Pilgrim from comment #6)
> define i64 @test_sub2a(i64*, i64) {
>   %3 = load i64, i64* %0, align 8
>   %4 = tail call { i8, i64 } @llvm.x86.subborrow.64(i8 0, i64 %3, i64 %1)
>   %5 = extractvalue { i8, i64 } %4, 1
>   store i64 %5, i64* %0, align 8
>   %6 = extractvalue { i8, i64 } %4, 0
>   %7 = icmp eq i8 %6, 0
>   %8 = sub i64 %3, %1
>   %9 = add i64 %8, %5
>   %10 = select i1 %7, i64 0, i64 %9
>   ret i64 %10
> }
>
> declare { i8, i64 } @llvm.x86.subborrow.64(i8, i64, i64)
>
> test_sub2a: # @test_sub2a
>   movq (%rdi), %rcx
>   movq %rcx, %rdx
>   subq %rsi, %rdx   <-- DUPLICATE?
>   addq %rdx, %rdx
>   xorl %eax, %eax
>   subq %rsi, %rcx   <-- DUPLICATE?
>   movq %rcx, (%rdi)
>   cmovbq %rdx, %rax
>   retq
>
> This still doesn't look great.

Looks like the sub was duplicated when the DAG was linearized, since we can't pass the flags past the addq, and the addq is dependent on the sub.
(In reply to Sanjay Patel from comment #8)
> (In reply to Simon Pilgrim from comment #7)
> > Are we missing a CMOV(X,X) -> X combine to fix this?
>
> Yes - we have that for generic (v)select, but this gets converted to cmov,
> and there's no simplification for this case.

This seems unlikely to happen in the real world, but just in case: https://reviews.llvm.org/rL365998
(In reply to Craig Topper from comment #9)
> (In reply to Simon Pilgrim from comment #6)
> > test_sub2a: # @test_sub2a
> >   movq (%rdi), %rcx
> >   movq %rcx, %rdx
> >   subq %rsi, %rdx   <-- DUPLICATE?
> >   addq %rdx, %rdx
> >   xorl %eax, %eax
> >   subq %rsi, %rcx   <-- DUPLICATE?
> >   movq %rcx, (%rdi)
> >   cmovbq %rdx, %rax
> >   retq
> >
> > This still doesn't look great.
>
> Looks like the sub was duplicated when the DAG was linearized since we can't
> pass the flags past the addq. And the addq is dependent on the sub.

Yeah, the DAG looks ideal based on what we set out to do in this report: merge sub and x86sub. Do we want to break up an x86sub if it has direct and indirect uses within some other instruction? The problem with that is that the indirect use could be arbitrarily indirect - we might need to walk the entire DAG to find it.
(In reply to Sanjay Patel from comment #11)
> Yeah, the DAG looks ideal based on what we set out to do in this report:
> merge sub and x86sub. Do we want to break up an x86sub if it has direct and
> indirect uses within some other instruction? The problem with that is that
> the indirect use could be arbitrarily indirect - we might need to walk the
> entire DAG to find it.

I think this could become:

test_sub2a: # @test_sub2a
  movq (%rdi), %rcx
  xorl %eax, %eax
  subq %rsi, %rcx
  leaq (%rcx,%rcx), %rdx
  movq %rcx, (%rdi)
  cmovbq %rdx, %rax
  retq

Is it worth it? https://gcc.godbolt.org/z/iK-5ct
(In reply to Simon Pilgrim from comment #12)
> test_sub2a: # @test_sub2a
>   movq (%rdi), %rcx
>   xorl %eax, %eax
>   subq %rsi, %rcx
>   leaq (%rcx,%rcx), %rdx
>   movq %rcx, (%rdi)
>   cmovbq %rdx, %rax
>   retq
>
> Is it worth it? https://gcc.godbolt.org/z/iK-5ct

We're relying on the fact that 'lea' doesn't affect flags. So in its minimal form, I think we're asking whether we should more generally replace 'add' with 'lea': https://gcc.godbolt.org/z/A4Hgdq

IIUC, we don't do that because 'lea' doesn't always have full throughput like 'add' (as shown there for Haswell). But we could add a heuristic to make it more likely to create 'lea' and improve this example (and a few other existing regression tests): https://reviews.llvm.org/D64707
I think the last fix for this bug is: https://reviews.llvm.org/rL366431 It just missed the cut-off for the clang 9.0 release (r366426). I'll let others decide if we can resolve this now and/or it's worth trying to merge to the branch.
(In reply to Sanjay Patel from comment #14)
> I think the last fix for this bug is:
> https://reviews.llvm.org/rL366431
>
> It just missed the cut-off for the clang 9.0 release (r366426).
>
> I'll let others decide if we can resolve this now and/or it's worth trying
> to merge to the branch.

If possible I'd like to get rL366431 merged into the 9.0 release branch - Craig do you agree?
I agree with merging rL366431 to 9.0
Merged in r366704.