LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filed before November 26, 2021. Please use GitHub to submit new LLVM bugs.

Bug 40486 - CodeGenPrepare::CombineUAddWithOverflow - improve icmp eq/ne matching
Summary: CodeGenPrepare::CombineUAddWithOverflow - improve icmp eq/ne matching
Status: RESOLVED FIXED
Alias: None
Product: libraries
Classification: Unclassified
Component: Common Code Generator Code
Version: trunk
Hardware: PC Windows NT
Importance: P enhancement
Assignee: Unassigned LLVM Bugs
URL:
Keywords:
Depends on:
Blocks: 31754
Reported: 2019-01-27 03:42 PST by Simon Pilgrim
Modified: 2019-02-24 08:42 PST
6 users

See Also:
Fixed By Commit(s): 354746


Description Simon Pilgrim 2019-01-27 03:42:26 PST
Reviewing the addcarry cases from [Bug #31754 Comment #3]:

https://godbolt.org/z/eZWlsG

The IR demonstrates that much of the problem is that while CombineUAddWithOverflow can match the ult/ugt cases of the add result, it can't handle the ult/ugt canonicalizations (against 1/-1/signbit/etc.) that produce an icmp eq/ne instead:

define void @_Z4testILy1EEvRyS0_(i64*, i64*) {
  %3 = load i64, i64* %0, align 8
  %4 = add i64 %3, 1
  store i64 %4, i64* %0, align 8
  %5 = icmp eq i64 %4, 0 ; <---- fails to match uaddo
  %6 = zext i1 %5 to i8
  %7 = load i64, i64* %1, align 8
  %8 = tail call { i8, i64 } @llvm.x86.addcarry.64(i8 %6, i64 %7, i64 0) #2
  %9 = extractvalue { i8, i64 } %8, 1
  store i64 %9, i64* %1, align 8
  ret void
}

define void @_Z4testILy2EEvRyS0_(i64*, i64*) {
  %3 = load i64, i64* %0, align 8
  %4 = add i64 %3, 2
  store i64 %4, i64* %0, align 8
  %5 = icmp ult i64 %4, 2 ; <---- matches uaddo
  %6 = zext i1 %5 to i8
  %7 = load i64, i64* %1, align 8
  %8 = tail call { i8, i64 } @llvm.x86.addcarry.64(i8 %6, i64 %7, i64 0) #2
  %9 = extractvalue { i8, i64 } %8, 1
  store i64 %9, i64* %1, align 8
  ret void
}

declare { i8, i64 } @llvm.x86.addcarry.64(i8, i64, i64)
Comment 1 Simon Pilgrim 2019-01-27 06:05:19 PST
Added tests at rL352316
Comment 2 Sanjay Patel 2019-01-29 07:07:53 PST
I'm wondering if we should fix this via IR canonicalization rather than pattern-matching proliferation.

For historical/codegen reasons, we have favored squashing IR select into math/logic as a canonicalization. But given that cmp/select may have profile metadata attached, I don't think that's the right choice so early in the optimization pipeline. 

We've reversed several of the obvious cases (where the math/logic sequence is longer) already after improving the backend. This one is instruction count neutral:

https://rise4fun.com/Alive/r9s

%z = zext i1 %x to i32
%s = sub i32 %y, %z
=>
%ym1 = sub i32 %y, 1
%s = select i1 %x, i32 %ym1, i32 %y
Comment 3 Simon Pilgrim 2019-01-31 01:25:37 PST
This was raised during triage of the issues in [Bug #31754] - if those can be fixed without needing this fix then I'm happy for this bug to be discarded.
Comment 4 Sanjay Patel 2019-01-31 06:21:44 PST
(In reply to Simon Pilgrim from comment #3)
> This was raised during triage of the issues in [Bug #31754] - if those can
> be fixed without needing this fix then I'm happy for this bug to be
> discarded.

Reference: https://reviews.llvm.org/D57453

So yes, there are independent holes raised by each. I'm still inclined to try to solve the problems generally because every time I've tried a tailored fix for a bug, its cousins come out of the woodwork.

But that said, there is a simple fix for the matcher for the most important sub-problem in bug 31754. I'll post that soon.
Comment 5 Sanjay Patel 2019-01-31 08:11:50 PST
(In reply to Sanjay Patel from comment #4)
> ...there is a simple fix for the matcher for the most important
> sub-problem in bug 31754. I'll post that soon.

https://reviews.llvm.org/D57516
Comment 6 Sanjay Patel 2019-02-01 06:20:00 PST
x86-specific IR improvement is here:
https://reviews.llvm.org/rL352870

The matcher improvement (D57516) is likely ok, but it did raise a minor x86 codegen regression, partly addressed here:
https://reviews.llvm.org/D57547

...and there's an open question about Hexagon codegen too.
Comment 7 Sanjay Patel 2019-02-04 15:08:19 PST
We should get the +1 case now:
https://reviews.llvm.org/rL352998

And if that somehow escapes, then we have DAG matching to catch it too:
https://reviews.llvm.org/rL352984

The adds of SIGNED_MIN_VAL and -1 are harder (reduced from the previous godbolt link):
https://godbolt.org/z/1T3CWx

define void @_Z4testILj2147483648EEvRjS0_(i32* dereferenceable(4), i32* dereferenceable(4)) {
  %3 = load i32, i32* %0, align 4
  %4 = xor i32 %3, -2147483648
  store i32 %4, i32* %0, align 4
  %5 = lshr i32 %3, 31
  %6 = trunc i32 %5 to i8
  %7 = load i32, i32* %1, align 4
  %8 = tail call { i8, i32 } @llvm.x86.addcarry.32(i8 %6, i32 %7, i32 0) #2
  %9 = extractvalue { i8, i32 } %8, 1
  store i32 %9, i32* %1, align 4
  ret void
}

define void @_Z4testILj4294967295EEvRjS0_(i32* dereferenceable(4), i32* dereferenceable(4)) {
  %3 = load i32, i32* %0, align 4
  %4 = add i32 %3, -1
  store i32 %4, i32* %0, align 4
  %5 = icmp ne i32 %3, 0
  %6 = zext i1 %5 to i8
  %7 = load i32, i32* %1, align 4
  %8 = tail call { i8, i32 } @llvm.x86.addcarry.32(i8 %6, i32 %7, i32 0) #2
  %9 = extractvalue { i8, i32 } %8, 1
  store i32 %9, i32* %1, align 4
  ret void
}

declare { i8, i32 } @llvm.x86.addcarry.32(i8, i32, i32) #1

---------------------------------------------------------------------------

For -1, we don't have a cmp that depends on the add, so we can't just tack another pattern into the matcher. With 0x80000000, there's no add or cmp to match; we have independent xor/lshr.

Doing this in CGP is a lot of extra work for highly unlikely patterns. It's possible that this is easier to recognize in the DAG, but I'd still doubt that the effort is worth the payoff.

I'm still looking at the general IR improvements, but that's held up by a preliminary fold in the DAG:
https://reviews.llvm.org/D57401

So unless there's evidence that adding those 2 huge constants and checking for overflow is important, I think we shouldn't worry about those 2 cases.
Comment 8 Arthur O'Dwyer 2019-02-04 17:20:14 PST
Sanjay, thanks, that new patch is awesome!
I think the 0x80000000 case is just a curiosity and I won't agitate for it.
The 0xFFFFFFFF case is important, though, because it's what would be used by "BigInt::operator--()".

https://github.com/Quuxplusone/WideIntProofOfConcept/commit/f8e692731b093f48797e411d6685e9106b9c9e35
Your patch improved my "preinc" and "postinc" cases, but it didn't touch my equally important "predec" and "postdec" cases. I think (but am not 100% sure) that if you fixed the 0xFFFFFFFF case, it would improve the "predec" and "postdec" cases.
https://godbolt.org/z/XVB4hY
Comment 9 Sanjay Patel 2019-02-04 18:27:06 PST
(In reply to Arthur O'Dwyer from comment #8)
> Sanjay, thanks, that new patch is awesome!
> I think the 0x80000000 case is just a curiosity and I won't agitate for it.
> The 0xFFFFFFFF case is important, though, because it's what would be used by
> "BigInt::operator--()".
> 
> https://github.com/Quuxplusone/WideIntProofOfConcept/commit/
> f8e692731b093f48797e411d6685e9106b9c9e35
> Your patch improved my "preinc" and "postinc" cases, but it didn't touch my
> equally important "predec" and "postdec" cases. I think (but am not 100%
> sure) that if you fixed the 0xFFFFFFFF case, it would improve the "predec"
> and "postdec" cases.
> https://godbolt.org/z/XVB4hY

Thanks for the info. I’m working on a CGP patch to form unsigned subtract with overflow, and I’m hoping that will work with the -1 case. Subtract is a bit more difficult to match as noted in bug 40487, but I’ll try to get that done soon.
Comment 10 Sanjay Patel 2019-02-22 08:44:52 PST
(In reply to Arthur O'Dwyer from comment #8)
> https://godbolt.org/z/XVB4hY

Arthur - can you verify that we are getting those cases now in your original apps?

I think we are still missing a couple of add patterns from bug 31754, but getting closer...
Comment 11 Arthur O'Dwyer 2019-02-22 10:52:51 PST
Sanjay: I confirm that I now see perfect codegen for Uint128::decrement()!

If you're still looking for work ;) I see improvement, but not perfection, in Uint128::negate(). Here's an example.
https://godbolt.org/z/BGRoQ4
  xorl %eax, %eax
  xorl %ecx, %ecx
  subq (%rdi), %rcx
  sbbq 8(%rdi), %rax
  movq %rcx, (%rdi)
  movq %rax, 8(%rdi)
  retq
I'm not sure if GCC's codegen is perfect, or if there's a way to use "sbbq" to save an instruction here somehow.
  negq (%rdi)
  adcq $0, 8(%rdi)
  negq 8(%rdi)
  ret
Comment 12 Sanjay Patel 2019-02-22 11:10:59 PST
(In reply to Arthur O'Dwyer from comment #11)
> Sanjay: I confirm that I now see perfect codegen for Uint128::decrement()!

Great - thanks for checking that!

> If you're still looking for work ;) I see improvement, but not perfection,
> in Uint128::negate(). Here's an example.
> https://godbolt.org/z/BGRoQ4
>   xorl %eax, %eax
>   xorl %ecx, %ecx
>   subq (%rdi), %rcx
>   sbbq 8(%rdi), %rax
>   movq %rcx, (%rdi)
>   movq %rax, 8(%rdi)
>   retq
> I'm not sure if GCC's codegen is perfect, or if there's a way to use "sbbq"
> to save an instruction here somehow.
>   negq (%rdi)
>   adcq $0, 8(%rdi)
>   negq 8(%rdi)
>   ret

Hmmm...those 2 tests are independent of what we've done so far:
1. In the uint128 case, we're not forming the overflow intrinsic in IR (CGP) because that's not a legal type (native type for x86).
2. In the uint64 case, we are forming the overflow intrinsic now (so it's better than before), but the x86 lowering needs to be improved.

Since they end up with identical asm, I'm hoping that is a single x86-specific backend fix in lowering X86ISD::SUB to 'neg'.

Do you mind filing a new bug report with those examples? It's hard to keep all of these problems organized with the overlap.
Comment 13 Arthur O'Dwyer 2019-02-22 11:27:53 PST
Filed https://bugs.llvm.org/show_bug.cgi?id=40825 for uint128 negation.
Comment 14 Sanjay Patel 2019-02-24 08:42:01 PST
We should have all of the common cases after:
https://reviews.llvm.org/rL354746

I filed bug 40842 to track the +0x8000... case because I think the root cause is different on that one.