LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filed before November 26, 2021. Use GitHub to submit LLVM bugs.

Bug 39968 - [x86-64] Suboptimal codegen for (!x) as opposed to (0 == x)
Summary: [x86-64] Suboptimal codegen for (!x) as opposed to (0 == x)
Status: RESOLVED FIXED
Alias: None
Product: libraries
Classification: Unclassified
Component: Scalar Optimizations
Version: trunk
Hardware: PC All
Importance: P enhancement
Assignee: Unassigned LLVM Bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-12-11 22:27 PST by Arthur O'Dwyer
Modified: 2019-01-27 07:57 PST
CC List: 5 users

See Also:
Fixed By Commit(s): 349385


Description Arthur O'Dwyer 2018-12-11 22:27:49 PST
https://godbolt.org/z/Z2afe8

    #include <stdint.h>
    struct u128 { uint64_t x, y; };

    u128 shl2(uint64_t rdi, uint64_t rdx, int n)
    {
        if (n & 64) rdx = rdi;
    #if WORSE
        if (!(n & 64)) rdi = 0;
    #else
        if (0 == (n & 64)) rdi = 0;
    #endif
        return {rdi,rdx};
    }

Without -DWORSE, Clang generates perfect codegen.

    andl $64, %edx
    xorl %eax, %eax
    cmovneq %rdi, %rsi
    cmovneq %rdi, %rax
    movq %rsi, %rdx
    retq

But with -DWORSE, Clang generates a dead "shr" instruction whose result is never used:

    andl $64, %edx
    xorl %eax, %eax
    shrl $6, %edx      // USELESS INSTRUCTION!
    cmovneq %rdi, %rsi
    cmovneq %rdi, %rax
    movq %rsi, %rdx
    retq

This is surprising, given that the only difference is using (0 == x) versus (!x). I would expect these to have equivalent codegen.
Comment 1 Arthur O'Dwyer 2018-12-12 08:37:11 PST
Sorry, cut-and-paste error there. The "perfect codegen" without -DWORSE is actually

  xorl %eax, %eax
  testb $64, %dl
  cmovneq %rdi, %rsi
  cmovneq %rdi, %rax
  movq %rsi, %rdx
  retq

(That is, where -DWORSE produces "and, xor, shr", the optimal codegen produces "xor, test".)
Comment 2 Simon Pilgrim 2018-12-12 09:00:51 PST
This comes down to:

define dso_local { i64, i64 } @_Z4shl2mmi(i64, i64, i32) local_unnamed_addr #0 {
  %4 = and i32 %2, 64
  %5 = icmp eq i32 %4, 0
  %6 = select i1 %5, i64 %1, i64 %0
  %7 = select i1 %5, i64 0, i64 %0
  %8 = insertvalue { i64, i64 } undef, i64 %7, 0
  %9 = insertvalue { i64, i64 } %8, i64 %6, 1
  ret { i64, i64 } %9
}

vs

define dso_local { i64, i64 } @_Z4shl2mmi(i64, i64, i32) local_unnamed_addr #0 {
  %4 = and i32 %2, 64
  %5 = icmp ne i32 %4, 0
  %6 = select i1 %5, i64 %0, i64 %1
  %7 = select i1 %5, i64 %0, i64 0
  %8 = insertvalue { i64, i64 } undef, i64 %7, 0
  %9 = insertvalue { i64, i64 } %8, i64 %6, 1
  ret { i64, i64 } %9
}
Comment 3 Sanjay Patel 2018-12-12 13:50:00 PST
Technically, this is an IR canonicalization problem: if we have functionally equivalent code, there should be one unique/minimal IR form for it. But that breaks down here because the icmp has multiple uses, so we don't try to convert the predicate to "eq".

The easier solution will probably be to add a backend transform to recognize the and+shift and turn that into 'test'.
Comment 4 Craig Topper 2018-12-17 12:05:03 PST
This should be fixed after r349385.
Comment 5 Simon Pilgrim 2019-01-27 07:57:07 PST
Resolving - fixed by rL349385