LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filed before November 26, 2021. Use GitHub to submit new LLVM bugs.

Bug 29154 - shrink-wrapping missing
Summary: shrink-wrapping missing
Status: NEW
Alias: None
Product: clang
Classification: Unclassified
Component: LLVM Codegen
Version: unspecified
Hardware: PC
OS: All
Importance: P normal
Assignee: Unassigned Clang Bugs
Reported: 2016-08-26 12:53 PDT by Dehao Chen
Modified: 2017-05-24 13:30 PDT

Description Dehao Chen 2016-08-26 12:53:17 PDT
struct x {
  int bar();
};

struct y {
  int foo();
  struct x *m_;
  int n_;
};

int y::foo() {
  if (m_ == 0)
    return -1;
  int n;
  do { (n) = __atomic_load_n((&n_), 0); } while (0); /* 0 == __ATOMIC_RELAXED */
  if (__builtin_expect(n == -1, 0)) {
    n = m_->bar();
    __atomic_store_n((&n_), (n), 0);
  }
  return n;
}

built with -fno-omit-frame-pointer -O2

gcc version:
        cmpq    $0, (%rdi)
        je      .L3
        movl    8(%rdi), %eax
        cmpl    $-1, %eax
        je      .L9
        ret
....

llvm version:
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %rbx
        pushq   %rax
        movq    %rdi, %rbx
        movq    (%rbx), %rdi
        movl    $-1, %eax
        testq   %rdi, %rdi
        je      .LBB0_3
        movl    8(%rbx), %eax
        cmpl    $-1, %eax
        je      .LBB0_2
.LBB0_3:
        addq    $8, %rsp
        popq    %rbx
        popq    %rbp
        retq
......

GCC takes 6 instructions for the path shown, while LLVM takes 16.
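
For reference, this is what shrink wrapping buys, as a hand-written C sketch (prologue, epilogue, and slow_path are invented stand-ins for the push/pop sequences and the cold code, not real functions):

------------------------------------------------------------------------
struct x;
void prologue(void);        /* stands in for push %rbp / push %rbx / ... */
void epilogue(void);        /* stands in for the matching pops */
int slow_path(struct x *m);

/* Without shrink wrapping, every call pays for the prologue: */
int foo_unwrapped(struct x *m) {
  prologue();
  int r = (m == 0) ? -1 : slow_path(m);
  epilogue();
  return r;
}

/* With shrink wrapping, the early-return path skips the prologue
   entirely, which is what the gcc output above achieves and the
   llvm output does not: */
int foo_wrapped(struct x *m) {
  if (m == 0)
    return -1;              /* no prologue executed on this path */
  prologue();
  int r = slow_path(m);
  epilogue();
  return r;
}
------------------------------------------------------------------------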
Comment 1 Reid Kleckner 2016-10-20 11:45:32 PDT
Register allocation happens before shrink wrapping, so we copy RDI into RBX in the entry block to avoid spilling it around the call. That use of RBX blocks the shrink wrapping pass.
Comment 2 Wei Mi 2016-10-20 11:59:09 PDT
> Register allocation happens before shrink wrapping, so we copy RDI into RBX in the entry block to avoid spilling it around the call. That use of RBX blocks the shrink wrapping pass.

We can avoid copying RDI into RBX by splitting the live range of the param allocated to RDI. GCC doesn't generate the copy instructions in the prologue because it has an optimization called split_live_ranges_for_shrink_wrap, which splits the live ranges of input params across calls. LLVM doesn't have that kind of splitting; I am planning to add it (see the source-level sketch after the assembly below).

Here is another testcase revealing the same problem:
------------------------------------------------------------------------
volatile long *p;
long cond, ret;
void goo(long i, long j, long k);
void hoo();
void foo(long i, long j, long k) {
  if (__builtin_expect(cond, 0)) {
    goo(i, j, k);
    ret = i + j + k;
  }
  p = __builtin_alloca(3);
}
------------------------------------------------------------------------

***** gcc code *****
foo:
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %r13
        pushq   %r12
        pushq   %rbx
        subq    $8, %rsp
        cmpq    $0, cond(%rip)
        jne     .L5
.L2:
        subq    $32, %rsp
        leaq    15(%rsp), %rax
        andq    $-16, %rax
        movq    %rax, p(%rip)
        leaq    -24(%rbp), %rsp
        popq    %rbx
        popq    %r12
        popq    %r13
        popq    %rbp
        ret
        .p2align 4,,10
        .p2align 3
.L5:
        movq    %rsi, %r13
        movq    %rdi, %rbx
        movq    %rdx, %r12
        call    goo
        leaq    (%rbx,%r13), %rdi
        leaq    (%rdi,%r12), %rdx
        movq    %rdx, ret(%rip)
        jmp     .L2
***** end gcc code *****

***** llvm code *****
        .type   foo,@function
foo:                                    # @foo
# BB#0:                                 # %entry
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %r15
        pushq   %r14
        pushq   %rbx
        pushq   %rax
        movq    %rdx, %r14          # this mov is not in the gcc code
        movq    %rsi, %r15          # this mov is not in the gcc code
        movq    %rdi, %rbx          # this mov is not in the gcc code
        cmpq    $0, cond(%rip)
        jne     .LBB0_1
.LBB0_2:                                # %if.end
        movq    %rsp, %rax
        addq    $-16, %rax
        movq    %rax, %rsp
        movq    %rax, p(%rip)
        leaq    -24(%rbp), %rsp
        popq    %rbx
        popq    %r14
        popq    %r15
        popq    %rbp
        retq
.LBB0_1:                                # %if.then
        movq    %rbx, %rdi
        movq    %r15, %rsi
        movq    %r14, %rdx
        callq   goo
        addq    %rbx, %r15
        addq    %r14, %r15
        movq    %r15, ret(%rip)
        jmp     .LBB0_2
.Lfunc_end0:
        .size   foo, .Lfunc_end0-foo
        .cfi_endproc
***** end llvm code *****
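
At the source level, the effect of such splitting on foo above is roughly the following (a hand-written sketch, not compiler output; i2/j2/k2 are invented names):

------------------------------------------------------------------------
volatile long *p;
long cond, ret;
void goo(long i, long j, long k);

void foo(long i, long j, long k) {
  if (__builtin_expect(cond, 0)) {
    /* Split point: the params are copied to fresh values only on the
       cold path, so i/j/k can stay in RDI/RSI/RDX from the entry up to
       here, and only i2/j2/k2 need callee-saved registers across the
       call to goo. */
    long i2 = i, j2 = j, k2 = k;
    goo(i2, j2, k2);
    ret = i2 + j2 + k2;
  }
  p = __builtin_alloca(3);
}
------------------------------------------------------------------------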
Comment 3 Quentin Colombet 2016-10-20 17:20:28 PDT
(In reply to comment #2)
> > Register allocation happens before shrink wrapping, so we copy RDI into RBX in the entry block to avoid spilling it around the call. That use of RBX blocks the shrink wrapping pass.
> 
> We can avoid copying RDI into RBX by splitting the live range of the
> param allocated to RDI. GCC doesn't generate the copy instructions in
> the prologue because it has an optimization called
> split_live_ranges_for_shrink_wrap, which splits the live ranges of
> input params across calls. LLVM doesn't have that kind of splitting;
> I am planning to add it.

Before adding a new splitting scheme, could you check why the current ones do not do what we want?

I suspect the region splitting happens, but it would also be interesting to see why the split is not done just within the basic block of the call.
Comment 4 Wei Mi 2016-10-21 00:12:28 PDT
> > We can avoid copying RDI into RBX by splitting the live range of the
> > param allocated to RDI. GCC doesn't generate the copy instructions in
> > the prologue because it has an optimization called
> > split_live_ranges_for_shrink_wrap, which splits the live ranges of
> > input params across calls. LLVM doesn't have that kind of splitting;
> > I am planning to add it.
> 
> Before adding a new splitting scheme, could you check why the current ones
> do not do what we want?
> 
> I suspect the region splitting happens, but it would also be interesting to
> see why the split is not done just within the basic block of the call.

For the testcase I pasted, there are more than enough available registers, so no splitting of any kind is triggered.

The problem in the testcase is this:
Param i lives across the call to goo, and although we can easily find a callee-saved register for it, the best register for it is RDI itself (so that the copy from RDI in the prologue can be removed). We need to split the live ranges of params that live across calls regardless of the program's register pressure. I think such special splitting is better done separately, before the regular regalloc starts, since it is not compatible with the existing splitting mechanism, which is only triggered when no physreg is available.

Comment 5 Hal Finkel 2016-10-21 10:25:20 PDT
> I think such special splitting is better done separately, before the regular regalloc starts, since it is not compatible with the existing splitting mechanism, which is only triggered when no physreg is available.

Is it also correct to say that this is only profitable if we actually shrink wrap?
Comment 6 Wei Mi 2016-10-21 11:24:56 PDT
(In reply to comment #5)
> > I think such special splitting is better done separately, before the regular regalloc starts, since it is not compatible with the existing splitting mechanism, which is only triggered when no physreg is available.
> 
> Is it also correct to say that this is only profitable if we actually shrink
> wrap?

It may be profitable even when shrink wrapping doesn't actually happen. Look at the code GCC and LLVM generate for the testcase I pasted above: neither compiler shrink-wraps it, but GCC generates better code because of the splitting. So maybe we can find a better name than GCC's split_live_ranges_for_shrink_wrap.
Comment 7 Quentin Colombet 2016-10-21 11:58:35 PDT
> I think such special splitting is better done separately, before the
> regular regalloc starts, since it is not compatible with the existing
> splitting mechanism, which is only triggered when no physreg is available.

So far we have tried to stay away from any form of pre-splitting so as not to introduce another set of heuristics. Wouldn't this be caught by some lazy code motion or post-RA scheduling?
(Assuming those optimizations worked at a scope bigger than a basic block.)

What I am saying is: if this is something we can optimize after RA, I would rather go in that direction. Maybe it is time to push shrink-wrapping to the next level.
Comment 8 Wei Mi 2016-10-21 12:37:04 PDT
> So far we have tried to stay away from any form of pre-splitting so as not
> to introduce another set of heuristics. Wouldn't this be caught by some lazy
> code motion or post-RA scheduling?
> (Assuming those optimizations worked at a scope bigger than a basic block.)
> 
> What I am saying is: if this is something we can optimize after RA, I would
> rather go in that direction. Maybe it is time to push shrink-wrapping to the
> next level.

I think it is much easier to do this before RA. During RA, again for the testcase I posted, if i is not split, it will be allocated to a callee-saved register, for example RBX, since it has a use after the call to goo:

prologue:
  movq    %rdi, %rbx        # i is copied to a callee-saved reg in the entry block
  ...
body:
  ...
  movq    %rbx, %rdi        # i is moved back to rdi to make the call
  callq   goo
  addq    %rbx, %r15        # use of i after the call

It is harder for a post-RA optimization to change the generated code into the form below, since that requires the global information that rdi is not modified on any path from the entry to the goo call site. Post-RA optimizations tend to be simple and BB-local.

Another reason is that there is a good chance a post-RA optimization cannot do the transformation at all if RA chooses to use rdi for another live range somewhere between the prologue and the goo call site.

prologue:
  movq    %rdi, %rdi        # the copy becomes a no-op and can simply be deleted
  ...
body:
  ...
  movq    %rdi, %rbx        # i moves to a callee-saved reg only here, right before the call
  callq   goo
  addq    %rbx, %r15

However, it is more natural to achieve that by splitting i before regalloc; regalloc will then take care of everything and generate the good code we expect. The splitting heuristic is also simple: we only have to split i's live range into two intervals. First find the smallest region covering all the call sites, then split i's live range at the border of that region, as in the sketch below.
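
Here is a minimal sketch of that heuristic over a toy live-range representation (all names and types invented for illustration; a real implementation would operate on LLVM's LiveIntervals before the greedy allocator runs):

------------------------------------------------------------------------
#include <stddef.h>

/* Toy model: instructions are numbered by slot; a live range is a
   half-open [start, end) slot pair. */
struct LiveRange { int start, end; };

/* Split the param's live range at the border of the smallest region
   covering all call sites.  For a param live from the entry, the
   relevant border is the start of the region, i.e. the earliest call:
   the first interval keeps the incoming argument register (e.g. RDI),
   and the second interval, which spans the calls, is free to be
   assigned a callee-saved register.  Assumes ncalls >= 1. */
void split_param_across_calls(struct LiveRange *param,
                              const int *call_slots, size_t ncalls,
                              struct LiveRange *cold_part) {
  int region_begin = call_slots[0];
  for (size_t i = 1; i < ncalls; ++i)
    if (call_slots[i] < region_begin)
      region_begin = call_slots[i];

  cold_part->start = region_begin;  /* interval spanning the calls  */
  cold_part->end   = param->end;
  param->end       = region_begin;  /* interval keeping the arg reg */
}
------------------------------------------------------------------------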

Since the splitting is limited to function params, it will not have a very widespread impact, and its interaction with regalloc is relatively small. So I believe both the complexity and the impact of such splitting will be under control.