struct x { int bar(); };
struct y { int foo(); struct x *m_; int n_; };

int y::foo() {
  if (m_ == 0)
    return -1;
  int n;
  do { (n) = __atomic_load_n((&n_), 0); } while (0);
  if (__builtin_expect(n == -1, 0)) {
    n = m_->bar();
    __atomic_store_n((&n_), (n), 0);
  }
  return n;
}

Built with -fno-omit-frame-pointer -O2.

GCC version:

        cmpq    $0, (%rdi)
        je      .L3
        movl    8(%rdi), %eax
        cmpl    $-1, %eax
        je      .L9
        ret
        ....

LLVM version:

        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %rbx
        pushq   %rax
        movq    %rdi, %rbx
        movq    (%rbx), %rdi
        movl    $-1, %eax
        testq   %rdi, %rdi
        je      .LBB0_3
        movl    8(%rbx), %eax
        cmpl    $-1, %eax
        je      .LBB0_2
.LBB0_3:
        addq    $8, %rsp
        popq    %rbx
        popq    %rbp
        retq
        ......

GCC takes 6 instructions on this path while LLVM takes 16.
Register allocation happens before shrink wrapping, so we copy RDI into RBX in the entry block to avoid spilling it around the call. That use of RBX blocks the shrink wrapping pass.
> Register allocation happens before shrink wrapping, so we copy RDI into
> RBX in the entry block to avoid spilling it around the call. That use of
> RBX blocks the shrink wrapping pass.

We can avoid copying RDI into RBX by splitting the live range of the parameter allocated to RDI. GCC doesn't generate the copy instructions in the prologue because it has an optimization called split_live_ranges_for_shrink_wrap, which splits the live ranges of input parameters across calls. LLVM doesn't contain this kind of splitting. I am planning to add it.

Here is another testcase revealing the same problem:

------------------------------------------------------------------------
volatile long *p;
long cond, ret;
void goo(long i, long j, long k);
void hoo();

void foo(long i, long j, long k) {
  if (__builtin_expect(cond, 0)) {
    goo(i, j, k);
    ret = i + j + k;
  }
  p = __builtin_alloca(3);
}
------------------------------------------------------------------------

***** gcc code *****
foo:
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %r13
        pushq   %r12
        pushq   %rbx
        subq    $8, %rsp
        cmpq    $0, cond(%rip)
        jne     .L5
.L2:
        subq    $32, %rsp
        leaq    15(%rsp), %rax
        andq    $-16, %rax
        movq    %rax, p(%rip)
        leaq    -24(%rbp), %rsp
        popq    %rbx
        popq    %r12
        popq    %r13
        popq    %rbp
        ret
        .p2align 4,,10
        .p2align 3
.L5:
        movq    %rsi, %r13
        movq    %rdi, %rbx
        movq    %rdx, %r12
        call    goo
        leaq    (%rbx,%r13), %rdi
        leaq    (%rdi,%r12), %rdx
        movq    %rdx, ret(%rip)
        jmp     .L2
***** end gcc code *****

***** llvm code *****
        .type   foo,@function
foo:                                    # @foo
# BB#0:                                 # %entry
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %r15
        pushq   %r14
        pushq   %rbx
        pushq   %rax
        movq    %rdx, %r14              // this mov inst is not in the gcc code
        movq    %rsi, %r15              // this mov inst is not in the gcc code
        movq    %rdi, %rbx              // this mov inst is not in the gcc code
        cmpq    $0, cond(%rip)
        jne     .LBB0_1
.LBB0_2:                                # %if.end
        movq    %rsp, %rax
        addq    $-16, %rax
        movq    %rax, %rsp
        movq    %rax, p(%rip)
        leaq    -24(%rbp), %rsp
        popq    %rbx
        popq    %r14
        popq    %r15
        popq    %rbp
        retq
.LBB0_1:                                # %if.then
        movq    %rbx, %rdi
        movq    %r15, %rsi
        movq    %r14, %rdx
        callq   goo
        addq    %rbx, %r15
        addq    %r14, %r15
        movq    %r15, ret(%rip)
        jmp     .LBB0_2
.Lfunc_end0:
        .size   foo, .Lfunc_end0-foo
        .cfi_endproc
***** end llvm code *****
(In reply to comment #2)
> > Register allocation happens before shrink wrapping, so we copy RDI into
> > RBX in the entry block to avoid spilling it around the call. That use of
> > RBX blocks the shrink wrapping pass.
>
> We can avoid copying RDI into RBX by splitting the live range of the
> parameter allocated to RDI. GCC doesn't generate the copy instructions in
> the prologue because it has an optimization called
> split_live_ranges_for_shrink_wrap, which splits the live ranges of input
> parameters across calls. LLVM doesn't contain this kind of splitting. I am
> planning to add it.

Before adding a new splitting scheme, could you check why the current ones do not do what we want?

I suspect the region splitting happens, but it would also be interesting to see why this is not handled just within the basic block of the call.
> > We can avoid copying RDI into RBX by splitting the live range of the
> > parameter allocated to RDI. GCC doesn't generate the copy instructions in
> > the prologue because it has an optimization called
> > split_live_ranges_for_shrink_wrap, which splits the live ranges of input
> > parameters across calls. LLVM doesn't contain this kind of splitting. I
> > am planning to add it.
>
> Before adding a new splitting scheme, could you check why the current ones
> do not do what we want?
>
> I suspect the region splitting happens, but it would also be interesting to
> see why this is not handled just within the basic block of the call.

For the testcase I pasted, the available registers are more than enough, so no splitting of any kind will be triggered.

The problem in the testcase is this: the param i lives across the call to goo, and although we can easily find a callee-saved register for it, the best register for it is RDI (so that the copy from RDI in the prologue can be removed). We need to split the live ranges of params across calls regardless of the program's register pressure.

I think such special splitting may be better done separately, before the regular regalloc starts, since it is not compatible with the existing splitting mechanism, which is triggered only when no physreg is available.
> I think such special splitting may be better done separately, before the
> regular regalloc starts, since it is not compatible with the existing
> splitting mechanism, which is triggered only when no physreg is available.

Is it also correct to say that this is only profitable if we actually shrink wrap?
(In reply to comment #5)
> > I think such special splitting may be better done separately, before the
> > regular regalloc starts, since it is not compatible with the existing
> > splitting mechanism, which is triggered only when no physreg is available.
>
> Is it also correct to say that this is only profitable if we actually
> shrink wrap?

It may be profitable even if shrink wrapping doesn't actually happen. Look at the code GCC and LLVM generate for the testcase I pasted above: neither of them shrink wraps, but GCC generates better code because of the splitting. So maybe we can find a better name than GCC's split_live_ranges_for_shrink_wrap.
> I think such special splitting may be better done separately, before the
> regular regalloc starts, since it is not compatible with the existing
> splitting mechanism, which is triggered only when no physreg is available.

So far we have tried to stay away from any form of pre-splitting so as not to introduce another set of heuristics. Wouldn't this be caught by some lazy code motion or post-RA scheduling? (Assuming those optimizations would work at a bigger scope than a basic block.)

What I am saying is: if this is something we can optimize after RA, I would rather go in that direction. Maybe it is time to push shrink-wrapping to the next level.
> So far we have tried to stay away from any form of pre-splitting so as not
> to introduce another set of heuristics. Wouldn't this be caught by some
> lazy code motion or post-RA scheduling? (Assuming those optimizations would
> work at a bigger scope than a basic block.)
>
> What I am saying is: if this is something we can optimize after RA, I would
> rather go in that direction. Maybe it is time to push shrink-wrapping to
> the next level.

I think it is much easier to do this before RA. During RA, still for the testcase I posted, if i is not split, it will be allocated to a callee-saved register, for example RBX, since it has a use after the call to goo:

prologue:
        movq    %rdi, %rbx
        ...
body:
        ...
        movq    %rbx, %rdi
        callq   goo
        addq    %rbx, %r15

It is harder for a post-RA optimization to change the generated code to the form below, since it still requires collecting the global information that RDI is not changed on any path from the entry to the goo callsite. Post-RA optimizations usually tend to be simple and BB-local. Another reason is that there is a good chance a post-RA optimization cannot do the transformation at all if RA chooses to use RDI for another live range somewhere between the prologue and the goo callsite.

prologue:
        movq    %rdi, %rdi
        ...
body:
        ...
        movq    %rdi, %rbx
        callq   goo
        addq    %rbx, %r15

However, it is more natural to achieve this by splitting i's live range before regalloc; regalloc will then take care of everything and generate the good code we expect. The splitting heuristic will also be simple: we only have to split i's live range into two intervals. First find the smallest region covering all the callsites, then split i's live range at the border of that region. Since the splitting is limited to function params, it will not have a very widespread impact, and the interaction with regalloc is relatively small. So I believe both the complexity and the impact of such splitting will be under control.