I was trying to trick Clang into generating optimal code for a 128-bit "count leading zeros" function. I started with the most naive version: using u128 = __uint128_t; using u64 = uint64_t; int countleadingzerosA(u128 x) { return (x >> 64) ? __builtin_clzll(u64(x >> 64)) + 64 : __builtin_clzll(u64(x)); } testq %rsi, %rsi je .LBB1_2 bsrq %rsi, %rax xorl $63, %eax orl $64, %eax retq .LBB1_2: bsrq %rdi, %rax xorl $63, %eax retq By examining the generated code (-O3), I was able to produce this version: int countleadingzerosB(u128 x) { int lo = 63 - __builtin_clzll(u64(x)) + 64; int hi = 63 - __builtin_clzll(u64(x >> 64)); return 63 - ((x >> 64) ? hi : lo); } bsrq %rdi, %rax xorl $64, %eax bsrq %rsi, %rcx testq %rsi, %rsi cmovel %eax, %ecx movl $63, %eax subl %ecx, %eax retq I believe there are two missed optimizations here that MIGHT be relatively straightforward to implement. Number 1: According to https://www.felixcloutier.com/x86/BSR.html the instruction `bsrq %rsi, %rcx` sets ZF in exactly the same way as `testq %rsi, %rsi`, so the `testq` instruction is redundant. Number 2: If dataflow can remember that the value in %rax is in the range [0..63], then it should be able to apply the same SUB-to-XOR strength reduction that it did in countleadingzerosA: we can replace the final two instructions with `xorl $63, %ecx; movl %ecx, %eax` and then the `movl` can be eliminated by register allocation. I can easily work around number 2 by just doing `return 63 ^ ...` instead of `return 63 - ...`; I was just surprised that the compiler would do SUB-to-XOR in some places but then seemingly miss it in others.
I take back number 2: brain fart on my part! `lo` is not in the range [0..63], so subtracting from 63 and XORing with 63 will give different results. And what I meant to do was XOR, so, my "workaround" is actually the only way to do it correctly. Number 1, the zero flag set by BSR, is still a legit missed optimization AFAIK.
I just fixed the zero bit on bsr in r349531. I was coincidentally wondering about that last night while I was doing some other flag investigation. We knew we could do it for BSF but not BSR.
> I just fixed the zero bit on bsr in r349531. I confirm that https://godbolt.org/z/cbZiii is perfect now. This issue can be closed FIXED as far as I'm concerned. Thanks! :)