x86 intrinsics _addcarry_u32 and _addcarry_u64 generate silly code. For example, the following function to get the result of a 128-bit addition (the XOR is to make the output clearer):

  u64 testcarry(u64 a, u64 b, u64 c, u64 d) {
    u64 result0, result1;
    _addcarry_u64(_addcarry_u64(0, a, c, &result0), b, d, &result1);
    return result0 ^ result1;
  }

This is the code generated with -O1, -O2 and -O3:

  xor eax, eax
  add al, -1
  adc rdi, rdx
  mov qword ptr [rsp - 8], rdi
  setb al
  add al, -1
  adc rsi, rcx
  mov qword ptr [rsp - 16], rsi
  xor rsi, qword ptr [rsp - 8]
  mov rax, rsi
  ret

The first silliness is that _addcarry_u64 does not optimize for a compile-time constant 0 as the carry-in parameter. Instead of "adc", it should just use "add". The second silliness is the use of a byte register to store the carry flag, then "add al, -1" to put it back into the carry flag. The third silliness is that _addcarry_u64 is taking its pointer parameter too literally; it shouldn't be storing values to memory at all. Instead, the code should be something like this:

  add rdx, rdi
  mov rax, rdx
  adc rcx, rsi
  xor rax, rcx
  ret

Naturally, for something this simple, I'd use unsigned __int128, but this came up in large number math. Cross-filed with GCC (with different generated code since it's a different compiler =^-^=): https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67317
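For comparison, here is a minimal sketch (the typedef, function name, and shifts are my own illustration, not code from the report) of the same 128-bit addition written with unsigned __int128, which Clang typically lowers to the expected add/adc pair; this is roughly the codegen the intrinsic version ought to match:

  typedef unsigned long long u64;

  u64 testcarry_u128(u64 a, u64 b, u64 c, u64 d) {
    // (b:a) + (d:c), with a/c as the low words and b/d as the high words.
    unsigned __int128 x = ((unsigned __int128)b << 64) | a;
    unsigned __int128 y = ((unsigned __int128)d << 64) | c;
    unsigned __int128 sum = x + y;
    return (u64)sum ^ (u64)(sum >> 64);  // result0 ^ result1
  }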
I have been working on a series of patches to improve that situation. Reviews are welcome.

https://reviews.llvm.org/D29541
https://reviews.llvm.org/D29528
https://reviews.llvm.org/D29524
https://reviews.llvm.org/D29443
https://godbolt.org/z/JsSJR8

With Clang trunk as of 2018-11-11, this simple "addcarry" now correctly generates optimal code, which is great:

  void badadd(u64& a, const u64& b) {
    _addcarry_u64(0, a, b, &a);
  }

  movq (%rsi), %rax
  addq %rax, (%rdi)
  ret

But the equivalent code for "subborrow" generates suboptimal code:

  void badsub(u64& a, const u64& b) {
    _subborrow_u64(0, a, b, &a);
  }

  movq (%rdi), %rax
  subq (%rsi), %rax
  movq %rax, (%rdi)
  ret

"badsub" SHOULD generate the optimal code of

  movq (%rsi), %rax
  subq %rax, (%rdi)
  ret
Both of these trivial functions generate suboptimal code: https://godbolt.org/z/L4nbS2

  void plusequal(u64 *p, u64 x) {
    _addcarry_u64(0, *p, x, p);
  }

  void minusequal(u64 *p, u64 x) {
    _subborrow_u64(0, *p, x, p);
  }

"plusequal" generates

  xorl %eax, %eax
  addb $-1, %al
  adcq %rsi, (%rdi)
  retq

instead of

  addq %rsi, (%rdi)
  retq

"minusequal" generates

  xorl %eax, %eax
  addb $-1, %al
  sbbq %rsi, (%rdi)
  retq

instead of

  subq %rsi, (%rdi)
  retq
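For context, here is a sketch (my own illustration, not code from this report; the helper name add256 is hypothetical) of the multi-word pattern these intrinsics are meant for. plusequal/minusequal above are just the degenerate single-word case with a constant-zero carry-in, which is why they should collapse to a plain add/sub to memory:

  #include <immintrin.h>

  typedef unsigned long long u64;

  // acc += x for two 4-word (256-bit) numbers, chaining the carry.
  void add256(u64 acc[4], const u64 x[4]) {
    unsigned char c = 0;
    for (int i = 0; i < 4; ++i)
      c = _addcarry_u64(c, acc[i], x[i], &acc[i]);
  }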
None of these look great with clang trunk at the moment (Arthur - your godbolt in Comment #2 was using gcc, unfortunately...).

https://godbolt.org/z/8rfUWk

Adding Craig as he's been cleaning up the X86 EFLAGS support recently.
I made a fix for the case in Comment 3 in r348727. I wouldn't be surprised if there are more bugs out there.
@Craig, that's awesome! I confirm that I'm seeing perfect codegen in more cases than I used to. :)

Were you inviting me to submit more suboptimal cases? Here's another one where I'm trying to generate a carry-in to _subborrow_u64 and it's not quite getting it:

https://godbolt.org/z/Km0CAX

  bool less_than_ideal(u64 xlo, u64 xhi, u64 *y) {
    bool cf = (xlo < y[0]);
    return _subborrow_u64(cf, xhi, y[1], &xhi);
  }

Should generate:

  cmpq (%rdx), %rdi
  sbbq 8(%rdx), %rsi
  setb %al
  retq

Currently generates:

  cmpq %rdi, (%rdx)
  seta %al
  addb $-1, %al
  sbbq 8(%rdx), %rsi
  setb %al
  retq
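For reference, a sketch of what less_than_ideal computes (my own restatement; the typedef and function name are illustrative): an unsigned 128-bit less-than of (xhi:xlo) against the two words at y, which is why the whole thing should reduce to a cmp/sbb/setb sequence. The intrinsic form matters for the general n-word case, but for two words this __int128 version is an easy way to check what optimal codegen looks like:

  typedef unsigned long long u64;

  bool less_than_u128(u64 xlo, u64 xhi, const u64 *y) {
    unsigned __int128 x  = ((unsigned __int128)xhi << 64) | xlo;
    unsigned __int128 yv = ((unsigned __int128)y[1] << 64) | y[0];
    return x < yv;
  }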
This last one seems to be due to us not being able to commute compares to get the necessary X86::COND_B.

https://godbolt.org/z/ZrzIfD

Craig, this looks like we can do this in combineCarryThroughADD?
(In reply to Simon Pilgrim from comment #7)
> This last one seems to be due to us not being able to commute compares to
> get the necessary X86::COND_B
>
> https://godbolt.org/z/ZrzIfD
>
> Craig this looks like we can do this in combineCarryThroughADD?

Candidate Patch: https://reviews.llvm.org/D57281
Fixed Comment #6 in rL352289 (with a follow-up at rL352291).