On x86 the 32 and 64 bit versions of bitreverse look very good, but the generated code for the 8-bit version looks worse than the naive implementation: https://godbolt.org/g/kAlOC7 @bitrev_8_naive(unsigned char) mov eax, 2149582850 imul rax, rdi movabs rcx, 36578664720 and rcx, rax movabs rax, 4311810305 imul rax, rcx shr rax, 32 ret vs llvm.bitreverse.i8: @bitrev_8(unsigned char) rol dil, 4 mov eax, edi and al, 51 shl al, 2 and dil, -52 shr dil, 2 or dil, al mov eax, edi and al, 85 add al, al and dil, -86 shr dil or dil, al mov eax, edi ret
FWIW the naive version contains 7 instructions vs 14 of the generated version. In my machine (1.8 GHz Intel Core i5) the naive version is ~2x faster: 585ns/iter vs 1214ns/iter with a variance of 200ns/iteration.
I'm not sure I'd call your version "naive", it's pretty clever, though I suppose it's a well-known technique. X86 has some custom lowering for certain bit reverses with SSSE3 or XOP, but none of that is firing here. I think the code we're seeing here is coming from SelectionDAGLegalize::ExpandBITREVERSE(). It's probably not hard to add the clever code for i8 there.
Adding the naive method would be very easy (and I've been meaning to investigate the performance of using PSHUFB for i16/i32/i64 for some time similar to what is done on XOP). My only question is whether you guys think the naive method should only be used for x86_64 targets or any target that supports legal i64 multiplies?
> I'm not sure I'd call your version "naive", it's pretty clever, though I suppose it's a well-known technique. Oh sure, sorry about that, well-known is indeed a better term than naive. I just checked the 8, 32, and 64 bit versions of bitreverse, but it might be worth checking the generated assembly of the 16 and 128bit versions as well. > My only question is whether you guys think the naive method should only be used for x86_64 targets or any target that supports legal i64 multiplies? ARM has rbit instructions (32bit in >= ARMv7, and 64bit in >= ARMv8) that might be faster than these "hacks" on some ARM targets. I cannot say upfront whether the 8bit version here is faster on ARMv7 than doing the reversal on a 32bit register using rbit, nor I can say whether the answer would be the same for ARMv8 or future ARM architectures, so... I would guess that this is for a cost model to decide. Also, x86_64 doesn't have an rbit instruction _yet_, but who knows, maybe some day BMI 3.0 arrives with an rbit instruction that is faster than the naive version.
That's fine - I was talking about legalization only, any target that already has a legal/custom way to lower BITREVERSE for a type shouldn't be affected.
Created attachment 18811 [details] 8 bit bitreverse performance measurment code compares performance for five different 8 bit bit reverse implementations: 1. clang builtin 2. integer multiplications 3. xor and rotate 4. swap bits 5. table lookup
I did performance measurement for different 8bit bit reverse implementation. The result is following (i7-6900 3.2Gz): measuring for 8bit bitreverse bitrev_8_builtin secs:nsecs: 1:857424417 measuring for 8bit bitreverse bitrev_8_mul secs:nsecs: 1:690300201 measuring for 8bit bitreverse bitrev_8_rotate secs:nsecs: 2:98403306 measuring for 8bit bitreverse bitrev_8_lgN secs:nsecs: 1:680574415 measuring for 8bit bitreverse bitrev_8_lookup secs:nsecs: 1:405522728 so, the judging on performance, on the first place is of course the table lookup bitreverse. on the second place is bitreverse lgN: bitrev_8_lgNh: 1 movl %edi, %eax 2 shrb %al 3 andb $85, %al 4 addb %dil, %dil 5 andb $-86, %dil 6 orb %al, %dil 7 movl %edi, %eax 8 shrb $2, %al 9 andb $51, %al 10 shlb $2, %dil 11 andb $-52, %dil 12 orb %al, %dil 13 rolb $4, %dil 14 movl %edi, %eax bitreverse that uses multiplication is on 3 place, but it uses lesser number of machine instructions. And obviously the MUL instruction timing is varies greatly for one CPU to another. bitrev_8_mul: 1 movl $2149582850, %eax # imm = 0x80200802 2 imulq %rdi, %rax 3 movabsq $36578664720, %rcx # imm = 0x884422110 4 andq %rax, %rcx 5 movabsq $4311810305, %rax # imm = 0x101010101 6 imulq %rcx, %rax 7 shrq $32, %rax Our current builtin is on the 4 place despite the fact that is based on lgN algorithm. bitrev_8_builtin: 1 rolb $4, %dil 2 movl %edi, %eax 3 andb $51, %al 4 shlb $2, %al 5 andb $-52, %dil 6 shrb $2, %dil 7 orb %al, %dil 8 movl %edi, %eax 9 andb $85, %al 10 addb %al, %al 11 andb $-86, %dil 12 shrb %dil 13 orb %al, %dil 14 movl %edi, %eax As a conclusion, I implemented builtin_bitreverse8 based on shorter, manually optimized xor and rotate bit reverse algorithm. 1 movl %edi, %ecx 2 andb $85, %cl 3 xorb %cl, %dil 4 rolb $2, %cl 5 orb %dil, %cl 6 movl %ecx, %eax 7 andb $102, %al 8 xorb %al, %cl 9 rolb $4, %al 10 orb %cl, %al 11 rolb %al it is 3 instruction shorter than the current bit reverse builtint and gives not very significant but some performance advantage.
So I've repeated the benchmarks on my machine (1.8 GHz Intel Core i5, clang -Ofast -DNDEBUG): measuring for 8bit bitreverse bitrev_8_builtin secs:nsecs: 2:796269000 measuring for 8bit bitreverse bitrev_8_mul secs:nsecs: 2:758230000 measuring for 8bit bitreverse bitrev_8_rotate secs:nsecs: 3:333305000 measuring for 8bit bitreverse bitrev_8_lgN secs:nsecs: 2:371445000 measuring for 8bit bitreverse bitrev_8_lookup secs:nsecs: 2:788676000 which shows a slightly different trend, from fastest to slowest: [lgN, mul, lookup, builtin, rotate]. > As a conclusion, I implemented builtin_bitreverse8 based on shorter, manually optimized xor and rotate bit reverse algorithm. Given that lgN performed very good in your system as well, I think it might make sense to use lgN when optimizing for speed, and mul when optimizing for code size.
Created attachment 18840 [details] 8 bit bit reverse performance measurement updated updated version of 8 bit reversal algorithm performance measurement. more precise result, less noise, asm inlines to prevent aggressive clang optimizations
Created attachment 18841 [details] 8 bit bit reversal performance measurement in cpu clocks another performance measurement application for 8 bit bit reversal algorithm, it uses RDTSC instruction to get a value of a CPU time stamp register and CPUID for serialization. For each measured sample it computes variance and standard deviance.
results that I got form updated benchmark (i7-6900): measuring time for 8bit bitreverse bitrev_8_empty, iterations: 100000000 total time (secs:nsecs): 2:452109938 average time per loop (nsecs): 24, 10 call per loop measuring time for 8bit bitreverse bitrev_8_loop, iterations: 100000000 total time (secs:nsecs): 7:56755478 average time per loop (nsecs): 70, 10 call per loop measuring time for 8bit bitreverse bitrev_8_builtin, iterations: 100000000 total time (secs:nsecs): 4:722134246 average time per loop (nsecs): 47, 10 call per loop measuring time for 8bit bitreverse bitrev_8_mul, iterations: 100000000 total time (secs:nsecs): 4:590241129 average time per loop (nsecs): 45, 10 call per loop measuring time for 8bit bitreverse bitrev_8_rotate, iterations: 100000000 total time (secs:nsecs): 4:658893959 average time per loop (nsecs): 46, 10 call per loop measuring time for 8bit bitreverse bitrev_8_lgN, iterations: 100000000 total time (secs:nsecs): 4:748335384 average time per loop (nsecs): 47, 10 call per loop measuring time for 8bit bitreverse bitrev_8_lookup, iterations: 100000000 total time (secs:nsecs): 3:988580325 average time per loop (nsecs): 39, 10 call per loop Result from rdtsc-bench: [0] measuring clocks for 8 bit bitreverse bitrev_8_empty ------------------------------------------------------------------------ total number of spurious min values = 0 variance of minimum values = 0 variance of average values = 0 variance of maximum values = 234830314 average of minimum values = 0, standard deviance = 0 average of average values = 0, standard deviance = 0 average of maximum values = 6144, standard deviance = 15324 [0] measuring clocks for 8 bit bitreverse bitrev_8_loop ------------------------------------------------------------------------ total number of spurious min values = 0 variance of minimum values = 0 variance of average values = 1 variance of maximum values = 596016522 average of minimum values = 3, standard deviance = 0 average of average values = 18, standard deviance = 1 average of maximum values = 25560, standard deviance = 24413 [0] measuring clocks for 8 bit bitreverse bitrev_8_builtin ------------------------------------------------------------------------ total number of spurious min values = 0 variance of minimum values = 0 variance of average values = 9 variance of maximum values = 537886145 average of minimum values = 3, standard deviance = 0 average of average values = 9, standard deviance = 3 average of maximum values = 18827, standard deviance = 23192 [0] measuring clocks for 8 bit bitreverse bitrev_8_mul ------------------------------------------------------------------------ total number of spurious min values = 22 variance of minimum values = 0 variance of average values = 1 variance of maximum values = 548319231 average of minimum values = 2, standard deviance = 0 average of average values = 5, standard deviance = 1 average of maximum values = 18860, standard deviance = 23416 [0] measuring clocks for 8 bit bitreverse bitrev_8_rotate ------------------------------------------------------------------------ total number of spurious min values = 0 variance of minimum values = 0 variance of average values = 9 variance of maximum values = 501794384 average of minimum values = 3, standard deviance = 0 average of average values = 8, standard deviance = 3 average of maximum values = 16474, standard deviance = 22400 [0] measuring clocks for 8 bit bitreverse bitrev_8_lgN ------------------------------------------------------------------------ total number of spurious min values = 0 variance of minimum values = 0 variance of average values = 11 variance of maximum values = 528092785 average of minimum values = 3, standard deviance = 0 average of average values = 10, standard deviance = 3 average of maximum values = 18202, standard deviance = 22980 [0] measuring clocks for 8 bit bitreverse bitrev_8_lookup ------------------------------------------------------------------------ total number of spurious min values = 28 variance of minimum values = 13 variance of average values = 7 variance of maximum values = 424783162 average of minimum values = 3, standard deviance = 3 average of average values = 7, standard deviance = 2 average of maximum values = 12182, standard deviance = 20610 It looks like it does not make much sense to change the current bitreverse builtin implementation, performance benefit is either non exiting or very marginal.
What's the conclusion here? Non-issue won't fix? How come the differences were so big in the initial benchmarks, but so small in the later ones?
Conclusion is: the current builtin_bitreverse8() is good enough and its performance balanced across various platforms and there is a little reason to change it. On some platforms IMUL bit reverse alghoritm is a little bit faster, but on others platforms IMUL (ps4 for example) is much slower. I have added yet another benchmark application, this time I used inline assembler to prevent compiler optimizations, and also for better accuracy this benchmark uses batches instead of multiple calls to a single function. When running on Linux Xeon E5-1620 3.7Gz: measuring time for: ::: asm_bitrev_8_nop_100() :::, iterations: 100000 total time (nanoseconds): 26878270 average time per loop (nanoseconds): 268, 10 calls per loop --------------------------------------------------------------- --------------------------------------------------------------- measuring time for: ::: asm_bitrev_8_lgN_100() :::, iterations: 100000 total time (nanoseconds): 274648219 average time per loop (nanoseconds): 2746, 10 calls per loop --------------------------------------------------------------- --------------------------------------------------------------- measuring time for: ::: asm_bitrev_8_lookup_100() :::, iterations: 100000 total time (nanoseconds): 35146944 average time per loop (nanoseconds): 351, 10 calls per loop --------------------------------------------------------------- --------------------------------------------------------------- measuring time for: ::: asm_bitrev_8_loop_100() :::, iterations: 100000 total time (nanoseconds): 392665441 average time per loop (nanoseconds): 3926, 10 calls per loop --------------------------------------------------------------- --------------------------------------------------------------- measuring time for: ::: asm_bitrev_8_rotate_100() :::, iterations: 100000 total time (nanoseconds): 246278077 average time per loop (nanoseconds): 2462, 10 calls per loop --------------------------------------------------------------- --------------------------------------------------------------- measuring time for: ::: asm_bitrev_8_mul_100() :::, iterations: 100000 total time (nanoseconds): 202603630 average time per loop (nanoseconds): 2026, 10 calls per loop --------------------------------------------------------------- --------------------------------------------------------------- measuring time for: ::: bitrev_8_builtin_100() :::, iterations: 100000 total time (nanoseconds): 287725718 average time per loop (nanoseconds): 2877, 10 calls per loop --------------------------------------------------------------- When running on PS4: --------------------------------------------------------------- measuring time for: ::: asm_bitrev_8_nop_100() :::, iterations: 100000 total time (nanoseconds): 96728000 average time per loop (nanoseconds): 967, 10 calls per loop --------------------------------------------------------------- --------------------------------------------------------------- measuring time for: ::: asm_bitrev_8_lgN_100() :::, iterations: 100000 total time (nanoseconds): 631655000 average time per loop (nanoseconds): 6316, 10 calls per loop --------------------------------------------------------------- --------------------------------------------------------------- measuring time for: ::: asm_bitrev_8_lookup_100() :::, iterations: 100000 total time (nanoseconds): 102356000 average time per loop (nanoseconds): 1023, 10 calls per loop --------------------------------------------------------------- --------------------------------------------------------------- measuring time for: ::: asm_bitrev_8_loop_100() :::, iterations: 100000 total time (nanoseconds): 1933014000 average time per loop (nanoseconds): 19330, 10 calls per loop --------------------------------------------------------------- --------------------------------------------------------------- measuring time for: ::: asm_bitrev_8_rotate_100() :::, iterations: 100000 total time (nanoseconds): 568562000 average time per loop (nanoseconds): 5685, 10 calls per loop --------------------------------------------------------------- --------------------------------------------------------------- measuring time for: ::: asm_bitrev_8_mul_100() :::, iterations: 100000 [SceShellUI] I/PSM.UI : [NoDraw] : ClearFrameBuffer=False total time (nanoseconds): 975727000 average time per loop (nanoseconds): 9757, 10 calls per loop --------------------------------------------------------------- --------------------------------------------------------------- measuring time for: ::: bitrev_8_builtin_100() :::, iterations: 100000 total time (nanoseconds): 694057000 average time per loop (nanoseconds): 6940, 10 calls per loop --------------------------------------------------------------- As you can see, on Xeon E5, IMUL_BITREVERSE is better than BUILTIN_BITREVERSE, but on PS4 it is worse. So, the current builtin_bitreverse() which is based on LogN algorithm,is good enough. It might be changed on a platform that has dedicated instruction for bits reversal.
Created attachment 20116 [details] yet another benchmark application for 8 bit bit reversal algorithms clang++ -O1 -std=c++14 bench.cpp -o bench