New user self-registration is disabled due to spam. For an account please email bugs-admin@lists.llvm.org with your e-mail address and full name.

Bug 31810 - llvm.bitreverse.i8 on x86 lowers to questionable assembly
Summary: llvm.bitreverse.i8 on x86 lowers to questionable assembly
Status: NEW
Alias: None
Product: libraries
Classification: Unclassified
Component: Backend: X86 (show other bugs)
Version: trunk
Hardware: PC All
: P normal
Assignee: Konstantin Belochapka
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-01-31 03:58 PST by Gonzalo BG
Modified: 2018-03-23 20:34 PDT (History)
5 users (show)

See Also:
Fixed By Commit(s):


Attachments
8 bit bitreverse performance measurment (3.05 KB, text/x-c++src)
2017-07-17 16:59 PDT, Konstantin Belochapka
Details
8 bit bit reverse performance measurement updated (6.77 KB, text/plain)
2017-07-24 17:25 PDT, Konstantin Belochapka
Details
8 bit bit reversal performance measurement in cpu clocks (16.59 KB, text/plain)
2017-07-24 17:35 PDT, Konstantin Belochapka
Details
yet another benchmark application for 8 bit bit reversal algorithms (15.55 KB, text/plain)
2018-03-23 20:34 PDT, Konstantin Belochapka
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Gonzalo BG 2017-01-31 03:58:59 PST
On x86 the 32 and 64 bit versions of bitreverse look very good, but the generated code for the 8-bit version looks worse than the naive implementation:

https://godbolt.org/g/kAlOC7

@bitrev_8_naive(unsigned char)
        mov     eax, 2149582850
        imul    rax, rdi
        movabs  rcx, 36578664720
        and     rcx, rax
        movabs  rax, 4311810305
        imul    rax, rcx
        shr     rax, 32
        ret

vs llvm.bitreverse.i8:

@bitrev_8(unsigned char)
        rol     dil, 4
        mov     eax, edi
        and     al, 51
        shl     al, 2
        and     dil, -52
        shr     dil, 2
        or      dil, al
        mov     eax, edi
        and     al, 85
        add     al, al
        and     dil, -86
        shr     dil
        or      dil, al
        mov     eax, edi
        ret
Comment 1 Gonzalo BG 2017-01-31 06:03:16 PST
FWIW the naive version contains 7 instructions vs 14 of the generated version. In my machine (1.8 GHz Intel Core i5) the naive version is ~2x faster: 585ns/iter vs 1214ns/iter with a variance of 200ns/iteration.
Comment 2 Hans Wennborg 2017-03-13 15:42:50 PDT
I'm not sure I'd call your version "naive", it's pretty clever, though I suppose it's a well-known technique.

X86 has some custom lowering for certain bit reverses with SSSE3 or XOP, but none of that is firing here.

I think the code we're seeing here is coming from SelectionDAGLegalize::ExpandBITREVERSE(). It's probably not hard to add the clever code for i8 there.
Comment 3 Simon Pilgrim 2017-03-14 04:58:37 PDT
Adding the naive method would be very easy (and I've been meaning to investigate the performance of using PSHUFB for i16/i32/i64 for some time similar to what is done on XOP).

My only question is whether you guys think the naive method should only be used for x86_64 targets or any target that supports legal i64 multiplies?
Comment 4 Gonzalo BG 2017-03-14 05:07:28 PDT
> I'm not sure I'd call your version "naive", it's pretty clever, though I suppose it's a well-known technique.

Oh sure, sorry about that, well-known is indeed a better term than naive. 

I just checked the 8, 32, and 64 bit versions of bitreverse, but it might be worth checking the generated assembly of the 16 and 128bit versions as well. 

> My only question is whether you guys think the naive method should only be used for x86_64 targets or any target that supports legal i64 multiplies?

ARM has rbit instructions (32bit in >= ARMv7, and 64bit in >= ARMv8) that might be faster than these "hacks" on some ARM targets. I cannot say upfront whether the 8bit version here is faster on ARMv7 than doing the reversal on a 32bit register using rbit, nor I can say whether the answer would be the same for ARMv8 or future ARM architectures, so... I would guess that this is for a cost model to decide.

Also, x86_64 doesn't have an rbit instruction _yet_, but who knows, maybe some day BMI 3.0 arrives with an rbit instruction that is faster than the naive version.
Comment 5 Simon Pilgrim 2017-03-14 07:16:25 PDT
That's fine - I was talking about legalization only, any target that already has a legal/custom way to lower BITREVERSE for a type shouldn't be affected.
Comment 6 Konstantin Belochapka 2017-07-17 16:59:16 PDT
Created attachment 18811 [details]
8 bit bitreverse performance measurment

code compares performance for five different 8 bit bit reverse implementations:
1. clang builtin
2. integer multiplications 
3. xor and rotate 
4. swap bits
5. table lookup
Comment 7 Konstantin Belochapka 2017-07-17 18:08:25 PDT
I did performance measurement for different 8bit bit reverse implementation.
The result is following (i7-6900 3.2Gz):

measuring for 8bit bitreverse bitrev_8_builtin
secs:nsecs: 1:857424417
measuring for 8bit bitreverse bitrev_8_mul
secs:nsecs: 1:690300201
measuring for 8bit bitreverse bitrev_8_rotate
secs:nsecs: 2:98403306
measuring for 8bit bitreverse bitrev_8_lgN
secs:nsecs: 1:680574415
measuring for 8bit bitreverse bitrev_8_lookup
secs:nsecs: 1:405522728

so, the judging on performance,
on the first place is of course the table lookup bitreverse.

on the second place is bitreverse lgN:

bitrev_8_lgNh:
1	movl	%edi, %eax
2	shrb	%al
3	andb	$85, %al
4	addb	%dil, %dil
5	andb	$-86, %dil
6	orb	%al, %dil
7	movl	%edi, %eax
8	shrb	$2, %al
9	andb	$51, %al
10	shlb	$2, %dil
11	andb	$-52, %dil
12	orb	%al, %dil
13	rolb	$4, %dil
14	movl	%edi, %eax

bitreverse that uses multiplication is on 3 place, but it uses lesser number of machine instructions. And obviously the MUL instruction timing is 
varies greatly for one CPU to another.  

bitrev_8_mul:
1	movl	$2149582850, %eax       # imm = 0x80200802
2	imulq	%rdi, %rax
3	movabsq	$36578664720, %rcx      # imm = 0x884422110
4	andq	%rax, %rcx
5	movabsq	$4311810305, %rax       # imm = 0x101010101
6	imulq	%rcx, %rax
7	shrq	$32, %rax

Our current builtin is on the 4 place despite the fact that is based on
lgN algorithm.
 
bitrev_8_builtin:
1	rolb	$4, %dil
2	movl	%edi, %eax
3	andb	$51, %al
4	shlb	$2, %al
5	andb	$-52, %dil
6	shrb	$2, %dil
7	orb	%al, %dil
8	movl	%edi, %eax
9	andb	$85, %al
10	addb	%al, %al
11	andb	$-86, %dil
12	shrb	%dil
13	orb	%al, %dil
14	movl	%edi, %eax

As a conclusion, I implemented builtin_bitreverse8
based on shorter, manually optimized xor and rotate bit reverse algorithm.

1	movl	%edi, %ecx
2	andb	$85, %cl
3	xorb	%cl, %dil
4	rolb	$2, %cl
5	orb	%dil, %cl
6	movl	%ecx, %eax
7	andb	$102, %al
8	xorb	%al, %cl
9	rolb	$4, %al
10	orb	%cl, %al
11	rolb	%al

it is 3 instruction shorter than the current bit reverse builtint and gives not very significant but some performance advantage.
Comment 8 Gonzalo BG 2017-07-19 00:36:45 PDT
So I've repeated the benchmarks on my machine (1.8 GHz Intel Core i5, clang -Ofast -DNDEBUG):

measuring for 8bit bitreverse bitrev_8_builtin
secs:nsecs: 2:796269000
measuring for 8bit bitreverse bitrev_8_mul
secs:nsecs: 2:758230000
measuring for 8bit bitreverse bitrev_8_rotate
secs:nsecs: 3:333305000
measuring for 8bit bitreverse bitrev_8_lgN
secs:nsecs: 2:371445000
measuring for 8bit bitreverse bitrev_8_lookup
secs:nsecs: 2:788676000

which shows a slightly different trend, from fastest to slowest: 
[lgN, mul, lookup, builtin, rotate].

> As a conclusion, I implemented builtin_bitreverse8
based on shorter, manually optimized xor and rotate bit reverse algorithm.

Given that lgN performed very good in your system as well, I think it might make sense to use lgN when optimizing for speed, and mul when optimizing for code size.
Comment 9 Konstantin Belochapka 2017-07-24 17:25:15 PDT
Created attachment 18840 [details]
8 bit bit reverse performance measurement updated

updated version of 8 bit reversal algorithm performance measurement.
more precise result, less noise, asm inlines to prevent aggressive clang optimizations
Comment 10 Konstantin Belochapka 2017-07-24 17:35:59 PDT
Created attachment 18841 [details]
8 bit bit reversal performance measurement in cpu clocks

another performance measurement application for 8 bit bit reversal algorithm,
it uses RDTSC instruction to get a value of a CPU time stamp register and CPUID for serialization. For each measured sample it computes variance and standard deviance.
Comment 11 Konstantin Belochapka 2017-07-24 17:53:01 PDT
results that I got form updated benchmark (i7-6900):

measuring time for 8bit bitreverse bitrev_8_empty, iterations: 100000000
total  time (secs:nsecs): 2:452109938
average time per loop (nsecs): 24, 10 call per loop
measuring time for 8bit bitreverse bitrev_8_loop, iterations: 100000000
total  time (secs:nsecs): 7:56755478
average time per loop (nsecs): 70, 10 call per loop
measuring time for 8bit bitreverse bitrev_8_builtin, iterations: 100000000
total  time (secs:nsecs): 4:722134246
average time per loop (nsecs): 47, 10 call per loop
measuring time for 8bit bitreverse bitrev_8_mul, iterations: 100000000
total  time (secs:nsecs): 4:590241129
average time per loop (nsecs): 45, 10 call per loop
measuring time for 8bit bitreverse bitrev_8_rotate, iterations: 100000000
total  time (secs:nsecs): 4:658893959
average time per loop (nsecs): 46, 10 call per loop
measuring time for 8bit bitreverse bitrev_8_lgN, iterations: 100000000
total  time (secs:nsecs): 4:748335384
average time per loop (nsecs): 47, 10 call per loop
measuring time for 8bit bitreverse bitrev_8_lookup, iterations: 100000000
total  time (secs:nsecs): 3:988580325
average time per loop (nsecs): 39, 10 call per loop


Result from rdtsc-bench:

[0] measuring clocks for 8 bit bitreverse bitrev_8_empty
------------------------------------------------------------------------


   total number of spurious min values = 0
   variance of minimum values          = 0
   variance of average values          = 0
   variance of maximum values          = 234830314
   average of minimum values           = 0, standard deviance = 0
   average of average values           = 0, standard deviance = 0
   average of maximum values           = 6144, standard deviance = 15324

[0] measuring clocks for 8 bit bitreverse bitrev_8_loop
------------------------------------------------------------------------


   total number of spurious min values = 0
   variance of minimum values          = 0
   variance of average values          = 1
   variance of maximum values          = 596016522
   average of minimum values           = 3, standard deviance = 0
   average of average values           = 18, standard deviance = 1
   average of maximum values           = 25560, standard deviance = 24413

[0] measuring clocks for 8 bit bitreverse bitrev_8_builtin
------------------------------------------------------------------------


   total number of spurious min values = 0
   variance of minimum values          = 0
   variance of average values          = 9
   variance of maximum values          = 537886145
   average of minimum values           = 3, standard deviance = 0
   average of average values           = 9, standard deviance = 3
   average of maximum values           = 18827, standard deviance = 23192

[0] measuring clocks for 8 bit bitreverse bitrev_8_mul
------------------------------------------------------------------------


   total number of spurious min values = 22
   variance of minimum values          = 0
   variance of average values          = 1
   variance of maximum values          = 548319231
   average of minimum values           = 2, standard deviance = 0
   average of average values           = 5, standard deviance = 1
   average of maximum values           = 18860, standard deviance = 23416

[0] measuring clocks for 8 bit bitreverse bitrev_8_rotate
------------------------------------------------------------------------


   total number of spurious min values = 0
   variance of minimum values          = 0
   variance of average values          = 9
   variance of maximum values          = 501794384
   average of minimum values           = 3, standard deviance = 0
   average of average values           = 8, standard deviance = 3
   average of maximum values           = 16474, standard deviance = 22400

[0] measuring clocks for 8 bit bitreverse bitrev_8_lgN
------------------------------------------------------------------------


   total number of spurious min values = 0
   variance of minimum values          = 0
   variance of average values          = 11
   variance of maximum values          = 528092785
   average of minimum values           = 3, standard deviance = 0
   average of average values           = 10, standard deviance = 3
   average of maximum values           = 18202, standard deviance = 22980

[0] measuring clocks for 8 bit bitreverse bitrev_8_lookup
------------------------------------------------------------------------


   total number of spurious min values = 28
   variance of minimum values          = 13
   variance of average values          = 7
   variance of maximum values          = 424783162
   average of minimum values           = 3, standard deviance = 3
   average of average values           = 7, standard deviance = 2
   average of maximum values           = 12182, standard deviance = 20610



It looks like it does not make much sense to change the current bitreverse
builtin implementation, performance benefit is either non exiting or very marginal.
Comment 12 Gonzalo BG 2018-03-14 03:14:26 PDT
What's the conclusion here? Non-issue won't fix? 

How come the differences were so big in the initial benchmarks, but so small in the later ones?
Comment 13 Konstantin Belochapka 2018-03-23 20:31:57 PDT
Conclusion is: the current builtin_bitreverse8() is good enough and its performance balanced across various platforms and there is a little reason to change it. On some platforms IMUL bit reverse alghoritm is a little bit faster, but on others platforms IMUL (ps4 for example) is much slower.
I have added yet another benchmark application, this time I used inline assembler to prevent compiler optimizations, and also for better accuracy this benchmark uses batches instead of multiple calls to a single function.

When running on Linux Xeon E5-1620 3.7Gz:

measuring time for: ::: asm_bitrev_8_nop_100() :::, iterations: 100000
total time (nanoseconds): 26878270
average time per loop (nanoseconds): 268, 10 calls per loop
---------------------------------------------------------------
---------------------------------------------------------------
measuring time for: ::: asm_bitrev_8_lgN_100() :::, iterations: 100000
total time (nanoseconds): 274648219
average time per loop (nanoseconds): 2746, 10 calls per loop
---------------------------------------------------------------
---------------------------------------------------------------
measuring time for: ::: asm_bitrev_8_lookup_100() :::, iterations: 100000
total time (nanoseconds): 35146944
average time per loop (nanoseconds): 351, 10 calls per loop
---------------------------------------------------------------
---------------------------------------------------------------
measuring time for: ::: asm_bitrev_8_loop_100() :::, iterations: 100000
total time (nanoseconds): 392665441
average time per loop (nanoseconds): 3926, 10 calls per loop
---------------------------------------------------------------
---------------------------------------------------------------
measuring time for: ::: asm_bitrev_8_rotate_100() :::, iterations: 100000
total time (nanoseconds): 246278077
average time per loop (nanoseconds): 2462, 10 calls per loop
---------------------------------------------------------------
---------------------------------------------------------------
measuring time for: ::: asm_bitrev_8_mul_100() :::, iterations: 100000
total time (nanoseconds): 202603630
average time per loop (nanoseconds): 2026, 10 calls per loop
---------------------------------------------------------------
---------------------------------------------------------------
measuring time for: ::: bitrev_8_builtin_100() :::, iterations: 100000
total time (nanoseconds): 287725718
average time per loop (nanoseconds): 2877, 10 calls per loop
---------------------------------------------------------------
 


When running on PS4:
---------------------------------------------------------------
measuring time for: ::: asm_bitrev_8_nop_100() :::, iterations: 100000
total time (nanoseconds): 96728000
average time per loop (nanoseconds): 967, 10 calls per loop
---------------------------------------------------------------
---------------------------------------------------------------
measuring time for: ::: asm_bitrev_8_lgN_100() :::, iterations: 100000
total time (nanoseconds): 631655000
average time per loop (nanoseconds): 6316, 10 calls per loop
---------------------------------------------------------------
---------------------------------------------------------------
measuring time for: ::: asm_bitrev_8_lookup_100() :::, iterations: 100000
total time (nanoseconds): 102356000
average time per loop (nanoseconds): 1023, 10 calls per loop
---------------------------------------------------------------
---------------------------------------------------------------
measuring time for: ::: asm_bitrev_8_loop_100() :::, iterations: 100000
total time (nanoseconds): 1933014000
average time per loop (nanoseconds): 19330, 10 calls per loop
---------------------------------------------------------------
---------------------------------------------------------------
measuring time for: ::: asm_bitrev_8_rotate_100() :::, iterations: 100000
total time (nanoseconds): 568562000
average time per loop (nanoseconds): 5685, 10 calls per loop
---------------------------------------------------------------
---------------------------------------------------------------
measuring time for: ::: asm_bitrev_8_mul_100() :::, iterations: 100000
[SceShellUI] I/PSM.UI : [NoDraw] : ClearFrameBuffer=False
total time (nanoseconds): 975727000
average time per loop (nanoseconds): 9757, 10 calls per loop
---------------------------------------------------------------
---------------------------------------------------------------
measuring time for: ::: bitrev_8_builtin_100() :::, iterations: 100000
total time (nanoseconds): 694057000
average time per loop (nanoseconds): 6940, 10 calls per loop
---------------------------------------------------------------


As you can see, on Xeon E5, IMUL_BITREVERSE is better than BUILTIN_BITREVERSE,
but on PS4 it is worse.

So, the current builtin_bitreverse() which is based on LogN algorithm,is good enough. It might be changed on a platform that has dedicated instruction for bits reversal.
Comment 14 Konstantin Belochapka 2018-03-23 20:34:30 PDT
Created attachment 20116 [details]
yet another benchmark application for 8 bit bit reversal algorithms

clang++ -O1 -std=c++14 bench.cpp -o bench