[Fast BSF replacement]                                [Assembler][/][80386]

Macro EMBSF5 emulates the BSF instruction for a non-zero argument

This macro utilizes an algorithm published in the news group comp.arch by
Robert Harley in 1996. The algorithm converts the problem of finding the
position of the least significant set bit into a bit counting problem. By
computing x^(x-1), where x is the original input argument, a right-aligned
group of 1s is created, whose cardinality equals the position of the least
significant 1-bit plus 1.

The input x is of the form (regular expression): {x}n1{0}m. x-1 has the
form {x}n0{1}m, and x^(x-1) has the form {0}n{1}(m+1). This step is
pretty similar to the one used by macro PREPBSF, except that PREPBSF
creates a right-aligned group of 1s whose cardinality equals exactly the
position of the least significant 1-bit.
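
The first step can be sketched in C as follows. This is only an
illustration, not part of the original macro; the helper names lsb_mask
and block_len are made up for this example, and the input is assumed to
be non-zero, as for the macro itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns x ^ (x - 1), a right-aligned block of 1s
   that reaches up to and including the least significant 1-bit of x.
   Assumes x != 0, like the macro. */
static uint32_t lsb_mask(uint32_t x)
{
    assert(x != 0);
    return x ^ (x - 1u);
}

/* Hypothetical helper: counts the 1s in a right-aligned block of 1s
   by shifting the block out one bit at a time. */
static unsigned block_len(uint32_t m)
{
    unsigned n = 0;
    while (m != 0) {
        m >>= 1;
        n++;
    }
    return n;
}
```

For example, x = 28h (binary 101000) gives the mask 0Fh, a block of four
1s, while the least significant 1-bit of x is at position 3.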

Harley's algorithm then employs a special method to count the number of
bits in the right-aligned block of 1s. I am not sure on which
number-theoretic argument it is founded, and it wasn't explained in the
news group post. According to Harley, if a 32-bit number of the form
00...01...11 is multiplied by the "magic" number 7*255*255*255
(= 06EB14F9h), then bits <31:26> of the low 32 bits of the product
uniquely identify the number of 1s in that number. A 64-entry table is
used to map that unique result to the bit count.
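
The uniqueness claim is easy to check by brute force, since there are
only 32 masks of the required form. The following C sketch does that;
the names HARLEY_MAGIC, harley_index and harley_indices_distinct are
invented for this example and do not come from Harley's post.

```c
#include <assert.h>
#include <stdint.h>

#define HARLEY_MAGIC (7u * 255u * 255u * 255u)   /* = 0x06EB14F9 */

/* Hash a mask of the form 00..01..11 to a 6-bit index: multiply
   modulo 2^32 and keep bits <31:26> of the product. */
static unsigned harley_index(uint32_t mask)
{
    return (unsigned)((uint32_t)(mask * HARLEY_MAGIC) >> 26);
}

/* Returns 1 if the 32 possible masks (1 to 32 one-bits) all map to
   different indices, 0 if any two of them collide. */
static int harley_indices_distinct(void)
{
    uint64_t seen = 0;                 /* one flag bit per 6-bit index */
    uint32_t mask = 0;
    for (int ones = 1; ones <= 32; ones++) {
        mask = (mask << 1) | 1u;       /* next mask: one more 1-bit */
        uint64_t bit = (uint64_t)1 << harley_index(mask);
        if (seen & bit)
            return 0;                  /* collision found */
        seen |= bit;
    }
    return 1;
}
```

Because only 32 of the 64 possible indices are ever produced, half of the
table entries are never read.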

Here, I have modified the table to reflect the bit position of the least
significant 1-bit in the original argument, which is one less than the bit
count of the intermediate result. I have tested the algorithm exhaustively
for all 2^32-1 possible inputs.

Place the following table in the data segment:

table   db      0, 0, 0,15, 0, 1,28, 0,16, 0, 0, 0, 2,21,29, 0
        db      0, 0,19,17,10, 0,12, 0, 0, 3, 0, 6, 0,22,30, 0
        db     14, 0,27, 0, 0, 0,20, 0,18, 9,11, 0, 5, 0, 0,13
        db     26, 0, 0, 8, 0, 4, 0,25, 0, 7,24, 0,23, 0,31, 0

And here follows the actual macro:

;
; emulate bsf instruction
;
; input:
;   eax = number to perform a bsf on ( != 0 )
;
; output:
;   edx = result of bsf operation
;
; destroys:
;   ecx
;   eflags
;

MACRO   EMBSF5

        mov     edx,eax         ; do not disturb original argument
        dec     edx             ; n-1
        xor     edx,eax         ; n^(n-1), now EDX = 00..01..11

IFDEF FASTMUL
        imul    edx,7*255*255*255       ; multiply by Harley's magic number
ELSE
        mov     ecx,edx         ; do multiply using shift/add method
        shl     edx,3
        sub     edx,ecx
        mov     ecx,edx
        shl     edx,8
        sub     edx,ecx
        mov     ecx,edx
        shl     edx,8
        sub     edx,ecx
        mov     ecx,edx
        shl     edx,8
        sub     edx,ecx
ENDIF

        shr     edx,26          ; extract bits <31:26>
        movzx   edx,[table+edx] ; translate into bit count - 1

ENDM
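
For cross-checking the macro on a machine without an assembler at hand,
here is a C model of it. It is a sketch only: the names bsf_table,
mul_magic, mul_magic_shift_add and embsf5_model are made up for this
example, and the code assumes 32-bit unsigned arithmetic that wraps
around modulo 2^32, as the registers do.

```c
#include <assert.h>
#include <stdint.h>

/* Same 64 entries as the assembler table above. */
static const uint8_t bsf_table[64] = {
     0, 0, 0,15, 0, 1,28, 0,16, 0, 0, 0, 2,21,29, 0,
     0, 0,19,17,10, 0,12, 0, 0, 3, 0, 6, 0,22,30, 0,
    14, 0,27, 0, 0, 0,20, 0,18, 9,11, 0, 5, 0, 0,13,
    26, 0, 0, 8, 0, 4, 0,25, 0, 7,24, 0,23, 0,31, 0
};

/* Models the FASTMUL path: a single multiply by the magic number. */
static uint32_t mul_magic(uint32_t v)
{
    return v * (7u * 255u * 255u * 255u);
}

/* Models the shift/add path: v*7 as (v << 3) - v, then three times
   *255 as (t << 8) - t. */
static uint32_t mul_magic_shift_add(uint32_t v)
{
    uint32_t t = (v << 3) - v;
    for (int i = 0; i < 3; i++)
        t = (t << 8) - t;
    return t;
}

/* Models EMBSF5: returns the position of the least significant 1-bit.
   x must be non-zero. */
static unsigned embsf5_model(uint32_t x)
{
    assert(x != 0);
    uint32_t v = x ^ (x - 1u);                 /* 00..01..11 */
    return bsf_table[mul_magic_shift_add(v) >> 26];
}
```

Both multiply routines compute the same product modulo 2^32, so either
one can be used in embsf5_model.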

Note: FASTMUL can be defined if your CPU has a fast integer multiplier,
like the AMD. The shift/add replacement for IMUL should run in about 8-9
cycles.
                                                  Gem writer: Norbert Juffa
                                                   last updated: 1998-03-16
