[Population count of a 32-bit register]               [Assembler][/][80386]

Here is a macro for a fast, branchless population count on a 32-bit
register. I have put this together from various pieces of information found
in News postings over the past 6 year.

POPCOUNT1 implements a branchless computation of the population count (ie.
the number of 1-bits) of a 32-bit register. It is based on a O(log(n))
algorithm that succesively groups the bits into groups of 2, 4, 8, 16, and
32, while maintaining a count of 1-bits in each group.

  1. Partition the register into groups of 2 bits. Compute the population
     count for each 2-bit group and store the result in the 2-bit group.
     This calls for the following transformation to be performed for each
     2-bit group: 00b -> 00b, 01b -> 01b, 10b -> 01b, 11b -> 10b If the
     orginal value of a 2-bit group is x, then the new value will be x - (x
     >> 1). In order to handle all 2-bit groups simultaneously, we have to
     mask appropriately to prevent spilling from one group to the next
     lower group. Thus: n = n - ((n >> 1) & 0x55555555)

  2. Now, add the population count of adjacent 2-bit groups and store the
     sum to the 4-bit group resulting from merging these adjacent 2-bit
     groups. To do this simultaneous to all groups, we mask out the odd
     numbered groups, mask out the even numbered groups, and then add the
     odd numered groups to the even numbered groups: n = (n & 0x33333333) +
     ((n >> 2) & 0x33333333) Each 4-bit field now has value 0000b, 0001b,
     0010b, 0011b, or 0100b.

  3. Now, for the first time, the value in each k-bit field is small enough
     that adding two k-bit fields results in a value that still fits in the
     k-bit field. Thus the following computation is performed:
          n = (n + (n >> 4)) & 0x0F0F0F0F
     The result is four 8-bit fields whose lower half has the correct sum
     and whose upper half contains junk that has to be masked out.
     Pictorially:

          n      = 0aaa0bbb0ccc0ddd0eee0fff0ggg0hhh
          n >> 4 = 00000aaa0bbb0ccc0ddd0eee0fff0ggg
          sum    = 0aaaWWWWiiiiXXXXjjjjYYYYkkkkZZZZ

     where WWWW, XXXX, YYYY, and ZZZZ are the interesting sums each at most
     1000b, or 8 decimal.

  4. We now have two choices of how to continue. If the target machine has
     a fast integer multiply, we can use an IMUL to add the 4-bit fields
     like so:

          0000pppp0000qqqq0000rrrr0000ssss * 00000001000000010000000100000001 =

                                  :0000pppp0000qqqq0000rrrr0000ssss
                          0000pppp:0000qqqq0000rrrr0000ssss
                  0000pppp0000qqqq:0000rrrr0000ssss
          0000pppp0000qqqq0000rrrr:0000ssss
          ---------------------------------------------------------------
          0000ppppxxxxxxxxwwwwwwww:vvvvvvvvuuuuuuuutttttttt0000ssss

     where pppp, qqqq, rrrr, and ssss are the 4-bit groups we want to add,
     and vvvvvvvvv is the sum we are interest in. If integer multiplies are
     slow, we continue reducing the number of bit groups by a factor of two
     per step until only one group remains:

       a. For the next step, adjacent 8-bit fields are added together and
          the sum is put into 16-bit fields created by merging the adjacent
          8-bit fields. Compute:

               n = n + (n >> 8)

               n      = 0000WWWW0000XXXX0000YYYY0000ZZZZ
               n >> 8 = 000000000000WWWW0000XXXX0000YYYY
               sum    = 0000WWWW000ppppp000qqqqq000rrrrr

          where ppppp and rrrrr are the interesting sums (and each is at
          most 10000b or 16 decimal). The sum qqqqq is junk, but it is not
          necessary to discard it this time.

       b. In the last step, the values of the two 16-bit groups are added,
          producing the final result in the six least significant bits. Any
          remaining junk is masked out.

               n = (n + (n >> 16)) & 0x3F

               n      = 0000WWWW000ppppp000qqqqq000rrrrr
               n >> 16= 00000000000000000000WWWW000ppppp
               sum    = 0000WWWW000ppppp000sssss00tttttt

          where sssss is at most 11000b and tttttt at most 100000b (32
          decimal).

     ;
     ; population count of a 32 bit register
     ;
     ; input:
     ;   esi = number to population count
     ;
     ; output:
     ;   edx = population count
     ;
     ; destroys:
     ;  esi, edx
     ;  eflags
     ;

     MACRO   POPCOUNT1
             mov     edx,esi
             shr     edx,1
             and     edx,055555555h          ; (n >> 1) & 0x55555555
             sub     esi,edx                 ; n - ((n >> 1) & 0x55555555)
             mov     edx,esi
             shr     edx,2                   ; n >> 2
             and     esi,033333333h          ; n & 0x33333333
             and     edx,033333333h          ; (n >> 2)  & 0x33333333
             add     esi,edx                 ; n = (n & 0x33333333) + ((n >> 2) & 0x33333333)
             mov     edx,esi
             shr     edx,4                   ; n >> 4
             add     edx,esi                 ; n + (n >> 4)
             and     edx,00F0F0F0Fh          ; n = (n + (n >> 4) & 0x0F0F0F0F)
             IFDEF FASTMUL
                     imul    edx,001010101h  ; add by multiplying with a "magic number"
                     shr     edx,24          ; shift result into place
             ENDIF
             IFNDEF FASTMUL
                     mov     esi,edx
                     shr     esi,8           ; n >> 8
                     add     edx,esi         ; n = n + (n >> 8)
                     mov     esi,edx
                     shr     esi,16          ; n >> 16
                     add     edx,esi         ; n = n + (n >> 16)
                     and     edx,00000003Fh  ; popcount
             ENDIF
     ENDM

     Note that FASTMUL can be defined if your CPU has does a multiply in
     few cycles (for example the AMD which can do a multiply in 2 cycles.
     This gem comes from the K6 optimization manual, available at AMDs
     homepage. There are a gem showing how to use this macro, see Emulate
     bit scan instructions for further information.
                                                  Gem writer: Norbert Juffa
                                                   last updated: 1998-03-16
