[Fast strlen()]                                       [Assembler][/][80386]

Fast implementation of strlen()

Recently, someone wrote to me pointing out that strlen() is a very
commonly called function, and asked about possible performance
improvements for it. At first, without thinking too hard about it, I
didn't see any opportunity to fundamentally improve the algorithm. I was
right about that, but at the level of low-level implementation there is
plenty of opportunity. The algorithm is a byte scan, and the typical thing
a C version does wrong is issue a separate load for every byte, when a
single 32-bit load can fetch four bytes at once.
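
For reference, the byte scan that a straightforward C version amounts to
can be sketched as below. This is only an illustration: the name
strlen_bytewise is made up here, and real library implementations differ.
The point is that every iteration performs its own one-byte load.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name, for illustration only: a plain byte scan. */
size_t strlen_bytewise(const char *s)
{
    assert(s != NULL);          /* precondition: a valid C string */
    const char *p = s;

    /* One load and one compare per character. */
    while (*p != '\0')
        p++;

    return (size_t)(p - s);     /* terminator address minus start */
}
```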

;
; fast strlen()
;
; input:
;   eax = offset to string
;
; output:
;   ecx = length
;
; destroys:
;   ebx
;   eflags
;

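        lea     ecx,[eax-1]     ; ecx = scan pointer, pre-decremented
l1:     inc     ecx             ; byte scan until ecx is dword aligned
        test    ecx,3
        jz      l2
        cmp     [byte ptr ecx],0
        jne     l1
        jmp     l6              ; terminator found before alignment
; main loop: one dword load, then four byte tests
; (U and V name the Pentium pipe each instruction is meant to pair in)
l2:     mov     ebx,[ecx]       ; U
        add     ecx,4           ;   V
        test    bl,bl           ; U
        jz      l5              ;   V
        test    bh,bh           ; U
        jz      l4              ;   V
        test    ebx,0ff0000h    ; U
        jz      l3              ;   V
        test    ebx,0ff000000h  ; U
        jnz     l2              ;   V +1brt
        inc     ecx             ; fall through: byte 3 is zero
l3:     inc     ecx             ; entry when byte 2 is zero
l4:     inc     ecx             ; entry when byte 1 is zero
l5:     sub     ecx,4           ; entry when byte 0 is zero; undo the add
l6:     sub     ecx,eax         ; length = terminator address - start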

Here, I've sacrificed size for performance by unrolling the loop 4 times.
If the input strings are fairly long (which is when performance matters),
on a Pentium the asm code executes at a rate of 1.5 clocks per byte, while
the C compiler's code takes 3 clocks per byte.
Note: This routine can also be used on lower CPUs by switching to 16-bit
registers, but then it may not be the fastest implementation.
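
To make the structure of the assembly listing easier to follow, here is a
rough C rendering of the same idea: scan single bytes until the pointer is
4-byte aligned, then do one 32-bit load per four bytes and test each byte
of the loaded value in turn. This is a sketch under assumptions, not a
drop-in replacement: the name strlen_dword is made up, and the byte masks
assume a little-endian target, as on x86.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name; assumes a little-endian CPU (lowest address = low byte). */
size_t strlen_dword(const char *s)
{
    assert(s != NULL);
    const char *p = s;

    /* Prologue (label l1): byte scan until p is 4-byte aligned. */
    while (((uintptr_t)p & 3u) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Main loop (label l2): one aligned load, then four byte tests. */
    for (;;) {
        uint32_t w;
        memcpy(&w, p, sizeof w);    /* compilers typically emit a single load */

        if ((w & 0x000000FFu) == 0) return (size_t)(p - s);      /* l5 */
        if ((w & 0x0000FF00u) == 0) return (size_t)(p - s) + 1;  /* l4 */
        if ((w & 0x00FF0000u) == 0) return (size_t)(p - s) + 2;  /* l3 */
        if ((w & 0xFF000000u) == 0) return (size_t)(p - s) + 3;
        p += 4;
    }
}
```

Like the assembly, this may read up to three bytes past the terminator.
Because the load is aligned, those bytes lie in the same dword and so the
read cannot cross into another page, but it is outside what ISO C
guarantees, so treat the C version as an illustration of the technique.
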
                                                     Gem writer: Paul Hsieh
                                                   last updated: 1998-06-07
