[Performance monitoring]                            [Assembler][/][Pentium]

This gem presents a small yet quite effective method of measuring the
cycles needed to execute a piece of code, using the RDTSC instruction.

Beginning with the Pentium processor, it is possible to access the
time-stamp counter. The time-stamp counter keeps an accurate count of every
cycle executed. The time-stamp counter is a 64-bit MSR (model specific
register) that is incremented every clock cycle. On reset, the time-stamp
counter is set to zero. The counter is read with the RDTSC instruction
(read time-stamp counter), which returns the low 32 bits of the cycle count
in EAX and the high 32 bits in EDX. RDTSC returns the number of cycles
executed, not the time taken to execute them. To convert cycles to time use
this formula (frequency given in Hz, time in seconds):

        time = cycles / frequency
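
As an illustration of the formula, here is a minimal C sketch. It is not
part of the package; the names tsc_combine and cycles_to_seconds are made up
for this example, and the clock frequency is assumed to be known by the
caller.

```c
#include <assert.h>
#include <stdint.h>

/* Combine the two halves returned by RDTSC (EDX = high, EAX = low)
   into one 64-bit cycle count. */
static uint64_t tsc_combine(uint32_t edx_high, uint32_t eax_low)
{
    return ((uint64_t)edx_high << 32) | eax_low;
}

/* time = cycles / frequency, frequency in Hz, result in seconds. */
static double cycles_to_seconds(uint64_t cycles, double frequency_hz)
{
    assert(frequency_hz > 0.0);
    return (double)cycles / frequency_hz;
}
```

For example, 1000 cycles on a 200 MHz processor correspond to
1000 / 200000000 s, which is 5 microseconds.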

Since the low 32 bits of the counter wrap around after 2^32 cycles, which
takes only seconds on faster processors, the package uses the full 64-bit
count.
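
To get a feeling for how soon a counter wraps, the following C sketch
computes the time until a counter of a given width overflows when it is
incremented once per cycle. The function name seconds_until_wrap is
hypothetical and the frequency is an assumed input.

```c
#include <assert.h>

/* Seconds until a counter of the given width (1..64 bits) wraps around
   when it is incremented once per clock cycle. */
static double seconds_until_wrap(unsigned bits, double frequency_hz)
{
    double counts = 1.0;
    unsigned i;

    assert(bits >= 1 && bits <= 64);
    assert(frequency_hz > 0.0);
    for (i = 0; i < bits; i++)
        counts *= 2.0;              /* 2^bits, exact in a double */
    return counts / frequency_hz;
}
```

At 200 MHz the low 32 bits wrap after about 21.5 seconds, while the full
64-bit count would need roughly 2900 years.
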
The Pentium Pro and Pentium II processors support out-of-order execution:
instructions may be executed in a different order than you programmed them.
This can be a source of measurement errors if not taken care of.
To prevent this the programmer must serialize the instruction queue.
This can be done by inserting a serializing instruction such as CPUID
before the RDTSC instruction. There is however a problem with this: the
CPUID instruction itself takes some time to execute. The solution here is
to measure the execution time of CPUID and subtract it from the cycle count
returned by RDTSC.
A strange thing about CPUID is that it may take longer to execute the first
couple of times it is called. The best thing to do is to call the
instruction three times and measure the third call. This is utilised in the
code below.
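
The arithmetic behind this (stop count minus start count minus the CPUID
overhead) works on 64-bit values held in two 32-bit registers. The
following C model shows the borrow handling that a SUB/SBB instruction pair
performs. It is only a sketch for checking the arithmetic; the type count64
and the function elapsed are names invented for this example.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t low, high; } count64;   /* EAX, EDX pair */

/* stop - start - overhead, computed on 32-bit halves with an explicit
   borrow, the way a SUB/SBB instruction pair does it. */
static count64 elapsed(count64 start, count64 stop, uint32_t overhead)
{
    count64 r;
    uint32_t borrow;

    assert(stop.high >= start.high);            /* stop is taken later */

    borrow = stop.low < start.low;              /* SUB sets the carry  */
    r.low  = stop.low - start.low;
    r.high = stop.high - start.high - borrow;   /* SBB uses the carry  */

    borrow = r.low < overhead;                  /* SUB the overhead    */
    r.low  = r.low - overhead;
    r.high = r.high - borrow;                   /* SBB with zero       */
    return r;
}
```

Note that the borrow from the low half must reach the high half in both
subtractions, otherwise a measurement that straddles a wrap of the low
32 bits comes out wrong.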

How to use the package

The package must be included into your code segment. Please note that the
data at the end of the package must be placed in a data segment.
Then, if your CPU is a Pentium Pro or a Pentium II, PProPII must be defined
(this is because the package must then use serializing to prevent
out-of-order execution of RDTSC).
The package must be initialized by:

        call    monitor_init

Then you call the macros like this:

;
; some other piece of code

        time_start

; the code you may want to measure
;                .
;                .
;                .

        time_stop
        mov     [mycountlow],eax
        mov     [mycounthigh],edx

This was a simple example of the package. The above example does not
compensate for cache effects (code/data not being in cache). If cache
effects are not wanted you must "pretouch" the data, simply by reading it.
Then just run the measurement several times to take care of the code cache:

;
; some other piece of code

        mov     ecx,4           ; execute test code 4 times
meassureloop:
        push    ecx

        time_start

; the code you may want to measure
;                .
;                .
;                .

        time_stop
        pop     ecx
        mov     [mycount_low+ecx*4-4],eax   ; ecx = 4..1 -> index 3..0
        mov     [mycount_high+ecx*4-4],edx
        dec     ecx
        jnz     meassureloop

Note: The mycount variables must (in this example) be arrays of doublewords
with 4 elements each.
Also note that data used by a section of code should be placed together to
minimize cache effects.
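
Once the loop has stored several counts, they still have to be reduced to
one figure. One common choice, which is an assumption here and not something
the package prescribes, is to take the smallest count, since the runs
disturbed by cache misses can only be slower. A minimal C sketch, with the
hypothetical name fastest_run:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the smallest 64-bit cycle count among 'runs' measurements that
   were stored as separate low and high doubleword arrays. */
static uint64_t fastest_run(const uint32_t *low, const uint32_t *high,
                            size_t runs)
{
    uint64_t best, current;
    size_t i;

    assert(runs > 0);
    best = ((uint64_t)high[0] << 32) | low[0];
    for (i = 1; i < runs; i++) {
        current = ((uint64_t)high[i] << 32) | low[i];
        if (current < best)
            best = current;
    }
    return best;
}
```

The result does not depend on the order in which the runs were stored, so
it does not matter that the assembly loop counts downwards.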

Performance monitoring package

It is supposed to run under plain DOS (no EMM and similar) since these may
interrupt the process. Also, RDTSC can be made a privileged instruction: if
the operating system sets the TSD flag in CR4, it does not run at CPL 3.
This is not really a problem since a real monitoring session should be
performed in an environment where the program isn't interrupted, since that
would mess up the cycle count. A wise thing to do would also be to insert a
CLI right before time_start to prevent maskable interrupts.
Here is the actual monitoring package:

;
; Performance monitoring package
;
; define PProPII if your CPU is a Pentium Pro or a Pentium II
;
; implements:
;
;   monitor_init
;     initializes the package
;
;   time_start
;     start cycle count here
;
;   time_stop
;     stop counting here
;
; note:
;   the package can not do nested measurements, since the macro
;   returns all cycles in the same variable
;

;
; define cpuid and rdtsc instructions via macros
; this is not necessary if your assembler supports them
;
MACRO   cpuid
        db      0fh,0a2h
ENDM

MACRO   rdtsc
        db      0fh,031h
ENDM

;
; monitor_init:
;
; input:
;   nothing
;
; output:
;   cpuid_cycle = initialized to execution time of cpuid
;
; destroys:
;   nothing
;

monitor_init:

IFDEF PProPII
        pushfd
        pushad

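        mov     esi,3           ; cpuid destroys ecx, so count in esi
getcpuidtime:
        cpuid
        rdtsc
        mov     [cycle],eax
        cpuid
        rdtsc
        sub     eax,[cycle]
        mov     [cpuid_cycle],eax
        dec     esi
        jnz     getcpuidtime    ; the third pass leaves the value used

        popad
        popfd                   ; must match the pushfd above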
ENDIF
        ret

;
; time_start - start timing point here
;
; input:
;   none
;
; output:
;   time_cycles initialized
;
; destroys:
;   eax, ebx, ecx, edx
;   eflags
;

MACRO   time_start
IFDEF PProPII
        cpuid
ENDIF
        rdtsc
        mov     [time_cycles],eax
        mov     [time_cycles+4],edx
ENDM

;
; time_stop - stop timing point here
;
; input:
;   none
;
; output:
;   eax = low cycle count
;   edx = high cycle count
;
; destroys:
;   eax, ebx, ecx, edx
;   eflags
;

MACRO   time_stop
IFDEF PProPII
        cpuid
ENDIF
        rdtsc
        sub     eax,[time_cycles]
        sbb     edx,[time_cycles+4]

IFDEF PProPII
        sub     eax,[cpuid_cycle]
        sbb     edx,0
ENDIF

ENDM

;
; place the following data in your data segment
;

time_cycles     dq      ?
cycle           dd      ?
cpuid_cycle     dd      ?

                                                  Gem writer: John Eckerdal
                                                   last updated: 1998-06-06
