[Penalties on the 486]                                [Assembler][/][80486]

In most cases, the 486 is free from flow-dependence penalties, which means
that an instruction that uses the result of the previous instruction will
not cause a slowdown:

        add     eax,ebx
        add     ecx,eax

takes two cycles. On a Pentium it takes two cycles as well, but there the
sequence

        add     eax,ebx
        add     ecx,edx

takes one cycle because the second instruction does not use the result of
the first, so the two can be paired. These situations are described quite
well in the application note "Intel Architecture Optimization Manual"
released by Intel. I just want to point out one interesting thing.
Generally, the 486 has two types of flow-dependence penalties:

   * Immediately using a register after its 8-bit subregister was modified.
     (This applies to (E)AX, (E)BX, (E)CX and (E)DX after AL, BH etc. has
     been changed.)
   * Using a register in addressing immediately after it was modified.
     (This is valid for all registers, and beware, LEA is an addressing
     instruction.)

For example, how many cycles does the following code sequence eat (in
protected mode, assuming a 100% cache hit rate)?

        add     ecx,ebp
        adc     bl,dl
        mov     al,[ebx]

On the 486 the ADD takes one cycle and the ADC another, but the MOV takes
three cycles even if the operand is already in the cache. Why? There is a
double penalty: one clock for using a register in addressing right after
it was modified (Address Generation Interlock - AGI), and another clock
for using a register right after its subregister was modified (Flow
Break). So this innocent MOV instruction costs three cycles. "I'm a smart
coder, I'm gonna put an instruction between the ADC and the MOV, and the
problem is solved!" Really? The

        add     ecx,ebp
        adc     bl,dl
        sub     esi,ebp
        mov     al,[ebx]

sequence takes 5 clocks: the ADD, ADC and SUB take three, but the MOV takes
two, because ONE cycle inserted BETWEEN the ADC and the MOV can remove only
ONE penalty, not TWO. So for a perfect ratio of one clock per instruction,
at least TWO instructions have to be inserted. Or one two-cycle
instruction like SHR, or even a prefixed instruction like ADD AX,BX in
32-bit code.
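
As a sketch of that fix (the ADD EDI,EDX here is only a hypothetical
filler of mine; any one-cycle instruction that neither touches EBX nor
stalls on its own would do), going by the two rules described above the
following sequence should run at one clock per instruction, 5 clocks in
total:

        add     ecx,ebp
        adc     bl,dl
        sub     esi,ebp         ; filler 1: does not touch EBX
        add     edi,edx         ; filler 2 (hypothetical): does not touch EBX
        mov     al,[ebx]        ; two instructions after ADC BL,DL, so by
                                ; the rules above neither penalty applies

In real code the fillers would naturally be useful work moved here from
elsewhere in the loop, not instructions added just to pad the gap.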
                                                     Gem writer: Ervin Toth
                                                   last updated: 1998-03-16
