[Copying data using the FPU]                  [Assembler][/][Pentium][+FPU]

This gem shows how you can use the FPU to copy data between memory
locations. The following loop can be used for block memory copying. I don't
know who originally developed this kind of loop, but it has been presented
in various documents. This version comes from Agner Fog's excellent Pentium
optimization manual.

;
; copying data using the fpu
;
; input:
;   esi = source
;   edi = destination
;   ecx = number of 16-byte chunks to move (must not be zero)
;
; output:
;   none (data from esi is copied to edi)
;
; destroys:
;   esi, edi, ecx
;   flags, fp flags
;

topofloop:
        fild    qword ptr [esi]
        fild    qword ptr [esi+8]
        fxch
        fistp   qword ptr [edi]
        fistp   qword ptr [edi+8]
        add     esi,16
        add     edi,16
        dec     ecx
        jnz     topofloop
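
As a cross-check on what the loop does, here is a minimal reference model in
C. It is only a sketch: the function name copy_chunks16 is hypothetical, and
it models the data movement, not the timing. It can be handy for comparing
against the output of the assembler loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reference model of the FPU loop (hypothetical name): copies `chunks`
   16-byte chunks from src to dst, two 64-bit moves per iteration. */
static void copy_chunks16(void *dst, const void *src, size_t chunks)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* The assembler loop tests the counter only after the first pass,
       so a count of zero is not a valid input; the model rejects it. */
    assert(chunks > 0);

    do {
        uint64_t lo, hi;
        memcpy(&lo, s, 8);       /* fild  qword ptr [esi]   */
        memcpy(&hi, s + 8, 8);   /* fild  qword ptr [esi+8] */
        memcpy(d, &lo, 8);       /* fxch + fistp qword ptr [edi]   */
        memcpy(d + 8, &hi, 8);   /* fistp qword ptr [edi+8] */
        s += 16;                 /* add   esi,16 */
        d += 16;                 /* add   edi,16 */
    } while (--chunks != 0);     /* dec ecx / jnz topofloop */
}
```

The FXCH in the original is what makes the first FISTP store the low qword:
after the two FILDs the high qword sits on top of the FPU stack.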

The loop is optimal on (a fast) Pentium when both the source and
destination are aligned on 64-bit boundaries and the destination is not in
the cache. (Additionally, the loop can be optimal on PPro if the destination
memory does not permit write-combining.)

If the destination is in the cache (or the destination memory permits
write-combining on PPro), then REP MOVSD will be faster.

With the noted exceptions, the loop is faster than REP MOVSD because it does
half as many writes to external memory: each FISTP stores 64 bits, whereas
REP MOVSD stores 32 bits at a time. External memory is usually very slow
compared to the execution time of the loop. Consequently, after a few
iterations of the loop the write buffers of the CPU fill up, and subsequent
iterations execute at the speed of external memory.

For small memory blocks you should use a simple DWORD copy loop, because the
overhead of the FPU copy loop is much higher than that of most other memory
copy loops.
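
The preconditions above can be checked before entering the loop. The sketch
below is in C and uses hypothetical names (copy_plan, plan_fpu_copy). It
splits a byte count into 16-byte chunks plus a tail and tests the 64-bit
alignment of both pointers. Where "small" begins is not stated here, so that
threshold is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: how a byte count splits up for the FPU loop. */
struct copy_plan {
    size_t chunks;   /* value for ecx: number of 16-byte chunks */
    size_t tail;     /* leftover bytes (0..15) for a simple copy loop */
    int    aligned;  /* nonzero if both pointers are 8-byte aligned */
};

static struct copy_plan plan_fpu_copy(const void *dst, const void *src,
                                      size_t nbytes)
{
    struct copy_plan p;

    p.chunks  = nbytes / 16;
    p.tail    = nbytes % 16;
    /* Both addresses must have their low three bits clear. */
    p.aligned = (((uintptr_t)dst | (uintptr_t)src) & 7u) == 0;

    assert(p.chunks * 16 + p.tail == nbytes);
    return p;
}
```

A caller would fall back to a plain copy loop when aligned is zero or chunks
is zero, and would copy the tail bytes separately in any case.
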
You might think that you should use FLD/FSTP instead of FILD/FISTP.
Unfortunately FLD/FSTP would not work very well, because not all 64-bit
values are normal floating point values, and the handling of denormal
floating point numbers is very slow.

But it's even worse. Denormals (see notes) make the FLD/FSTP copying slow,
but it will still be functionally correct. If, however, the data represents
an SNAN (see notes), it will be quietly converted to a QNAN (see notes) if
IE is masked (CW.IM = 1), or you will get an exception if IE is unmasked
(CW.IM = 0). FILD/FISTP has no such problem, because every 64-bit integer
is exactly representable in the FPU's internal 80-bit format.

Therefore one should really forget about trying to use FLD/FSTP for memory
copy loops.

For related information see Agner Fog's Pentium optimization manual (you
can find it at http://announce.com/agner/assem) and Intel's Pentium Pro
developer's manual volume 3 for information on write buffers, caches,
write-combining etc. (it can be found at Intel's developer WWW site).

notes:
SNANs are all the numbers where bits <62:52> = 7FFh, and bit <51> = 0 and
bits <50:0> !=0. An SNAN is converted to a QNAN by setting bit<51>.
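
The SNAN rule can be written down directly as bit masks. The following C
sketch uses hypothetical names (is_snan_bits, quieted). It only models the
bit patterns described in this note and does not execute any FPU
instruction.

```c
#include <assert.h>
#include <stdint.h>

#define EXP_MASK   UINT64_C(0x7FF0000000000000)  /* bits <62:52> */
#define QUIET_BIT  UINT64_C(0x0008000000000000)  /* bit  <51>    */
#define LOW_MASK   UINT64_C(0x0007FFFFFFFFFFFF)  /* bits <50:0>  */

/* Nonzero if the qword, read as a double, is a signalling NaN. */
static int is_snan_bits(uint64_t q)
{
    return (q & EXP_MASK) == EXP_MASK
        && (q & QUIET_BIT) == 0
        && (q & LOW_MASK) != 0;
}

/* Models the bit pattern that comes out when an SNAN is converted to a
   QNAN: bit <51> gets set. Other patterns are returned unchanged. */
static uint64_t quieted(uint64_t q)
{
    uint64_t r = is_snan_bits(q) ? (q | QUIET_BIT) : q;

    assert(!is_snan_bits(r));
    return r;
}
```

For example, the qword 7FF0000000000001h is an SNAN and would come out of a
masked FLD/FSTP copy as 7FF8000000000001h, so the copy would not be exact.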

Denormals are numbers where the exponent field has all bits set to 0 and the
mantissa is non-zero. In the copy process this means any aligned 64-bit
entity whose bits <62:52> (exponent field) are zero while bits <51:0> are
non-zero.
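
The denormal rule can be sketched the same way in C. The names
is_denormal_bits and count_denormal_qwords are hypothetical. The second
function simply counts how many qwords of a buffer would take the slow path
in an FLD/FSTP copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero if the qword, read as a double, is a denormal:
   exponent field all zero, mantissa non-zero. The sign bit is ignored. */
static int is_denormal_bits(uint64_t q)
{
    uint64_t exponent = (q >> 52) & 0x7FF;                  /* bits <62:52> */
    uint64_t mantissa = q & UINT64_C(0x000FFFFFFFFFFFFF);   /* bits <51:0>  */

    return exponent == 0 && mantissa != 0;
}

/* Counts the qwords in data[0..n-1] that look like denormals. */
static size_t count_denormal_qwords(const uint64_t *data, size_t n)
{
    size_t hits = 0;

    assert(data != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        hits += (size_t)is_denormal_bits(data[i]);
    return hits;
}
```

Note that a qword holding a small positive integer such as 1 passes this
test, so quite ordinary data can look like a denormal to the FPU.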

                                               Gem writer: (code) Agner Fog
                                                       (text) Vesa Karvonen
                                                   (comments) Norbert Juffa
                                                   last updated: 1998-03-16
