[Optimise for MC68k CPUs]                           [Assembler][/][MC68000]

The Motorola 68K family consists of a wide range of members, from the
micro-coded MC68000 to the superscalar, hard-wired MC68060. This text
discusses the integer performance of the 68K family and how to optimize
both for individual processors and for code that runs well on any 68K
processor.

In general, optimizations can take place on three different levels:

  1. On the assembly level for code that is written in assembly language.
  2. On the compiler level if the application is written in a high level
     language (e.g. C).
  3. On the user level by changing algorithms.

This text covers level 1 only: what can be done to optimize 68K code at the
assembly level.

The 68K family includes the following members:

     MC68000
          First generation 68K processor.
          16 bit internal/external data paths.
          16 MB address space.

     MC68008
          8 bit external data path.
          1-4 MB address space.

     MC68010
          Similar to MC68000, but with restartable instructions.
          Can be used in a virtual memory environment.
          Loop mode.

     MC68EC000
          Low-power MC68000.
          8 or 16 bit external data bus.

     MC68020
          32 bit virtual memory microprocessor.
          32 bit internal/external data paths.
          4 GB address space.
          Can be used with floating point coprocessor.
          New instructions added including bitfield instructions.
          New addressing modes added.
          256 bytes instruction cache.

     MC68EC020
          16 MB address space.

     MC68030
          Similar to MC68020 but slightly faster.
          256 bytes data cache added. On-chip MMU.

     MC68EC030
          Low-power MC68030. No MMU.

     CPU32
          Basically a 68020 core but without cache, bitfield instructions
          and memory indirect addressing modes.
          16 bit external data path.
          No coprocessor.
          CPU32+: same as CPU32 but with 32 bit external data path.

     MC68040
          Third generation 32 bit processor.
          4K instruction cache.
          4K data cache.
          On chip floating point processor.
          On chip MMU.
          Most instructions take one cycle.

     MC68EC040
          Low-power MC68040.
          No MMU.
          No FPU.

     MC68060
          Super scalar implementation of the 68K architecture.
          Can issue up to two instructions per cycle.
          8K instruction cache.
          8K data cache.

     MC68EC060
          Similar to MC68060.
          No FPU.
          No MMU.

The following table summarizes the characteristics of the different members
in the 68000 family:

    Processor    Cache     Register  Memory  Mul  Index  Branch  UAcc   HWFP
                             Add      Add
      68000      None         6        18     40    18   10 / 6  no     no
      68020     256 / 0       2         6     28     9    6 / 4  yes  68881/2
      68030    256 / 256      2         5     28     8    6 / 4  yes  68881/2
      CPU32      None         2         9     16    12    8 / 4  no     no
      68040    4 K / 4 K      1         1     16     3    2 / 3  yes    yes
      68060    8 K / 8 K      1         1      2     1    0 / 1  yes    yes

            Register Add  Register to register 32 bit add
                          (ADD.L D0,D1)

            Memory Add    Absolute long address to register add
                          (ADD.L _MEM,D1)

            Mul           16 x 16 multiplication (max. time)
                          (MULU.W D0,D1)

            Index         Indexed addressing mode
                          (MOVE.L 2(A0,D0),D1)

            Branch        Byte conditional branch taken / not taken
                          (BNE.B Label)

            UAcc          Unaligned access allowed
                          (MOVE.L 0xFFFF0001,D1)

            HWFP          Hardware floating point support

When optimizing for the 68K family, we divide the members into the
following groups:

     68000
          Optimize for the following processors:
          MC68000/10, MC68008/MC68EC000
     68020
          Optimize for the following processors:
          MC68020/30, MC68EC020/30, CPU32/CPU32+
     68040
          Optimize for the following processors: MC68040/MC68EC040
     68060
          Optimize for the following processors: MC68060/MC68EC060
     680xx
          Optimize so the code will execute reasonably well on any 68K
          processor.

Since an optimization for one 68K processor can make another one execute
slower, it is important to know the individual instruction timings for each
member. Here are some examples of different ways of doing the same
operation, with the preferred method for each 68K processor:

   * Operations with long immediate values between -128 and 127:
          A  add.l #20,d1           B  moveq.l #20,d0
                                       add.l d0,d1

          A  68040/xx               B  68000/20/60

   * Byte/word operations that could be replaced with long operations:
          A  add.w d0,d1            B  add.l d0,d1

          A  68000/20/40/xx         B  68020/40/60

   * Keep memory operands in registers:
          A  add.l _var,d1          B  move.l _var,d0
             add.l _var,d2             add.l d0,d1
                                       add.l d0,d2

          A  68040 (as long as      B  68000/20/60/xx
             the total number of
             instructions is lower)

   * Reschedule operations using address registers:
          A  add.l d0,d1            B  move.l (a1),a0
             move.l (a1),a0            add.l d0,d1
             move.l (a0),d2            move.l (a0),d2

          A  68000/20               B  680xx

   * Replace constant multiplications with adds/subs/shifts:
          A  mulu.w #254,d1         B  move.l d1,d0
                                       lsl.l #8,d1
                                       lsl.l #1,d0
                                       sub.l d0,d1

          A  68060                  B  68000/20/40/xx

   * Operations using indexing modes:
          A  add.l (a0,d7),d1       B  add.l d7,a0
             add.l (a0,d7),d2          add.l (a0),d1
                                       add.l (a0),d2

          A  68000/60               B  68020/40/xx

   * Saving/restoring registers:
          A  movem.l d4-d7,-(a7)    B  move.l d7,-(a7)
                                       move.l d6,-(a7)
                                       move.l d5,-(a7)
                                       move.l d4,-(a7)

          A  68000/20/60/xx         B  68040 (if time
                                       critical)

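Two of the rewrites above rest on simple arithmetic facts that can be
checked on any host. The following C sketch is only a model, not 68K code.
The names fits_moveq, mulu_w_254, mul254_shift and mul254_models_agree are
hypothetical, and the models assume that moveq sign-extends an 8 bit
immediate to 32 bits and that mulu.w reads only the low 16 bits of its
operands.

```c
#include <stdint.h>

/* moveq carries an 8 bit immediate that is sign-extended to 32 bits,
   so only constants in -128..127 qualify for the moveq rewrite. */
int fits_moveq(int32_t value)
{
    return value >= -128 && value <= 127;
}

/* Model of mulu.w #254,d1: unsigned 16 x 16 -> 32 multiply that reads
   only the low word of d1 (assumed semantics, see lead-in). */
uint32_t mulu_w_254(uint32_t d1)
{
    return (d1 & 0xFFFFu) * 254u;
}

/* Model of the shift/subtract sequence:
   d1 = (d1 << 8) - (d1 << 1) = d1 * 254 (mod 2^32). */
uint32_t mul254_shift(uint32_t d1)
{
    uint32_t d0 = d1;   /* move.l d1,d0 */
    d1 <<= 8;           /* lsl.l #8,d1  */
    d0 <<= 1;           /* lsl.l #1,d0  */
    return d1 - d0;     /* sub.l d0,d1  */
}

/* Returns 1 if both models agree for every 16 bit input, else 0. */
int mul254_models_agree(void)
{
    uint32_t x;
    for (x = 0; x <= 0xFFFFu; x++) {
        if (mulu_w_254(x) != mul254_shift(x))
            return 0;
    }
    return 1;
}
```

Note that the shift sequence works on the full 32 bit register, so in this
model it matches mulu.w only when the upper word of d1 is zero.
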
Summary of characteristics for each processor:

68000:

   * Lacks 68020 instruction extensions:
          No extb.l instruction
          No 32 bit multiply
          No scaled indexing mode
          No 32 bit PC relative branches
   * Use short instructions
   * Keep values in registers
   * No scheduling necessary
   * Code optimized for 68020 or 68060 runs great

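The missing 32 bit multiply can be worked around by combining 16 x 16
products, which is the multiply the 68000 does provide. The C sketch below
models that decomposition for the low 32 bits of the result. The names
mulu_w and mul32_from_16 are hypothetical, and the code illustrates the
arithmetic only; it is not a tuned 68000 routine.

```c
#include <stdint.h>

/* Model of mulu.w: unsigned 16 x 16 -> 32 multiply. */
static uint32_t mulu_w(uint16_t a, uint16_t b)
{
    return (uint32_t)a * (uint32_t)b;
}

/* Low 32 bits of a 32 x 32 product built from three 16 x 16 products.
   The high x high term only affects bits 32 and up, so it is skipped. */
uint32_t mul32_from_16(uint32_t a, uint32_t b)
{
    uint16_t a_lo = (uint16_t)a, a_hi = (uint16_t)(a >> 16);
    uint16_t b_lo = (uint16_t)b, b_hi = (uint16_t)(b >> 16);

    uint32_t low   = mulu_w(a_lo, b_lo);
    uint32_t cross = mulu_w(a_hi, b_lo) + mulu_w(a_lo, b_hi);

    /* Overflow in cross is harmless: those bits are shifted out. */
    return low + (cross << 16);
}
```

Three multiplies plus shifts and adds are far more expensive than a single
instruction, which is one reason to keep operands within 16 bits where the
data allows it.
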
68020:

   * Use short instructions
   * Keep values in registers
   * Almost no scheduling necessary
   * Code optimized for the 68060 runs great

68040:

   * Use as few instructions as possible (even if they are longer)
   * Values can be kept in memory
   * Avoid pipeline stalls for some effective addresses
   * Avoid subtracts to address registers

68060:

   * Use short instructions
   * Keep values in registers
   * Schedule instructions for superscalar execution
   * Inline short functions

680xx:

   * If the code is to be executed on a 68000 processor, the 68000
     instruction subset must be used.
   * Avoid bitfield instructions.
   * Align all data.
   * Schedule the instructions for a 68060.
   * Avoid complex addressing modes (memory indirect).

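The alignment rule matters because, according to the UAcc column of the
table earlier in this text, the 68000 and CPU32 do not allow unaligned
accesses. Below is a minimal C sketch of the corresponding checks, assuming
that word and long accesses must start at an even address on those
processors. The helper names is_even_address and align_up are hypothetical.

```c
#include <stdint.h>

/* Word and long accesses are safe on every 68K member only when the
   address is even (assumption stated in the lead-in). */
int is_even_address(uint32_t addr)
{
    return (addr & 1u) == 0;
}

/* Round an offset up to the next multiple of alignment,
   which must be a power of two. */
uint32_t align_up(uint32_t offset, uint32_t alignment)
{
    return (offset + alignment - 1u) & ~(alignment - 1u);
}
```

For example, the address 0xFFFF0001 used in the UAcc legend is odd, so
is_even_address reports it as unsuitable for a portable long access.
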
Note: This text was originally longer, covering all three levels of
optimisation, but since this is an MC68k-specific text the rest has been
cut. The text has also been slightly modified. The original author is
unknown.
                                                  Gem writer: John Eckerdal
                                                   last updated: 1998-03-16
