This memcpy() code is optimized for all AMD Athlon and Duron family
processors.  This includes Athlon XP, Athlon MP, Athlon 4 (mobile),
and Duron.  The code uses MMX and PREFETCHNTA instructions, and employs
non-temporal memory writes on large blocks; these writes bypass the
cache for better efficiency.  Large blocks are also copied using the
Block Prefetch technique.
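
As a rough illustration of the streaming-store idea described above, here
is a minimal sketch in C.  It is not the memcpy_amd() code.  It assumes
SSE2 intrinsics (MOVNTDQ through _mm_stream_si128) in place of the MMX
MOVNTQ instruction, and it does not implement Block Prefetch.  The names
copy_sketch and STREAM_THRESHOLD, the 64 KB threshold, and the 256-byte
prefetch distance are hypothetical values chosen only for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>          /* SSE2; also provides _mm_prefetch, _mm_sfence */
#define HAVE_SSE2_STREAM 1
#endif

/* Hypothetical cut-over size; tune it by measurement on the target system. */
#define STREAM_THRESHOLD (64u * 1024u)

void *copy_sketch(void *dst, const void *src, size_t n)
{
#ifdef HAVE_SSE2_STREAM
    if (n >= STREAM_THRESHOLD) {
        unsigned char *d = (unsigned char *)dst;
        const unsigned char *s = (const unsigned char *)src;

        /* Copy a short head so the destination becomes 16-byte aligned,
           which _mm_stream_si128 requires. */
        size_t head = (size_t)(-(uintptr_t)d & 15u);
        memcpy(d, s, head);
        d += head; s += head; n -= head;

        while (n >= 64) {
            /* Hint that data 256 bytes ahead will be needed once (NTA),
               but only while that address is still inside the source. */
            if (n >= 320)
                _mm_prefetch((const char *)(s + 256), _MM_HINT_NTA);

            __m128i a = _mm_loadu_si128((const __m128i *)(s +  0));
            __m128i b = _mm_loadu_si128((const __m128i *)(s + 16));
            __m128i c = _mm_loadu_si128((const __m128i *)(s + 32));
            __m128i e = _mm_loadu_si128((const __m128i *)(s + 48));

            /* Non-temporal stores: written to memory without filling the cache. */
            _mm_stream_si128((__m128i *)(d +  0), a);
            _mm_stream_si128((__m128i *)(d + 16), b);
            _mm_stream_si128((__m128i *)(d + 32), c);
            _mm_stream_si128((__m128i *)(d + 48), e);

            s += 64; d += 64; n -= 64;
        }
        _mm_sfence();           /* make the streamed stores globally visible */
        memcpy(d, s, n);        /* remaining tail, fewer than 64 bytes */
        return dst;
    }
#endif
    /* Small blocks, or no SSE2 at compile time: use the ordinary copy. */
    return memcpy(dst, src, n);
}
```

Small blocks stay on the ordinary cached path because they are likely
to be read again soon; only large blocks take the streaming path.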

This code typically performs significantly better than the standard
library memcpy().  The size of the gain depends on the particular
system, including CPU speed, CPU type, chipset, main memory type, and
main memory speed.  The data block size and alignment are also factors.
Developers should test their applications to determine the actual
performance benefit.

The application should verify that it is running on an AMD Athlon,
Duron, or other suitable processor before calling this optimized
memcpy().  The CPU must support MMX, PREFETCHNTA, and MOVNTQ.  On other
processors, call the standard library memcpy() instead.
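
One possible way to perform this check is sketched below in C.  It is
an illustration under stated assumptions, not part of the package.  It
assumes GCC or Clang on x86 (the <cpuid.h> header and __get_cpuid).  It
also assumes that PREFETCHNTA and MOVNTQ are available when the CPU
reports either SSE (CPUID function 1, EDX bit 25) or the AMD extensions
to MMX (CPUID function 0x80000001, EDX bit 22).  The names
cpu_features_ok, cpu_supports_memcpy_amd, and choose_copy are
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#define HAVE_CPUID_H 1
#endif

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Decide from the raw CPUID feature words (hypothetical helper).
   std_edx: EDX from CPUID function 1.
   ext_edx: EDX from CPUID function 0x80000001. */
int cpu_features_ok(uint32_t std_edx, uint32_t ext_edx)
{
    int mmx    = (std_edx >> 23) & 1;   /* MMX */
    int sse    = (std_edx >> 25) & 1;   /* SSE */
    int mmxext = (ext_edx >> 22) & 1;   /* AMD extensions to MMX */
    return mmx && (sse || mmxext);
}

/* Query the running CPU; reports 0 when CPUID is not available here. */
int cpu_supports_memcpy_amd(void)
{
#ifdef HAVE_CPUID_H
    unsigned int eax, ebx, ecx, edx;
    uint32_t std_edx = 0, ext_edx = 0;

    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        std_edx = edx;
    if (__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        ext_edx = edx;
    return cpu_features_ok(std_edx, ext_edx);
#else
    return 0;
#endif
}

/* Return the optimized routine when the CPU qualifies, otherwise the
   standard library memcpy(). */
copy_fn choose_copy(copy_fn optimized)
{
    return cpu_supports_memcpy_amd() ? optimized : memcpy;
}
```

An application could call choose_copy(memcpy_amd) once at startup and
keep the returned pointer, so the check is not repeated on every copy.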

The optimized memcpy is called memcpy_amd() to avoid naming conflicts
with the standard memcpy().  Two versions of the code are included:

memcpy_amd.cpp
This version uses inline assembly language and targets Microsoft
Visual C++ 6.0 with the Processor Pack, or a later edition (such as
Visual Studio .NET) that supports the instruction set extensions.

memcpy_amd.asm
This is a pure assembly language version, using MASM syntax.