[NTLUG:Discuss] CPU History

Wed Jan 8 23:45:09 CST 2003

This is like Chinese to me!
Care to explain this for a layman???

-----Original Message-----
From: David Simmons [mailto:dsimmons at powersmiths.com]
Sent: Thursday, January 09, 2003 12:50 AM
To: discuss at ntlug.org
Subject: [NTLUG:Discuss] CPU History

Thought you might like this 'mini-history' lesson of CPU design from a
computer engineer down in Florida....- Dave

------------------------------------------------

Brian Ashe pointed out this PBS-Cringely article to me:
http://www.pbs.org/cringely/pulpit/pulpit20021226.html

In it, he states:
    "When Intel introduced the 486 processor, it came with
     a 100 percent increase in the number of instructions
     executed per clock cycle compared to the 386 processor
     that preceded it.  When the Pentium came along, it
     offered a 90 percent increase in instructions per clock
     cycle.  The Pentium Pro had a 40 percent increase in
     instructions per clock cycle over the Pentium.  The
     Pentium II and III offered only a 20 percent improvement
     over the Pentium Pro.  And the Pentium 4 is worst of all,
     rated at 10 percent FEWER instructions per clock cycle
     than the Pentium III."
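
To see what the quoted percentages add up to, here is a rough sketch that
chains them together. The step values are taken straight from the quote; the
386 baseline of 1.0 and the names `steps`, `cumulative_ipc` and
`break_even_clock` are my own, purely for illustration.

```python
# Relative instructions-per-clock (IPC), compounding the quoted percentages.
# Baseline (an assumption for illustration): 386 = 1.0.
steps = [
    ("486", 1.00),             # +100% over the 386
    ("Pentium", 0.90),         # +90% over the 486
    ("Pentium Pro", 0.40),     # +40% over the Pentium
    ("Pentium II/III", 0.20),  # +20% over the Pentium Pro
    ("Pentium 4", -0.10),      # 10% FEWER than the Pentium III
]


def cumulative_ipc(steps, baseline=1.0):
    """Return [(name, IPC relative to baseline)] by compounding each step."""
    out = []
    ipc = baseline
    for name, gain in steps:
        ipc *= 1.0 + gain
        out.append((name, ipc))
    return out


relative_ipc = cumulative_ipc(steps)
for name, ipc in relative_ipc:
    print(f"{name:15s} {ipc:5.2f}x the 386")

# Clock ratio a Pentium 4 needs just to match a Pentium III at 10% lower IPC.
break_even_clock = 1.0 / (1.0 - 0.10)
print(f"P4 break-even clock ratio vs P3: {break_even_clock:.2f}x")
```

Taken at face value, the quote implies the Pentium 4 has to run roughly 11%
faster in MHz than a Pentium III just to break even.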

In looking at the "major design changes" of the processors, I can tell you
_why_ this is true.

i486:  50-100% speed increase over i386

- 18 month "refit" of i386
  (still scalable to ~100MHz+)
- BIGGIE:  Integrated L1 Cache
- Integrated FPU (floating point unit)

GTL (P5: Pentium):  -30 to +75% speed increase over i486

- Designed in ~5 years**
- **New 5-issue superscalar, pipelined design
  (up to -5% due to more scalable, pipelined design)
  (up to another -25% due to some poor ALU/fetch design though)
- Now scalable to ~300MHz+
- Out-of-order execution
- Branch prediction
- New pipelined FPU
- BIGGIE:  66MHz FSB (2x improvement over i486)
- BIGGIE:  Split 8+8KB instruction/data L1 Cache

GTL-MMX (P55: Pentium):  10 to 25% speed increase over P5
- BIGGIE:  Split 16+16KB instruction/data L1 Cache
- MMX instructions added integer SIMD (audio/2D-imaging)

GTL+ (P6: PPro-II/Klamath):  -5 to +50% speed increase over P5
- Designed in ~3 years**
- **New 7-issue superscalar design
  (-5% due to some extension of pipes to almost 10 stages)
- Now scalable to ~1GHz+
- 2-issue FPU now 1 complex + 1 ADD-only (but not complex+ADD simul)
- BIGGIE:  64-bit FSB (2x improvement over Pentium)
- L2 cache controller on-chip
- Up to 1MB L2 on-package (either Pro/split-die, II/Slot)
- New, redesigned ALU/fetch logic

GTL+L2 (P6+L2: II-III):  -5 to +25% speed increase over P6

- 12 month "refit" of P6 (still scalable to 1GHz)
  (no change in scalability)
- 100MHz FSB (50% improvement over P5/P6)
- L2 cache now "on-die" (not on-package in Pro/II)
  (-5% -- due to only up to 256KB on non-Xeon)
- Late:  SSE instructions added lossy float SIMD equivalent to AMD 3DNow!

P6-4 (Pentium 4):  -10 to -50% speed increase (i.e. decrease) over P6-L2
[ _Always_ slower MHz for MHz than P3 ]

- 18-month "incomplete/partial" redesign of P6 core**
- Extension of existing P6, 7-issue superscalar design
  (**-25 to -50% due to extension of pipes to almost 20 stages
   with absolutely _no_ redesign, unlike prior Intel core changes)
- Now scalable to ~5GHz+
- Dedicated 4x32-bit SSE "lossy math" unit
- PRO-CON:  100MHz Quad-Pumped Rambus FSB
  (4x improvement in throughput (good), but 4x increase in latency(BAD!))
- L1 Cache Redesign:  8KB data + 8,000 instructions (still ~32KB)
- New SSE2 instructions created for 100% marketing reasons

**NOTE:  Superscalar design is _tough_ to optimize.  The reason pipes are
extended is to achieve engineering "timing closure."  This usually _forces_
a good 3+ years of "major redesign" of the core.  Intel did _not_ do this
with the Pentium 4, hence the result.
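
For the layman, one way to picture the cost of longer pipes is the usual
textbook back-of-the-envelope model below. It is not from the original post.
It assumes a mispredicted branch costs about one cycle per pipeline stage, and
the workload numbers (20% branches, 5% mispredicted) are made up for
illustration. It covers only branch flushes, so it will not reproduce the
larger percentages quoted above.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, pipeline_stages):
    """Cycles per instruction when each mispredicted branch flushes the pipe.

    Simplifying assumption: a flush costs roughly one cycle per stage.
    """
    return base_cpi + branch_fraction * mispredict_rate * pipeline_stages


# Assumed workload: ideal CPI of 1.0, 20% branches, 5% of them mispredicted.
shallow = effective_cpi(1.0, 0.20, 0.05, 10)  # ~10-stage pipe (P6-like)
deep = effective_cpi(1.0, 0.20, 0.05, 20)     # ~20-stage pipe (P4-like)

# Extra cycles per instruction caused by pipe depth alone, as a fraction.
slowdown_per_clock = deep / shallow - 1.0

print(f"10 stages: CPI {shallow:.2f}")
print(f"20 stages: CPI {deep:.2f}")
print(f"per-clock slowdown from depth alone: {slowdown_per_clock:.1%}")
```

The point of the sketch: doubling the pipe length doubles the penalty term, so
unless the rest of the core is reworked to compensate, the work done per clock
goes down even as the achievable MHz goes up.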

Compare that with AMD's _current_ (not past) lineage:

NexGen Nx586:
- 8-year "clean room" design
- **Innovative 4-issue superscalar, fixed 32-bit ISA RISC core design
  ("RISC86" for variable bit CISC i386 compatibility)
- Scalable to ~150MHz+
- Branch prediction
- Out-of-order execution
- Proprietary 33-50MHz, 33-bit FSB
- 32KB L1 cache on-chip
- On-chip L2 controller

NexGen Nx586FP / AMD K5 (late models):  No change (FPU added)
- Now 5-issue with FPU, but FPU non-pipelined
- K5:  Intel GTL/P5 compatible

NexGen Nx686 / AMD K6:  10 to 30% speed increase over Nx586/K5
- Designed in ~3 years**
- **6-issue superscalar, fixed 32-bit ISA RISC core design
- Advanced ALU/fetch (3x performance of Pentium)
- Massive branch prediction redesign
  (super-accurate 97-98%, but overkill/20% of real-estate! )
- 2-issue FPU now 1 complex + 1 ADD/MULT (simul-okay)
  (but FPU still not pipelined -- used w/"Pentium errata fix" workaround)
- Scalable to ~500MHz+
- BIGGIE:  Split 32+32KB instruction/data L1 cache
- MMX support
- Late:  Native 3DNow! for lossy _and_ lossless 4x32-bit FPU support
- Later:  128-256KB on-die L2 cache

Athlon:  -10 to +100% speed increase over K6
- Designed in ~3 years**
- Heavily influenced by mass exodus of Digital Alpha chip designers
  (after Intel buyout)
- **9-issue superscalar, fixed 32-bit ISA RISC core design
  (-10% due to stage extensions to almost 20, but redesign
   to compensate _unlike_ P4)
- BIGGIE:  3-issue FPU 2 complex + ADD/MULT (simul-okay), fully pipelined
- Register renaming
- Scalable to ~3GHz+
- BIGGIE:  Alpha EV6 non-symmetric, point-to-point interconnect
- 64-bit x 100MHz DDR FSB (300% increase over K6)
- BIGGIE:  Split 64+64KB instruction/data L1 cache
- 512KB L2 "on-package" cache (EV6 allows up to 8MB)

Athlon Thunderbird/Thoroughbred:  -10 to +20% increase over Athlon
- "Repackage" of Athlon
- BIGGIE:  256KB on-die L2 cache
- 133MHz DDR FSB option
- TBred:  Microcoding of SSE instructions

Athlon Thoroughbred-A/Barton:  10 to +30% increase over TBird/TBred
- 12 month "refit" of Athlon
- New 9-layer redesign/optimization
- Barton:  512KB L2 cache
- 166-200MHz DDR FSB option
- Microcoding of SSE2 instructions

Hammer:  25 to +50% increase over TBred-A/Barton
- 64-bit transition of Athlon RISC86 core in ~2 years
- Still 9-issue, but with **64-bit opcodes and extra registers
  (aka "x86-64" -- doubles sizes of ALU to 64-bit, FPU to 128-bit/media)
- BIGGIE:  1-2 166-200MHz DDR (2.7-3.2GBps) memory controllers on-die
  (1 on low-end, 2 on high-end/MP -- memory is _dedicated_ to CPU)
- BIGGIE:  1-3 32+32/400MHz DDR (6.4GBps) HyperTransport interconnects on-die
  (1 on low-end, 3 on high-end/MP -- interconnect is direct to other CPUs/IO)
- 256KB-1MB L2 cache (256KB low-end, 1MB high-end)
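
The GBps figures in the Hammer list can be checked with simple peak-bandwidth
arithmetic. This is only a sketch: the 64-bit width of the memory controller
is my assumption (the post gives just the clock and the result), "DDR" is
taken as two transfers per clock, and `bandwidth_gbps` is a made-up helper
name.

```python
def bandwidth_gbps(width_bits, clock_mhz, transfers_per_clock=1):
    """Peak bandwidth in GB/s (1 GB = 10**9 bytes) of a parallel bus."""
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 * transfers_per_clock / 1e9


# Memory controller: assumed 64 bits wide, 166-200MHz, DDR (2 transfers/clock).
mem_low = bandwidth_gbps(64, 166, 2)
mem_high = bandwidth_gbps(64, 200, 2)

# HyperTransport as described above: 32 bits each way at 400MHz, DDR.
ht_one_way = bandwidth_gbps(32, 400, 2)
ht_total = 2 * ht_one_way

print(f"memory controller: {mem_low:.1f}-{mem_high:.1f} GB/s")
print(f"HyperTransport:    {ht_one_way:.1f} GB/s each way, {ht_total:.1f} GB/s total")
```

Under those assumptions the numbers line up with the 2.7-3.2GBps and 6.4GBps
quoted in the list, the latter being the sum of both directions.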

--
Bryan J. Smith, E.I. (BSECE)       Contact Info:  http://thebs.org
[ http://thebs.org/files/resume/BryanJonSmith_certifications.pdf ]
------------------------------------------------------------------
Microsoft states Linux's GPL is "viral" so I guess all the authors
in the US who require you to pay royalties to print their books
must be the digital "black plague."  Copyright is copyright and
the GPL prevents commercial use without a license from the holder.

_______________________________________________
https://ntlug.org/mailman/listinfo/discuss