[NTLUG:Discuss] Re: true hardware/intelligent ATA RAID -- XScale SATA RAID solutions arrive (including PCIe)
Bryan J. Smith
b.j.smith at ieee.org
Tue Nov 23 07:54:28 CST 2004
On Tue, 2004-11-23 at 02:52, Robert Pearson wrote:
> I'm a little confused about all these bus types. In particular the
> names and I/O rates. I did a little Googling and found the
> following---
> Older bus type standards such as ISA, EISA, PCI and PCI-X.
> Newer bus type standards such as HyperTransport, PCI Express and RIO.
One thing to understand is that there are "system" interconnects and
there are "peripheral" interconnects. "System" interconnects are
_rarely_ touched by end-users; they are hardwired. "Peripheral"
interconnects are for I/O boards, as well as integrated I/O in the
chipsets.
In the PC world, "system" interconnects are _under_ utilized. E.g.,
Intel _still_ uses "peripheral" interconnects for a lot of system
functionality (north-south, CPU to I/O, etc...). Prior to PCI-Express
(PCIe), Intel _still_ only used a 0.25-0.5GBps 64-bit PCI connection
between the MCH (north) and ICH (south). With PCI-Express, it now
merely uses 2-4 PCIe x1 channels.
AMD is completely "system" for _all_ chips thanx to universal
HyperTransport until you get to the actual "peripherals" (I/O
chips/boards) themselves. Engineers "directly connect" different
HyperTransport tunnels and bridges together. There are even mainboards
from some OEMs that use an nVidia "main" chipset (for AGP/PCI or PCIe)
alongside an AMD8131 dual-PCI-X bridge (for more traditional server/PCI
I/O).
ALi, nVidia and SiS use HyperTransport almost _entirely_, including for
the north-to-south link on their Intel chipsets as well as their AMD
ones.
VIA still uses its 0.5GBps 64-bit PCI "VLink" in both its AMD and Intel
chipsets.
> "Such as" usually means there are more bus type standards but these
> are the most common or popular. I never heard of RIO. I once kept
> track of Infiniband and all its competitors but that was 4-5 years ago
> and these buses are just now making it to the marketplace. Infiniband
> appears to be dead.
Infiniband is still alive for proprietary systems. But it's a
non-native approach for Intel processors. I could really get into it
all.
For the most part, Intel does _not_ offer a "commodity" system
interconnect like AMD HyperTransport -- at least not in quantity yet.
Even EM64T is still 100% "front side bottleneck" AGTL+ -- which has some
_severe_ 32-bit/4GB limitations too. This is why current Intel EM64T
processors lack the equivalent of AMD64's I/O MMU; it's impossible to
implement.
On the IA-64/Itanium side, Intel's "Scalable Node" interconnect is still
a "front-side bottleneck" approach too (although at least 53-bit ;-).
Most 2+ processor IA-64 implementations, like those from HP, IBM, SGI
and others, use a proprietary NUMA/system interconnect.
> This is the guy who was describing the bus types---
> Brian Holden is a Principal Engineer in the Microprocessor Products
> Division at PMC-Sierra, who is involved in the development and
> standardization of processor busses.
> Mr. Holden is currently the Technical Chair of the HyperTransport
> Consortium and serves on the board of the Network Processing Forum.
HyperTransport is the "commodity, universal 'system' interconnect dream"
everyone has longed for. Not surprisingly, AMD didn't invent it; it
came from API Networks, formerly Alpha Processor, Inc. (API), formerly
(to an extent) Digital Semiconductor -- the inventor of the first PCI,
AGP, 10/100 NIC and other ASIC chips. ;->
> This is very good news. Will future change be by orders of magnitude,
> linearly, geometrically or we will just have to wait and see? Disk
> capacities sort of doubled each time they increased.
It's not the disk. It's the "hardware RAID" controller.
Who cares if you can get 80MBps _per_ SATA drive if your "hardware ATA
RAID" uses a non-superscalar i960 microcontroller that can't push more
than 50-60MBps between local DRAM and the peripheral bus (and system
memory)? That puts the bottleneck on the card, not the disk.
That's where XScale comes in, offering more like 500-600MBps between
local DRAM and the peripheral bus interconnect. Now it's no longer
about the card, but the disks again.
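To make that concrete, here's a rough back-of-envelope sketch in
Python. The drive count and the 60MBps/550MBps controller figures are
illustrative assumptions, roughly the numbers discussed above:

def array_throughput_mbps(drives, per_drive_mbps, controller_mbps, bus_mbps):
    # Effective throughput is capped by the slowest stage: the aggregate
    # disk rate, the controller's DRAM<->bus rate, or the peripheral bus.
    return min(drives * per_drive_mbps, controller_mbps, bus_mbps)

# Hypothetical 8-drive array on a ~500MBps (64-bit/66MHz PCI) bus:
print(array_throughput_mbps(8, 80, controller_mbps=60, bus_mbps=500))
# -> 60: an old i960-class card is the limit
print(array_throughput_mbps(8, 80, controller_mbps=550, bus_mbps=500))
# -> 500: with an XScale-class card, the bus/disks are the limit again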
> Did I read that right? 300 Megabytes per second?
That's just the _maximum_ transfer rate of the point-to-point SATA
channel. SATA
drives are _no_faster_ than ATA drives -- they come off the same line.
50-80MBps burst is typical these days.
> Good input.
Yeah. Most ATA vendors state right on the model specifications that
they are only rated for 14x5 operation with a 1 year warranty. Many
vendors are now selling OEMs and system integrators an "enterprise"
model ATA drive that is tested 24x7 for at least 3 years. These are
typically the same drives, but they are tested to higher tolerances
than the "consumer" ones sold cheaper.
> This means that PCI-Express is changing its name to PCle?
No. That's the abbreviated name. That's all.
PCI-Express = PCIe
> Do you believe I can achieve five nines "99999" of Information High
> Availability with ATA RAID
Yes. As long as you use spares or, possibly, newer approaches like
RAID-6 (2 parity) or RAID-51 (mirrored RAID-5).
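For a rough feel of what "five nines" demands, here's a back-of-
envelope sketch. The MTBF/MTTR numbers are purely illustrative
assumptions, not anyone's drive specs, and it treats failures as
independent:

# Illustrative availability math -- MTBF/MTTR below are assumed numbers,
# not real drive specifications.
MTBF_HOURS = 500_000   # assumed mean time between failures, per drive
MTTR_HOURS = 24        # assumed time to detect, swap and rebuild

# Chance a single drive is down at any given instant:
p_down = MTTR_HOURS / (MTBF_HOURS + MTTR_HOURS)

# A 2-way mirror loses access only if both drives are down at once:
print(f"single drive: {1 - p_down:.5%} available")
print(f"2-way mirror: {1 - p_down**2:.7%} available")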
> or FRAID?
No. FRAID is 100% software RAID, even if it comes on a card or
on-mainboard. Once the 32/64-bit OS boots, the RAID logic is 100% in
software. FRAID cards do _not_ have any on-board intelligence at all.
For software RAID-1 or RAID-10 mirroring, you're pumping 2x as much data
over the memory-IO interconnect.
For software RAID-3, 4 or 5, you're pumping N-1 times (where N = number
of disks) as much data over the CPU-memory-IO system interconnect -- and
that's now 2 hops! I.e., in a 5 disc RAID-3/4/5, you end up pushing 4
copies of _all_ data not only into system memory, but then from system
memory to the CPU. It's not the XOR that kills the CPU, it's the data
copy through 2 hops of the system interconnect that ties it up
unnecessarily. Hence it's the equivalent interconnect overhead of an 8
disc RAID-1/10!
For software RAID-6 or 51, you're now talking an effective overhead of
4*(N-1). So for a 5 disc RAID-6/51, you're talking 16 times as much
system interconnect utilization versus an 8 disc RAID-1/10.
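A minimal sketch of those overhead figures, taking the multipliers
exactly as stated above (2x for mirroring, N-1 copies over 2 hops for
striped parity, 4*(N-1) for RAID-6/51):

def interconnect_overhead(level, n_disks):
    # How many times the payload crosses the system interconnect,
    # per the figures above -- a sketch of the argument, not a benchmark.
    if level in ("1", "10"):
        return 2                    # mirroring: data is pumped twice
    if level in ("3", "4", "5"):
        return 2 * (n_disks - 1)    # N-1 copies, over 2 hops
    if level in ("6", "51"):
        return 4 * (n_disks - 1)    # double parity / mirrored RAID-5
    raise ValueError(f"unknown RAID level: {level}")

print(interconnect_overhead("5", 5))    # 8
print(interconnect_overhead("6", 5))    # 16
print(interconnect_overhead("10", 8))   # 2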
> How much dispersion or how many mirrors do I need to achieve that?
That's independent of the ASIC v. XScale v. Software question.
At this point, I'd recommend 2 mirror or parity drives with ATA, or at
least 1 mirror/parity drive with 1 standby.
> Or does it take Xscale SATA RAID?
Again, that's independent of the argument. Using XScale intelligence
on-card is _only_ a _performance_ consideration, not a reliability one.
Although it _does_ mean it _can_ rebuild RAID-1/10/51/6 volumes faster
than software/FRAID.
My mention of ATA RAID products finally shipping XScale on-board was
basically a mention of "yeah, now we _finally_ have microcontroller-
based hardware ATA RAID that can compete with 3Ware's ASIC approach to
hardware ATA RAID." Before XScale, the i960/IOP30x were _slouches_
designed for the '90s.
> The Xscale SATA RAID performance at an ATA price would be very nice. I
> have read that SATA drives are not as robust as ATA which surprised
> me.
Same difference -- they're the exact same drives. ATA and SATA drives
come off the same line, tested to the same tolerances.
Unless you mean the "enterprise" models. Then those are still typically
the exact same drives, but tested to higher tolerances, and priced as
such.
> Good idea. Which bus would support which functions?
It's not really a matter of "throughput."
It's really more of a matter of "cost v. compatibility."
PCI-X is backward compatible, but is more costly.
PCI-Express (PCIe) is not backward compatible, but x1 runs are cheap.
You will typically only find PCI-X in dual-processor mainboards that
start at $300+.
You will soon find PCIe x1 channels in sub-$100 mainboards.
BTW, PCIe x4 is really the "equivalent" of a single 133MHz PCI-X slot.
It is even rarer, and found more on the $300+ mainboards.
Unless, of course, you use your x16 graphics slot for x4 I/O. That's an
option I guess.
> Amen. I hope I live to see this.
> How many Megabytes per second would
> you expect to see from this configuration?
3Ware has been claiming over 250MBps in RAID-0 reads/writes and RAID-5
reads, over 200MBps in RAID-10 reads/writes, and 125MBps in RAID-5
writes with its 64-bit, non-blocking I/O ASIC+SRAM approach on its
0.5GBps (64-bit PCI @ 66MHz) Escalade products.
Again, anything that uses an old i960/IOP30x probably couldn't break
60MBps. But with XScale, the bottleneck is now back on the disks. So
beyond that, when it comes to the _bus_, it doesn't matter.
God knows that _anything_ is better than ultra-crappy, 0.125GBps
"shared" PCI. You can't put a cheap audio card on that shared PCI bus
these days if you have a recent hard drive or GbE network. That's why
people are getting "latency" (stutter/break-up) in audio output: the
disk and network are bursting through at beyond 50MBps.
Dedicated PCI-X and PCIe channels address that. So it, again, becomes a
matter of the on-board intelligence and disc transfer rates.
> How about we just bypass the bus all-together? Let's just use direct
> optical links to the high speed destination-source pairs?
Er, first off, optical doesn't give you anything over copper here.
Secondly, "direct links" to what? Main memory? Now we have to arbitrate
that, or make the memory dual-ported.
In a nutshell, here's the flow in a storage device:
system memory -> system interconnect
system interconnect -> peripheral interconnect
peripheral interconnect -> storage intelligence
storage intelligence -> disc
If your peripheral interconnect is a bottleneck, like 133MBps
(0.125GBps) "shared" PCI is, then it doesn't matter what your card can
do.
If your storage intelligence is a bottleneck, like a 40-60MBps
i960/IOP30x processor, it doesn't matter what your bus is.
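In code, that flow is just a chain of stages where the slowest link
sets the ceiling. The rates below are the rough figures used in this
thread and are only for illustration:

# The flow above as a chain of stages -- the slowest link sets the ceiling.
stages = {
    "system interconnect":      5000,  # MBps, HyperTransport-class
    "peripheral interconnect":   500,  # 64-bit/66MHz PCI or a few PCIe lanes
    "storage intelligence":       60,  # old i960/IOP30x-class controller
    "discs (aggregate burst)":   640,  # e.g. 8 drives at ~80MBps
}

bottleneck = min(stages, key=stages.get)
print(f"bottleneck: {bottleneck} @ {stages[bottleneck]} MBps")
# Swap in an XScale-class controller (~550 MBps) and the peripheral
# interconnect or the discs become the limit instead.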
> We should be able to design a mobo that allows that.
Impossible in Intel Xeon/AGTL+ or IA-64/SNA.
Now I _guess_ you _could_ create a HyperTransport tunnel/bridge for
AMD. But HyperTransport is a "system" interconnect, not a "peripheral"
one. I.e., it would have to be fixed to a mainboard.
Considering that the maximum throughput really achievable today with 12 disc,
non-blocking I/O ATA is maybe 500MBps (0.5GBps), that's still an order
of magnitude less than the typical 5GBps+ of system interconnects.
> Instead of having TOE bus cards we have TOE chips in the set.
Same difference. The "set" is just a "peripheral" bus arbitrator to the
"system" interconnect. I fail to see how this would work.
Unless you somehow get Intel to radically change AGTL+. They haven't in
11 years and it looks like they are adopting IA-64/SCN in a "unified
socket" for Itanium and Xeon circa 2006.
And in the case of HyperTransport, you're still not even talking 1/10th
of what it is capable of. Storage belongs on the peripheral bus. About
the _only_ thing that could be argued for inclusion on the "system" bus
is video.
> Maybe we should think vertical on mobo's?
Again, I'm trying to figure out what you're saying. A block diagram would
help -- fire up OpenDraw and send it to me.
Otherwise, the 0.25-1.0GBps performance of PCIe x1-4 and 66-133MHz PCI-X
is quite sufficient for even a 12-disc, RAID-0 volume.
-- Bryan
--
Bryan J. Smith b.j.smith at ieee.org
--------------------------------------------------------------------
Subtotal Cost of Ownership (SCO) for Windows being less than Linux
Total Cost of Ownership (TCO) assumes experts for the former, costly
retraining for the latter, omitted "software assurance" costs in
compatible desktop OS/apps for the former, no free/legacy reuse for
latter, and no basic security, patch or downtime comparison at all.