10 Gbit NFS

With the advent of affordable 10 Gbit ethernet, iSCSI has become a viable Direct Attached Storage (DAS) or Storage Area Network (SAN) solution. But what about Network Attached Storage (NAS)? In particular, does 10 Gbit offer a significant performance benefit for networked storage protocols like the Network File System (NFS), which is commonly used by heterogeneous systems, including most *ix platforms such as Linux?

10 Gbit Throughput

In theory, the common ethernet speeds for LANs break down like this, performance-wise (in MB/s, since we're ultimately talking about storage performance):

 10Base-X    100Base-X   1000Base-X   10GBase-X
 10 Mbit     100 Mbit    1000 Mbit    10000 Mbit
 1.25 MB/s   12.5 MB/s   125 MB/s     1250 MB/s

So... theoretically speaking, without overhead, the best we can do network-wise in a single direction is 1250 MB/s. So how fast is our network really?

Our 10 Gbit Network Speed

Using iperf 2.0.4 across many, many test runs, the results consistently looked like this:

 # iperf -t 60 -c myserver
 ------------------------------------------------------------
 Client connecting to myserver, TCP port 5001
 TCP window size: 16.0 KByte (default)
 ------------------------------------------------------------
 [  3] local 192.168.1.2 port 57561 connected with 192.168.1.3 port 5001
 [ ID] Interval       Transfer     Bandwidth
 [  3]  0.0-60.0 sec  65.6 GBytes  9.39 Gbits/sec

So, for our network, the effective maximum is ~1174 MB/s (9.39 Gbit/s ÷ 8). The reason it falls a bit short of 10 Gbit is that our network does not use jumbo frames of size 9000, but rather the default of 1500 bytes per frame. Even at 1 Gbit speeds, using jumbo frames is preferred, but because our network is heterogeneous, we have chosen not to modify the frame size. Unfortunately, at 10 Gbit speeds, this small inefficiency due to frame size turns into a measurable performance loss. Even so, as the earlier chart shows, we can expect significant performance improvements by moving to 10 Gbit. Note: even with jumbo frames, our network would likely not reach the theoretical maximum rate.
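
For reference, had we wanted jumbo frames, the change amounts to raising the MTU on every host and switch port in the path. A minimal sketch for one Linux host follows; the interface name eth2 is an assumption, and the change is not persistent across reboots:

 # Check the current MTU (default is 1500)
 ip link show eth2 | grep mtu
 # Raise it to 9000 for jumbo frames (the switch and the peer must match)
 ip link set dev eth2 mtu 9000
 # Re-run the same iperf test to see the difference
 iperf -t 60 -c myserver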

Testing 10 Gbit NFS

Hardware Specifications

Server

ProLiant BL460c G6 Blade
2 x X5550 Xeon
48G Memory
SLES 11 SP1 (2010/07/29)
300G LV off a 1.46TB RAID60
2 x SATABoy (see below)
NFS export = rw, async

Client

ProLiant BL460c G6 Blade
2 x X5550 Xeon
4G Memory
SLES 11

Storage

Nexsan SATABoy
14 x 1TB 7200rpm SATA RAID6
2 x 2GB cache (mirror config, 1G logical size)

The storage is limited by a 4 Gbit (500 MB/s) Fibre Channel infrastructure. However, we are striping across two 4 Gbit pathways.
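
For illustration only, one way such a striped volume could be laid out with LVM is sketched below; the device names, volume group name, and stripe size are assumptions, not the actual configuration:

 # One RAID6 LUN from each SATABoy becomes a physical volume (device names assumed)
 pvcreate /dev/mapper/sataboy1_lun0 /dev/mapper/sataboy2_lun0
 vgcreate vg_raid60 /dev/mapper/sataboy1_lun0 /dev/mapper/sataboy2_lun0
 # Stripe the 300G logical volume across both arrays (-i 2 stripes, 256K stripe size)
 lvcreate -i 2 -I 256 -L 300G -n lv_raid60 vg_raid60
 # The exported filesystem is reiserfs
 mkfs.reiserfs /dev/vg_raid60/lv_raid60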

NFS Specification

The number of nfsd threads on the server platform has been raised to 128 (the default is something pretty low, on the order of 4).
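
A minimal sketch of raising the thread count on a running Linux NFS server (to make it persistent, set the corresponding value in your distribution's NFS sysconfig, e.g. /etc/sysconfig/nfs on SLES):

 # Bump the number of kernel NFS server threads immediately
 rpc.nfsd 128
 # Verify the running thread count
 cat /proc/fs/nfsd/threads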

With regard to rsize and wsize, we are NOT setting those. Since both the client and server are new enough, NFS on Linux will supposedly negotiate the largest supported values for those parameters automatically when they are not set. Recent Linux NFS implementations cap these at 1MB (in earlier 2.6 and older kernels, the maximum was 32K).
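
To confirm what was actually negotiated, the mount options can be inspected on the client after mounting. A sketch, using the same myserver placeholder as in the iperf run and assuming the export is mounted at /raid60:

 # Mount without specifying rsize/wsize and let client and server negotiate
 mount -t nfs myserver:/raid60 /raid60
 # Show the options actually in effect, including the negotiated rsize/wsize
 nfsstat -m
 # Alternatively, look at the kernel's view of the mount
 grep raid60 /proc/mounts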

Performance with the NFS sync export option turned on is not all that different EXCEPT where metadata operations (file creates and deletes) are concerned; there the performance impact is huge. Therefore, like most high end commercial NFS based NAS systems, we have enabled the async option on the server.
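
A sketch of what the export might look like with async enabled; the client subnet below is an assumption, and only rw and async come from the setup described above. In /etc/exports on the server:

 /raid60   192.168.1.0/24(rw,async)

then re-export and verify the active options:

 exportfs -ra
 exportfs -v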

Bonnie++, Good and Yet Hated

Bonnie++ is one of the best and easiest to run overall disk performance tests ever created. Unlike its predecessor (bonnie), bonnie++ does a good job of testing the actual storage and filesystem rather than letting cache skew the numbers: by default, bonnie++ will use data sizes that are twice as large as main memory. In our case, the client sees slightly less than the 4 GB we requested, so bonnie++ will choose ~7G for its data loads.

 $ cd /raid60
 $ bonnie
 Version 1.01d       ------Sequential Output------ --Sequential Input- --Random-
                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
 Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
 myclient         7G 282026  97 549537  35 337134  44 319974  99 1140086  53  1545   2
                     ------Sequential Create------ --------Random Create--------
                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                  16  4812  23 +++++ +++  5041  13  4823  16 +++++ +++  4943  12

My own personal experience with bonnie++ is that it does a really good job of showing the top performance of a device and/or filesystem, and I find the Sequential Output Rewrite value to be indicative of overall mixed read/write performance.

So... why the hate? I can only guess that, because of its predecessor and/or some early bug, bonnie++ has received a bad reputation. With that said, we'd like to use a test that others might accept (even though bonnie++ is showing us reasonable data here; notice the 1140 MB/s on Block Sequential Input).

Enter Iozone, Good (but NOT by default) and Accepted

Unlike bonnie++, iozone does not attempt to remove cache from the equation by default. Yet most iozone results I've come across were produced with exactly those defaults, which means there is a LOT of bad iozone data out there.

To avoid cache skew, we need to use iozone file sizes that are at least twice the size of main memory. In our case we already know that 8G is more than twice the client's memory. Iozone has many test modes for reading and writing (more than bonnie++, but not too many more), and we can tell iozone to test using different record sizes. To simulate a default full set of tests, we'll run 8GB file size tests using 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192 and 16384K record sizes. Between each test run, we'll unmount the RAID60 NFS area and remount it in order to flush out any lingering cache.

 #!/bin/sh
 # Run iozone over a range of record sizes, remounting the NFS area
 # between runs to minimize cache effects.
 size=$1
 size=${size:=8g}
 PATH=$PATH:/iozonepath
 recs="4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384"

 for rec in $recs; do
         echo ""
         echo "**** $rec ****"
         echo ""
         umount /raid60
         mount /raid60

         iozone -s $size -r $rec -z -R -c -f /raid60/t1 -b exceloutput-$size-$rec.xls
 done
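
Saved as, say, iozone-nfs.sh (the name is arbitrary), the script is run as:

 sh ./iozone-nfs.sh 8g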

If you choose to run this script, realize that it will take some time to complete, so make sure you adjust the size appropriately. Obviously, a size of 96 GB would take a VERY long time. This is why our client was booted with mem=4G: it reduces the amount of data we need for proper testing.
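
On SLES 11 with GRUB legacy, limiting the visible memory is just a kernel command line change; a sketch, with the kernel and root device strings left as placeholders:

 # /boot/grub/menu.lst on the client: append mem=4G to the kernel line, e.g.
 #   kernel /boot/vmlinuz-<version> root=<rootdev> ... mem=4G
 # After rebooting, confirm the reduced memory size:
 free -g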

Results and Analysis (Good Guesses)

All throughput values below are in KB/s; the record sizes across the top are in KB.

                      4       8      16      32      64     128     256     512    1024    2048    4096    8192   16384
 Writer          373604  511394  378316  532318  422484  520426  506226  347918  539246  514660  427374  375381  503549
 Re-writer       557597  587929  603411  616378  619852  492813  616100  594596  620238  615954  607980  584887  594909
 Random Write    169422  301192  438139  530659  659232  683237  624119  604693  627939  544939  587802  458906  592918
 Fwrite          551093  596047  595864  593954  608940  579083  604210  510357  604417  604438  628514  621011  596620
 Fre-write       575070  523477  633179  618556  595979  587438  641082  593445  612130  616077  627803  568953  617020
 Reader         1125109 1106906 1119079 1129239 1143768 1123870 1126765 1131026 1130682 1117520 1137648 1120943 1134078
 Re-reader      1145572 1145670 1139748 1142953 1142053 1143213 1145632 1145852 1144405 1141477 1143577 1145902 1140414
 Random Read      40816   73770  102192  175074  288459  414093  577390  641960  708604  777349  818183  875123  538189
 Backward Read    39136   76023  126120  184228  291809  400598  512577  628363  732667  816975  847590  832486  922863
 Stride Read      39889   78693  132210  183728  289212  422902  520578  623990  763267  783725  924244  951869  698765
 Fread          1140593 1142420 1145034 1144743 1145380 1144694 1141816 1145747 1145736 1145056 1145395 1145810 1103943
 Fre-read       1134043 1144766 1145536 1145289 1145778 1139575 1111642 1144639 1144287 1146070 1145480 1145686 1137543

Sometimes it helps to visualize the data so that trends can be seen:

[Chart: iozone throughput (KB/s) vs. record size for each test]

You can see from the diagram that most of the reads exhibit results similar to bonnie++. Writes are supposed to be limited (in theory) by the 4 Gbit FC to the RAID subsystem. However, we HAVE created a RAID 60 across two different storage arrays, so writes are striped across dual 4 Gbit connections.

Observations and Guesses

  1. Sequential read operations are consistent and fast at ~1130+ MB/s. However, even at 2 x 4 Gbit FC speeds, the maximum read throughput should be less than 1000 MB/s (see the quick arithmetic after this list). My guess is that we're getting the benefit of read-ahead caching on the NFS server.
  2. Write operations show the limitations of RAID 6 (RAID 60) on the storage device. The filesystem, reiserfs, comes into play as well, since it is a journaled filesystem. With that said, the NFS system itself will try to hide some of that with memory caching, both on the NFS server host and in the 2 GB cache on the controllers of the two storage units.
  3. There is an apparent "sweet spot" as record sizes reach 256K and 512K, where the NFS server delivers better performance for non-sequential read operations and random writes. Lower record sizes severely impact performance on writes, random reads and backward reads; however, this is NOT all that unusual.
  4. There is a very pronounced "ping pong" bounce on writes: record size 4K is less than 400 MB/s, 8K is over 500 MB/s, 16K is back under 400 MB/s, 32K is over 500 MB/s again, and so on.
  5. Bonnie++ shows us about ~1140 MB/s on reads, and we're seeing about the same via iozone. Bonnie++ shows ~550 MB/s on writes, and again iozone roughly agrees. Via bonnie++, Sequential Output Rewrite is ~340 MB/s; iozone seems to indicate that we'll see better performance than that in general. I would expect general mixed read/write workloads to land between 550 MB/s and 900 MB/s.
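
As a quick sanity check on point 1 (the 400 MB/s per-link figure is the commonly quoted usable rate for 4 Gbit FC with 8b/10b encoding, not something we measured here):

 # Naive ceiling, ignoring encoding and protocol overhead: 2 links x 4000 Mbit / 8
 echo $(( 2 * 4000 / 8 ))   # 1000 MB/s
 # With ~400 MB/s usable per 4 Gbit FC link, the realistic striped ceiling is closer to
 echo $(( 2 * 400 ))        # 800 MB/s

Either way, ~1130 MB/s of sequential read throughput is more than the FC back end can deliver, which is why server-side caching and read-ahead is the likely explanation.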

Phoronix Test Suite Results

http://global.phoronix-test-suite.com/?k=profile&u=cjcox-17384-3855-8027
