#1
Writing to block device is *slower* than writing to the filesystem!?
Hi all,
we have a new machine with 3ware 9650SE controllers and I am testing hardware RAID versus Linux software MD RAID performance. For now I am on hardware RAID: I have set up a RAID-0 with 14 drives.

If I create an XFS filesystem on it (whole device, no partitioning, stripe-aligned during mkfs, etc.) and then write to a file with dd (or with bonnie++) like this:

  sync ; echo 3 > /proc/sys/vm/drop_caches ; dd if=/dev/zero of=/mnt/tmp/ddtry bs=1M count=6000 conv=fsync ; time sync

about 540 MB/sec comes out (the final sync takes 0 seconds). This is close to 3ware's declared performance of 561 MB/sec: http://www.3ware.com/KB/Article.aspx?id=15300

However, if I instead write directly to the block device like this:

  sync ; echo 3 > /proc/sys/vm/drop_caches ; dd if=/dev/zero of=/dev/sdc bs=1M count=6000 conv=fsync ; time sync

performance is 260 MB/sec!? (The final sync again takes 0 seconds.)

I tried many times and this is the absolute fastest I could obtain. I tweaked the bs and the count, I removed the conv=fsync, I ensured the 3ware caches are ON on the block device, I set the anticipatory scheduler... No way. I am positive that creating the XFS filesystem and writing to it is definitely faster than writing to the block device directly. How can that be!? Does anyone know what's happening?

Please note that the machine is absolutely clean and there is no other workload. I am running kernel 2.6.31 (Ubuntu 9.10 alpha live).

Thank you
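For reference, the stripe-aligned mkfs was along these lines (a sketch only: the su/sw values are an assumption matching a 256K chunk across 14 data disks, not necessarily the exact command used):

  # assumed geometry: 256 KiB chunk (su) striped across 14 data disks (sw)
  mkfs.xfs -d su=256k,sw=14 /dev/sdc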
#2
Writing to block device is *slower* than writing to the filesystem!?
On Fri, 07 Aug 2009 14:30:11 +0200, kkkk wrote:
> Hi all, we have a new machine with 3ware 9650SE controllers and I am
> testing hardware RAID versus Linux software MD RAID performance. For now
> I am on hardware RAID: I have set up a RAID-0 with 14 drives. If I
> create an XFS filesystem on it (whole device, no partitioning,
> stripe-aligned during mkfs, etc.) and then write to a file with dd (or
> with bonnie++) like this:
>   sync ; echo 3 > /proc/sys/vm/drop_caches ; dd if=/dev/zero of=/mnt/tmp/ddtry bs=1M count=6000 conv=fsync ; time sync
> about 540 MB/sec comes out (the final sync takes 0 seconds). This is
> close to 3ware's declared performance of 561 MB/sec:
> http://www.3ware.com/KB/Article.aspx?id=15300
> However, if I instead write directly to the block device like this:
>   sync ; echo 3 > /proc/sys/vm/drop_caches ; dd if=/dev/zero of=/dev/sdc bs=1M count=6000 conv=fsync ; time sync
> performance is 260 MB/sec!? (The final sync again takes 0 seconds.)

I haven't played with UNIX since I retired in 1994, but here are some suggestions:

- Does dd buffer correctly? (It probably does, but it is good to check.)

- Compare the I/O counts for both methods, and compare the CPU time for both methods. I remember issues where the dummy devices used a small block size (record size, or some such), so that there were high I/O counts and therefore high CPU use when we didn't expect it.

- Can you report the results for each of the various bs values that you used? You say that you tried various values and are just reporting the best result, but it would be nice to see how things are affected as the block size changes.

- Perhaps the filesystem is smart enough to avoid some movement between buffers that writing to a block device has to do. A difference in CPU use might be an indication of this, but a difference could also have other causes.

(Don't laugh at my experience being so old: I've seen a couple of problems reported this year [2009] that were the same as ones I saw before 1975. And that doesn't count all of the buffer overflow crap that was solved in hardware before 1961.)

> I tried many times and this is the absolute fastest I could obtain. I
> tweaked the bs and the count, I removed the conv=fsync, I ensured the
> 3ware caches are ON on the block device, I set the anticipatory
> scheduler... No way. I am positive that creating the XFS filesystem and
> writing to it is definitely faster than writing to the block device
> directly. How can that be!? Does anyone know what's happening? Please
> note that the machine is absolutely clean and there is no other
> workload. I am running kernel 2.6.31 (Ubuntu 9.10 alpha live). Thank you
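Something like the following would show both, as a sketch (it assumes /dev/sdc is the test array and that GNU time and sysstat's iostat are installed):

  # CPU time for each variant: compare the 'sys' figures reported by time
  /usr/bin/time -v dd if=/dev/zero of=/mnt/tmp/ddtry bs=1M count=6000 conv=fsync
  /usr/bin/time -v dd if=/dev/zero of=/dev/sdc bs=1M count=6000 conv=fsync
  # in another terminal while each dd runs: per-device request counts and sizes
  iostat -x sdc 1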
#3
Writing to block device is *slower* than writing to the filesystem!?
On Aug 7, 5:30 am, kkkk wrote:
> If I create an XFS filesystem on it (whole device, no partitioning,
> stripe-aligned during mkfs, etc.) and then write to a file with dd (or
> with bonnie++) like this:
>   sync ; echo 3 > /proc/sys/vm/drop_caches ; dd if=/dev/zero of=/mnt/tmp/ddtry bs=1M count=6000 conv=fsync ; time sync
> about 540 MB/sec comes out (the final sync takes 0 seconds). This is
> close to 3ware's declared performance of 561 MB/sec:
> http://www.3ware.com/KB/Article.aspx?id=15300
> However, if I instead write directly to the block device like this:
>   sync ; echo 3 > /proc/sys/vm/drop_caches ; dd if=/dev/zero of=/dev/sdc bs=1M count=6000 conv=fsync ; time sync
> performance is 260 MB/sec!? (The final sync again takes 0 seconds.) I
> tried many times and this is the absolute fastest I could obtain. I
> tweaked the bs and the count, I removed the conv=fsync, I ensured the
> 3ware caches are ON on the block device, I set the anticipatory
> scheduler... No way. I am positive that creating the XFS filesystem and
> writing to it is definitely faster than writing to the block device
> directly. How can that be!? Does anyone know what's happening?

There could be a lot of reasons, but the most likely is that they're writing to opposite ends of the drive. To test, put a 'seek=' in your 'dd' to the block device and see if larger seeks result in higher speeds.

DS
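Something like this, for instance (only a sketch: SEEK_MB is a placeholder you have to fill in yourself, it must leave at least 6000 MiB before the end of the device, and the write is of course destructive):

  # with bs=1M, seek= is counted in MiB-sized output blocks
  dd if=/dev/zero of=/dev/sdc bs=1M seek=SEEK_MB count=6000 conv=fsync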
#4
Writing to block device is *slower* than writing to the filesystem!?
David Schwartz wrote:
> There could be a lot of reasons, but the most likely is that they're
> writing to opposite ends of the drive. To test, put a 'seek=' in your
> 'dd' to the block device and see if larger seeks result in higher
> speeds.

Nope, it's not that. I seeked, as you suggested, to the end of the device, and the speed is not significantly different: writing to the device goes from 239 MB/sec down to 233 MB/sec (it is actually a bit faster at the beginning). I am positive that the seek value I used for dd is correct, because when I tried to raise it a bit further it gave me an error:

  dd: `/dev/sdc': cannot seek: Invalid argument

Next idea...? Thank you!
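For what it's worth, the largest valid seek can be derived from the device size, something like this (a sketch; 1048576 because bs=1M):

  # device size in bytes, and in whole 1 MiB blocks; seeking past that block
  # count is what produces the "cannot seek: Invalid argument" error above
  blockdev --getsize64 /dev/sdc
  echo $(( $(blockdev --getsize64 /dev/sdc) / 1048576 ))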
#5
Writing to block device is *slower* than writing to the filesystem!?
kkkk wrote:
> Hi all,
> we have a new machine with 3ware 9650SE controllers and I am testing ...

I found it! I found it!

dd apparently does not buffer writes correctly (good catch, Mark): it apparently disregards the bs value and submits very small writes. It needs oflag=direct to really honour bs, and even then there is a limit. Also, elevator merging of the small writes does not try hard enough and cannot achieve good throughput. More details tomorrow.
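In other words, the variant that behaves as expected here is roughly this (a sketch of the direct-I/O form of the original test):

  # O_DIRECT bypasses the page cache, so each 1 MiB write() from dd reaches the
  # block layer directly (it may still be split to the queue's maximum request size)
  sync ; echo 3 > /proc/sys/vm/drop_caches
  dd if=/dev/zero of=/dev/sdc bs=1M count=6000 oflag=direct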
#6
Writing to block device is *slower* than writing to the filesystem!?
In article <...>, kkkk wrote:

> kkkk wrote:
> > Hi all,
> > we have a new machine with 3ware 9650SE controllers and I am testing ...
>
> I found it! I found it!
>
> dd apparently does not buffer writes correctly (good catch, Mark): it
> apparently disregards the bs value and submits very small writes. It
> needs oflag=direct to really honour bs, and even then there is a limit.
> Also, elevator merging of the small writes does not try hard enough and
> cannot achieve good throughput. More details tomorrow.

Curious. I'm not seeing that behavior in either CentOS 5 or Fedora 11 (coreutils-5.97-19.el5, coreutils-7.2-2.fc11). In both of those, when I run:

  strace dd if=/dev/zero bs=1M count=1 of=somefile conv=fsync

I see exactly one read and one write, each of size 1048576.

--
Bob Nichols AT comcast.net I am "RNichols42"
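The same check can be pointed at the raw device (a sketch; note it overwrites the first MiB of whatever is on sdc):

  # show the sizes of the read()/write() calls dd actually issues
  strace -e trace=read,write dd if=/dev/zero of=/dev/sdc bs=1M count=1 conv=fsync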
#7
Writing to block device is *slower* than writing to the filesystem!?
Robert Nichols wrote:
> Curious. I'm not seeing that behavior in either CentOS 5 or Fedora 11
> (coreutils-5.97-19.el5, coreutils-7.2-2.fc11). In both of those, when I
> run:
>   strace dd if=/dev/zero bs=1M count=1 of=somefile conv=fsync
> I see exactly one read and one write, each of size 1048576.

I haven't straced it, but this is what appears from iostat -x 1 (grabbed from a live iostat).

Without direct (bs=1M):

  Device:  rrqm/s     wrqm/s  r/s      w/s     rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
  sdc        0.00  559294.00  0.00 14384.00     0.00  570550.00      39.67    143.98   9.96   0.07 100.00

With direct (bs=1M):

  Device:  rrqm/s     wrqm/s  r/s      w/s     rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
  sdc        0.00       0.00  0.00  3478.00     0.00  890368.00     256.00      5.77   1.66   0.28  98.40

You see, without direct there are a whole lot of wrqm/s (= probably lots of wasted CPU cycles), and the average submitted size is still 143.98 < 256.0 (I suppose 143.98 is after the merges, correct?). With direct there are no wrqm/s, and the submitted request size is 256 sectors exactly.

With oflag=direct, performance increases with increasing bs, like this (3ware 9650SE-16ML hardware RAID-0, 256K chunk size, 14 disks [1TB 7200RPM SATA]):

  bs     - speed
  512B   -   4.9 MB/sec
  1K     -  13.3 MB/sec
  2K     -  26.6 MB/sec
  4K     -  54.1 MB/sec
  8K     -    96 MB/sec
  16K    -   157 MB/sec
  32K    -   231 MB/sec
  64K    -   300 MB/sec
  128K   -   359 MB/sec   (from this point on avgrq-sz does not increase any more, but performance still does)
  256K   -   404 MB/sec
  512K   -   430 MB/sec
  1M     -   456 MB/sec
  2M     -   466 MB/sec
  4M     -   473 MB/sec
  3584K  -   494 MB/sec   (the stripe size)
  8M     -   542 MB/sec   !! A big performance jump !!
  16M    -   543 MB/sec
  32M    -   568 MB/sec   ! Another big performance jump
  64M    -   603 MB/sec   ! Again !! (CPU here: real 0m11.213s, user 0m0.004s, sys 0m3.880s)
  128M   -   641 MB/sec
  256M   -   676 MB/sec
  512M   -   645 MB/sec   (performance starts dropping)
  1G     -   620 MB/sec

avgrq-sz apparently cannot go over 256 sectors. Is this a hardware limit imposed by the device, i.e. by the 3ware? Notwithstanding this, performance still increases up to bs=256M. From iostat the only apparent change (apart from the obviously increasing wsec/s) is avgqu-sz, which stays at 1.0 up to bs=128K and then rises to about 20.0 at bs=256M. Do you think this can be the reason for the performance increase up to 256M?

Thanks for any thoughts.
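In case anyone wants to reproduce the sweep, it was essentially a loop like this (a rough sketch; it assumes /dev/sdc is the array and that destroying its contents is acceptable):

  #!/bin/sh
  # write ~6 GiB at each block size with O_DIRECT and keep dd's throughput line
  total=$((6 * 1024 * 1024 * 1024))
  for bs in 4096 65536 1048576 8388608 67108864 268435456; do
      sync; echo 3 > /proc/sys/vm/drop_caches
      echo "bs=$bs:"
      dd if=/dev/zero of=/dev/sdc bs=$bs count=$((total / bs)) oflag=direct 2>&1 | tail -n 1
  done

As for the 256-sector ceiling, one thing I could still check (just a guess, not a known answer) is the block queue's request-size limit:

  # maximum request size the kernel builds for this queue, in KiB;
  # 128 KiB would correspond exactly to the 256-sector avgrq-sz ceiling
  cat /sys/block/sdc/queue/max_sectors_kb /sys/block/sdc/queue/max_hw_sectors_kb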