#1
interfilesystem copies: large du diffs
I recently rsync'd around 2.8TB between a RHEL server (jfs filesystem) and a NetApp system, then did a 'du -sk' against each to verify the transfers:

2894932960 KB, source total
2751664496 KB, destination total

That's a 140GB discrepancy. Subsequent verbose rsyncs have turned up nothing that was not originally transferred. I often notice similar behaviour with smaller transfers between servers with similar OS/filesystem combinations, and have always seen it to some extent with transfers between systems of any type; it's just that the usual discrepancies are magnified in this case by the sheer volume of data. Needless to say, 140GB going missing would be a bit of a problem, and it's not much fun picking through 2.8TB for MIA data. Can anyone shed some light on why this happens?

tia
#2
orgone wrote:
> I recently rsync'd around 2.8TB between a RHEL server (jfs filesystem) and a
> NetApp system, then did a 'du -sk' against each to verify the transfers:
>
> 2894932960 KB, source total
> 2751664496 KB, destination total

"df" uses actual blocks allocated. "du" takes the file size and concludes that all blocks are allocated.

> That's a 140GB discrepancy. Subsequent verbose rsyncs have turned up
> nothing that was not originally transferred. [...] Can anyone shed some
> light on why this happens?

My best guess is that the NetApp somehow handles sparsely allocated files differently, so that "du" sees the blocks actually allocated rather than just the file size taken from the address of the last byte.

An alternate theory that is far less likely: on your source tree you have a history of creating hundreds of thousands of files and then deleting nearly all of them, leaving a lot of very large directories, while on your target tree the directories are much smaller.

Yet another alternate theory: a smaller block/fragment/extent size on the target. On the source, every file would then have a fairly large minimum block count, while on the target smaller files take fewer blocks. You would need very many small files to account for a ~5% difference, but a few hundred thousand files under 512 bytes would do it.
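The sparse-file guess is easy to test locally. Here is a minimal sketch, assuming GNU coreutils (truncate, stat -c, du): the file's apparent size is 100MB, yet almost no blocks are allocated, so 'du' and 'ls -l' disagree wildly.

```shell
#!/bin/sh
# Create a sparse file: 100MB apparent size, but no data blocks written.
set -e
f=$(mktemp)
truncate -s 100M "$f"

apparent=$(stat -c %s "$f")        # bytes, from the file-size field
allocated=$(du -k "$f" | cut -f1)  # kilobytes actually allocated on disk

echo "apparent=${apparent} bytes, allocated=${allocated} KB"
rm -f "$f"
```

Copy such a file with a tool that does not preserve holes and 'du' on the destination reports the full 100MB. GNU du's --apparent-size switch reports the file-size number instead of blocks, which is handy when comparing trees across filesystems with different allocation behaviour.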
#3
In article .com, orgone wrote:
> I recently rsync'd around 2.8TB between a RHEL server (jfs filesystem) and a
> NetApp system, then did a 'du -sk' against each to verify the transfers:
>
> 2894932960 KB, source total
> 2751664496 KB, destination total
>
> That's a 140GB discrepancy. [...] Can anyone shed some light on why this
> happens?

First, this is only a 5% difference; I could easily imagine the difference being much larger. The du command (and the underlying st_blocks field in the result of the stat() system call) reports the amount of space used. But:

- A filesystem uses space not only for the data component (the bytes stored in the files), but also for overhead: directories, per-file overhead like inodes and indirect blocks, and more, often referred to as metadata. How efficiently this overhead is stored varies considerably by filesystem, and whether it is reported as part of the answer from du also varies; in some extreme cases (filesystems that separate their data and metadata physically) the overhead is not reported at all. The ratio of metadata to data varies considerably by filesystem type and by file/directory size, but for many small files 5% is not out of line.

- The amount of space allocated to a file typically has some granularity, often 4KB or 16KB (historically it has ranged from 128 bytes for the CP/M filesystem to 256KB for some filesystems used in high-performance computing).
This means the size of the file is rounded up to this granularity, which can make a huge difference if your files are typically small. Say your files are all 2KB, and you store them on one filesystem with a 512B allocation granularity and on another with a 16KB granularity: you'll get results from du that differ by a factor of 32!

- Are any of your files sparse? I think every commercial filesystem in mass production today supports sparse files, but exactly how can vary widely. What is the granularity of holes in the file? What is the metadata overhead for holes (in extent-based filesystems this can make a significant difference if implemented carelessly)? It is also quite possible (maybe even likely) that your rsync copying turned sparse files into contiguous files; but given that your total space usage shrank instead of increasing, that doesn't seem likely to be the main effect here.

- On the NetApp, did you have snapshots turned on? If yes, does the result from du include the snapshots?

- It isn't even completely clear what the result from du is supposed to be: the real disk usage, or the size of the file rounded to kilobytes? Here is a suggestion to stir the pot. Assume you have a 1MB file stored on a RAID-1 (mirrored) disk array. I think du should report the space usage as 2MB, because you are actually storing two copies of the file (you are using 2MB worth of disks). If you now migrate the file to a compressing filesystem that is not mirrored, du should report the space usage as 415KB, if that's how much disk space it really uses. No filesystem today would report those values; they would all report something pretty close to 1MB.

For you, my suggestion is this: instead of looking only at the total, make a complete list of the disk usage for each file. An easy way to do this from the command line: make two listings of space usage, one each for source and destination, merge the lists, and look at the differences.
Here is a quick attempt at a script which does this (just typed in, so you may have to debug it a little; it also assumes you don't have spaces in file names, and if you do, you'll have to do a lot of quoting and null-terminating):

cd $SOURCE
find . -type f | xargs du -k | sort -k2 > /tmp/source.du
cd $TARGET
find . -type f | xargs du -k | sort -k2 > /tmp/target.du
cd /tmp
join -j 2 source.du target.du > both.du
awk '{print $1, $3 - $2}' both.du | sort -k2 -n > diff.du

In the end you'll have a listing of the difference in space usage in diff.du, sorted (I hope; I can never remember whether the -n switch to sort works correctly for negative numbers). Then pick a few examples of files that have large differences, or see whether you can make out a trend (maybe most files have a small difference), and spot-check a few files to make sure they were copied correctly. You can also use "join -j 2 -v 1 source.du target.du" to find files that were not copied, and the same with "-v 2" to find files that showed up in the copy uninvited.

Now changing gears: speaking as a filesystem implementor (and somewhat of an expert), I would wish that the du command and the underlying information returned by the stat() system call would go away. On one hand they are just too crude, and don't begin to describe the complexity of space usage in a modern (complex) filesystem. On the other hand they don't give the answers that a system administrator (or an automated administration tool) really needs. As we saw above, for a 1GB file the correct answer for space usage might be any of (all the numbers are made up):

- 1GB worth of bytes.
- 1GB is the file size, but it is sparse, so it only uses 876MB.
- 1GB worth of bytes on the data disk, plus 7.4MB of metadata on the metadata disk.
- 2GB worth of bytes, because of RAID-1.
- 437MB worth of bytes, because of compression.
- 0.456GB on datadisk_123, 1.234GB on datadisk_456, and 2.345GB on datadisk_789, plus 7.4MB on metadisk_abc and 3.7MB on metadisk_def.
- 5.678GB on disk, because of RAID-1, asynchronous remote copy (still 0.3GB worth of copying to be done, currently held in NVRAM), and fourteen snapshot copies, all slightly different; not to mention that the remote copy is compressed, and this figure includes metadata overhead on the metadata disks.
- 4.567GB on expensive SCSI disks (at $3/GB plus $0.50/year/GB), and 1.234GB on cheap SATA disks (at $1/GB plus $0.25/year/GB).

As you see, returning one number is woefully inadequate. We need to ask ourselves: what is the purpose of the space usage information? It is not to verify that the filesystem has correctly stored the data (for that it is too crude); it is to enable administering the filesystem, so it needs to give the information a system administrator might care about. If I had my way (fortunately, nobody ever listens to me), I would remove the du command, completely remove all notions of space usage from the user-mode application API, and put all space usage information into a filesystem management interface. There, questions like the following need to be answered:

- How much space is user fred using (or files used by the wombat project, or files stored on storage device foobar)?
- Has fred's usage increased recently?
- How expensive is the storage used by fred? Original purchase, lease payments, yearly provisioning and administration cost?
- Are the wombat project's requirements for data availability being met, or could I improve them by allocating more space to it and storing more redundant copies of their data?
- If I move the wombat project to the NetApp, and then use the free space on the cluster filesystem to put fred's files on, would that save me money or increase speed or availability?
- Is the NetApp still a cost-effective device, given that we just started using the fancy new foobar device from Irish Baloney Machines with the new cluster filesystem from Hockey-Puckered?

(If it isn't clear: all mentions of the word "netapp" and oblique references to large computer companies are meant as humor, and are intended to neither praise nor denigrate my current, former or future employers.)

--
Ralph Becker-Szendy
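The per-file comparison from the post above can be made tolerant of spaces in file names by keeping the pipeline null-terminated up to the point where du emits tab-separated records. A sketch, assuming GNU find/du/awk; the tiny demo trees created here are stand-ins, so point SOURCE and TARGET at your real trees instead:

```shell
#!/bin/sh
# Per-file space comparison of two trees, tolerant of spaces in names.
# The demo trees below are stand-ins for the real SOURCE and TARGET.
set -e
SOURCE=$(mktemp -d); TARGET=$(mktemp -d)
echo "same content" > "$SOURCE/a file.txt"
echo "same content" > "$TARGET/a file.txt"
echo "not copied"   > "$SOURCE/extra.txt"

# du -k --files0-from=- reads null-terminated names, emits "KB<TAB>name".
( cd "$SOURCE" && find . -type f -print0 | du -k --files0-from=- ) > /tmp/source.du
( cd "$TARGET" && find . -type f -print0 | du -k --files0-from=- ) > /tmp/target.du

# Key on the file name (field 2): print per-file KB differences and
# files that exist on only one side.
awk -F'\t' '
    NR == FNR { src[$2] = $1; next }       # pass 1: source sizes by name
    {
        if ($2 in src) {
            if ($1 != src[$2]) print ($1 - src[$2]) "\t" $2
            delete src[$2]
        } else { print "only-in-target\t" $2 }
    }
    END { for (f in src) print "only-in-source\t" f }
' /tmp/source.du /tmp/target.du > /tmp/diff.du

cat /tmp/diff.du
rm -rf "$SOURCE" "$TARGET"
```

On a real 2.8TB tree this yields one line per mismatched or one-sided file; sorting /tmp/diff.du numerically then points you at the biggest offenders first.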
#4
On 24 Aug 2005 02:08:46 -0700, orgone said something similar to:
: I recently rsync'd around 2.8TB between a RHEL server (jfs filesystem) and a
: NetApp system, then did a 'du -sk' against each to verify the transfers:
:
: 2894932960 KB, source total
: 2751664496 KB, destination total
:
: That's a 140GB discrepancy. Subsequent verbose rsyncs have turned up
: nothing that was not originally transferred.

What are the native block sizes of the two filesystems? If you've got a large enough number of files and directories, a smaller block size on the destination could account for the discrepancy, in terms of less unused space at the end of the last block of each file.

Another thing I've seen cause discrepancies like this on occasion is when the source directories once held many more files than they currently do. Once blocks have been allocated to a directory, they don't get deallocated when the number of files drops.
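The directory effect can be watched directly. A minimal sketch, assuming a filesystem such as ext3/ext4 where directory blocks are not released when entries are removed; the exact sizes reported are filesystem-dependent:

```shell
#!/bin/sh
# Watch a directory's own size as entries are added and then removed.
# On ext3/ext4 the "after" size typically stays at the "full" size;
# other filesystems may shrink it back down.
set -e
d=$(mktemp -d)
empty=$(stat -c %s "$d")

i=0
while [ "$i" -lt 2000 ]; do    # enough entries to force extra dir blocks
    : > "$d/f$i"
    i=$((i + 1))
done
full=$(stat -c %s "$d")

rm -f "$d"/f*
after=$(stat -c %s "$d")

echo "empty=$empty full=$full after=$after"
rmdir "$d"
```

If "after" stays well above "empty" on your source filesystem, a tree with a history of mass file deletion will show a larger du total there than a fresh copy of the same tree does on the destination.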
#5
orgone wrote:
> I recently rsync'd around 2.8TB between a RHEL server (jfs filesystem) and a
> NetApp system. [...] Needless to say, 140GB going missing would be a bit of
> a problem and it's not much fun picking through 2.8TB for MIA data.

Rsync has a "-c" option for comparing checksums; I imagine that would give some reassurance that the transfer occurred correctly. There is also the "-v" verbose option, as you noted. To be certain, I'd consider checksumming all the files on each system, e.g. something like

find mydirectory -type f -exec sum {} \; > sysname.sums

and using diff to compare the results. If really paranoid, I'd use md5sum instead of sum. I imagine this will take considerable time on 2.8TB, so I'd try it on small subsets first :-)
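That checksum-listing idea, sketched with md5sum and a sorted, null-safe pipeline; the demo tree here is a stand-in for the real directory, and on the real systems you would run the same pipeline on source and destination and diff the two listings:

```shell
#!/bin/sh
# Build a sorted checksum listing for a tree. Run the same pipeline on
# both systems, then 'diff source.sums dest.sums': any output at all
# means a file differs, is missing, or is extra.
set -e
tree=$(mktemp -d)                 # stand-in for the real directory
echo "payload one" > "$tree/one.txt"
echo "payload two" > "$tree/two.txt"

( cd "$tree" && find . -type f -print0 | sort -z | xargs -0 md5sum ) > /tmp/demo.sums

cat /tmp/demo.sums
rm -rf "$tree"
```

Sorting the name list before checksumming keeps the two listings in the same order, so diff lines up corresponding files even when find traverses the trees in different orders.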