Disk to disk copying with overclocked memory

#11 March 11th 04, 01:21 PM

In comp.sys.ibm.pc.hardware.storage CBFalconer wrote:
Colin Painter wrote:

If I can add a bit to JT's reply...

If you are overclocking your memory you risk getting more errors
than the guys who built the memory planned on. If the memory is
not ECC memory then you may get more single bit errors which will
cause your machine to stop when they occur. ECC memory can
correct single bit errors but non-ECC memory can only detect them
and when that happens windows will blue screen. Most home PCs
have non-ECC memory because it's cheaper.

Correction here - non ECC memory won't even detect any errors, it
will just use the wrong value. Sometimes that MAY cause the OS to
crash. Unfortunately the rest of the thread is lost due to
top-posting.

Crashes are not your worst enemy. Undetected data corruption is.

I once debugged a fileserver that flipped one bit on average per
2GB read or written. This thing had been used in this condition for
several months by several people on a daily basis. Then one person
noticed that he sometimes got a corrupted archive (a large file)
when reading it, and sometimes not. There were likely quite
a few changed files on disk at that time. If you have files that
react badly to changed bits, that is a disaster.

The solution was just to set the memory timing more conservatively.
I made it two steps slower, without noticeable impact on performance.

Note on ECC: If you get very few single-bit errors without
ECC active, ECC will likely solve your problem. If you get a lot of
single-bit errors, or even only very few multiple-bit errors, then
ECC will not really help and will let errors through. For my scenario
(single, random bit every 2GB), ECC would have done fine.
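
To put the one-bit-per-2GB figure in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes flips are independent and spread evenly over the data moved (a Poisson model, which is an assumption and not something measured in the post). The file sizes and function names are illustrative only.

```python
import math

# One flipped bit per 2 GB read or written, the rate quoted in the post.
BYTES_PER_FLIP = 2 * 1024**3


def expected_flips(bytes_transferred, bytes_per_flip=BYTES_PER_FLIP):
    """Mean number of flipped bits for a given amount of data moved."""
    return bytes_transferred / bytes_per_flip


def p_at_least_one_flip(bytes_transferred, bytes_per_flip=BYTES_PER_FLIP):
    """Poisson estimate of the chance that at least one bit flips."""
    return 1.0 - math.exp(-expected_flips(bytes_transferred, bytes_per_flip))


if __name__ == "__main__":
    # Illustrative sizes, not taken from the thread.
    for label, size in [("700 MB CD image", 700 * 1024**2),
                        ("2 GB archive", 2 * 1024**3),
                        ("40 GB partition", 40 * 1024**3)]:
        print(f"{label}: mean flips {expected_flips(size):.2f}, "
              f"P(>=1 flip) {p_at_least_one_flip(size):.1%}")
```

Under this model a small file usually survives, while a whole-partition copy is almost certain to pick up at least one changed bit, which fits the "sometimes corrupted, sometimes not" behaviour seen with the large archive.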

Arno
--
For email address: lastname AT tik DOT ee DOT ethz DOT ch
GnuPG: ID:1E25338F FP:0C30 5782 9D93 F785 E79C 0296 797F 6B50 1E25 338F
"The more corrupt the state, the more numerous the laws" - Tacitus

#12 March 11th 04, 01:43 PM

Arno Wagner wrote:

In comp.sys.ibm.pc.hardware.storage CBFalconer
wrote:
Colin Painter wrote:

If I can add a bit to JT's reply...

If you are overclocking your memory you risk getting more errors
than the guys who built the memory planned on. If the memory is
not ECC memory then you may get more single bit errors which will
cause your machine to stop when they occur. ECC memory can
correct single bit errors but non-ECC memory can only detect them
and when that happens windows will blue screen. Most home PCs
have non-ECC memory because it's cheaper.

Correction here - non ECC memory won't even detect any errors, it
will just use the wrong value. Sometimes that MAY cause the OS to
crash. Unfortunately the rest of the thread is lost due to
top-posting.

Crashes are not your worst enemy. Undetected data corruption is.

I once debugged a fileserver that flipped one bit on average per
2GB read or written. This thing had been used in this condition for
several months by several people on a daily basis. Then one person
noticed that he sometimes got a corrupted archive (a large file)
when reading it, and sometimes not. There were likely quite
a few changed files on disk at that time. If you have files that
react badly to changed bits, that is a disaster.

The solution was just to set the memory timing more conservatively.
I made it two steps slower, without noticeable impact on performance.

Note on ECC: If you get very few single-bit errors without
ECC active, ECC will likely solve your problem. If you get a lot of
single-bit errors, or even only very few multiple-bit errors, then
ECC will not really help and will let errors through. For my scenario
(single, random bit every 2GB), ECC would have done fine.

The ECC implemented on PCs can typically correct 1-bit errors and detect
2-bit errors.
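
A toy illustration of the "correct 1-bit, detect 2-bit" behaviour (usually called SECDED). This is an extended Hamming (8,4) code protecting a 4-bit value, picked only because it is short. It shows the principle and is not a description of how any particular chipset implements ECC. The function names are made up for this sketch.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    # Index 0 holds the overall parity bit; indexes 1..7 use the classic
    # Hamming(7,4) layout: p1 p2 d0 p4 d1 d2 d3.
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total number of 1s even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    # XOR of the positions of all set bits is 0 for a valid codeword and
    # equals the position of the flipped bit after a single error.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2  # 0 while the overall parity still holds
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: treat it as one and repair it
        # (syndrome 0 here means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity looks fine but the syndrome is non-zero: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

With three or more flipped bits in one word, a code like this can "correct" to the wrong value without noticing, which is the case described earlier in the thread where ECC lets errors through.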

One machine I worked with came up with a parity error one day. It was about
a week old at the time so I sent it back to the distributor, who, being one
of these little hole-in-the-wall places and not Tech Data or the like,
had one of his high-school dropout techs "fix" it instead of swapping the
machine or the board. The machine came back sans parity
error. It ran fine for a while, then I started getting complaints of data
corruption. I finally tracked it down to a bad bit in the memory. Sure
enough, the guy had "fixed" it by disabling parity. Should have sued.

This is one of the pernicious notions surrounding the testing of PCs--the
notion that the only possible failure mode is a hang, totally ignoring the
possibility that there will be data corruption that does not cause a hang,
at least not of the machine, although it may cause the tech to be hung by
the users.

But if you're getting regular errors then regardless of the kind of memory
you're using something is broken. Even with ECC if you're getting errors
reported in the log you should find out why and fix the problem rather than
just trusting the ECC--ECC is like RAID--it lets you run a busted machine
without losing data--doesn't mean that the machine isn't busted and doesn't
need fixing.

Arno

--
--John
Reply to jclarke at ae tee tee global dot net
(was jclarke at eye bee em dot net)

#13 March 11th 04, 03:53 PM

I've had an MB which occasionally corrupted bit 0x80000000, but only during
disk I/O! And the corrupted bit position was unrelated to the I/O buffers! Of
course, a standalone memory test didn't find anything. I had to modify the
test to make it run under Windows and also run parallel disk I/O threads. In
that mode, the failure was detected within a minute. Had to dump the MB.
Replacing the memory and CPU didn't help.
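
The general shape of "verify memory patterns while disk I/O runs in parallel" can be sketched as below. This is not the tool mentioned in the post; it is a minimal Python illustration with made-up names and sizes. A real tester would run native code, control which physical pages it touches, and run for much longer. Reads of the temp file may also be served from the OS cache rather than the disk.

```python
import os
import tempfile
import threading


def disk_io_worker(stop, chunk=1 << 20):
    """Generate disk traffic: write a chunk to a temp file and read it back until stopped."""
    data = os.urandom(chunk)
    with tempfile.TemporaryFile() as f:
        while not stop.is_set():
            f.seek(0)
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # push the data out to the device
            f.seek(0)
            f.read(chunk)


def memory_test_under_io(buf_size=8 << 20, passes=4, io_threads=2):
    """Fill a buffer with test patterns and verify them while disk I/O runs.

    Returns the number of mismatching bytes seen (0 on healthy hardware).
    """
    stop = threading.Event()
    workers = [threading.Thread(target=disk_io_worker, args=(stop,))
               for _ in range(io_threads)]
    for w in workers:
        w.start()
    errors = 0
    try:
        for i in range(passes):
            # Alternate 01010101 and 10101010 so every bit is tested both ways.
            pattern = 0x55 if i % 2 == 0 else 0xAA
            buf = bytearray([pattern]) * buf_size
            expected = bytes([pattern]) * buf_size
            if buf != expected:
                errors += sum(a != b for a, b in zip(buf, expected))
    finally:
        stop.set()
        for w in workers:
            w.join()
    return errors


if __name__ == "__main__":
    print("mismatching bytes:", memory_test_under_io())
```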

"Arno Wagner" wrote in message
...
Crashes are not your worst enemy. Undetected data corruption is.

I once debugged a fileserver that flipped one bit on average per
2GB read or written. This thing had been used in this condition for
several months by several people on a daily basis. Then one person
noticed that he sometimes got a corrupted archive (a large file)
when reading it, and sometimes not. There were likely quite
a few changed files on disk at that time. If you have files that
react badly to changed bits, that is a disaster.

The solution was just to set the memory timing more conservatively.
I made it two steps slower, without noticeable impact on performance.

#14 March 11th 04, 04:10 PM

"J. Clarke" wrote:
Arno Wagner wrote:
CBFalconer wrote:
Colin Painter wrote:

If I can add a bit to JT's reply...

If you are overclocking your memory you risk getting more errors
than the guys who built the memory planned on. If the memory is
not ECC memory then you may get more single bit errors which will
cause your machine to stop when they occur. ECC memory can
correct single bit errors but non-ECC memory can only detect them
and when that happens windows will blue screen. Most home PCs
have non-ECC memory because it's cheaper.

Correction here - non ECC memory won't even detect any errors, it
will just use the wrong value. Sometimes that MAY cause the OS to
crash. Unfortunately the rest of the thread is lost due to
top-posting.

Crashes are not your worst enemy. Undetected data corruption is.

I once debugged a fileserver that flipped one bit on average per
2GB read or written. This thing had been used in this condition for
several months by several people on a daily basis. Then one person
noticed that he sometimes got a corrupted archive (a large file)
when reading it, and sometimes not. There were likely quite
a few changed files on disk at that time. If you have files that
react badly to changed bits, that is a disaster.

The solution was just to set the memory timing more conservatively.
I made it two steps slower, without noticeable impact on performance.

Note on ECC: If you get very few single-bit errors without
ECC active, ECC will likely solve your problem. If you get a lot of
single-bit errors, or even only very few multiple-bit errors, then
ECC will not really help and will let errors through. For my scenario
(single, random bit every 2GB), ECC would have done fine.

The ECC implemented on PCs can typically correct 1-bit errors and
detect 2-bit errors.

One machine I worked with came up with a parity error one day. It
was about a week old at the time so I sent it back to the distributor,
who, being one of these little hole-in-the-wall places and not Tech
Data or the like, had one of his high-school dropout techs "fix" it
instead of swapping the machine or the board. The
machine came back sans parity error. It ran fine for a while, then
I started getting complaints of data corruption. I finally tracked it
down to a bad bit in the memory. Sure enough, the guy had "fixed"
it by disabling parity. Should have sued.

This is one of the pernicious notions surrounding the testing of
PCs--the notion that the only possible failure mode is a hang,
totally ignoring the possibility that there will be data corruption
that does not cause a hang, at least not of the machine, although
it may cause the tech to be hung by the users.

But if you're getting regular errors then regardless of the kind of
memory you're using something is broken. Even with ECC if you're
getting errors reported in the log you should find out why and fix
the problem rather than just trusting the ECC--ECC is like RAID--it
lets you run a busted machine without losing data--doesn't mean
that the machine isn't busted and doesn't need fixing.

Well, this is somewhat refreshing. Usually when I get on my horse
about having ECC memory I am greeted with a chorus of pooh-poohs,
and denials about sneaky soft failures, cosmic rays, useless
backups, etc. etc. In fact, walk into most computer stores and
start talking about ECC and you will be greeted with blank stares.

--
Chuck F ) )
Available for consulting/temporary embedded and systems.
http://cbfalconer.home.att.net USE worldnet address!

#15 March 11th 04, 05:24 PM

"CBFalconer" wrote in message
Colin Painter wrote:

If I can add a bit to JT's reply...

If you are overclocking your memory you risk getting more errors
than the guys who built the memory planned on. If the memory is
not ECC memory then you may get more single bit errors which will
cause your machine to stop when they occur. ECC memory can
correct single bit errors but non-ECC memory can only detect them
and when that happens windows will blue screen. Most home PCs
have non-ECC memory because it's cheaper.

Correction here - non ECC memory won't even detect any errors,
it will just use the wrong value. Sometimes that MAY cause the OS to
crash.

Unfortunately the rest of the thread is lost due to top-posting.

Bash topposters for topposting, not for your bad choice of News client
or your failure to set it up properly.

#16 March 11th 04, 08:02 PM

In comp.sys.ibm.pc.hardware.storage Alexander Grigoriev wrote:
I've had an MB which occasionally corrupted bit 0x80000000, but only during
disk I/O! And the corrupted bit position was unrelated to the I/O buffers! Of
course, a standalone memory test didn't find anything. I had to modify the
test to make it run under Windows and also run parallel disk I/O threads. In
that mode, the failure was detected within a minute. Had to dump the MB.
Replacing the memory and CPU didn't help.

Really nasty. Shows that these things have gotten far too complex...

Arno

--
For email address: lastname AT tik DOT ee DOT ethz DOT ch
GnuPG: ID:1E25338F FP:0C30 5782 9D93 F785 E79C 0296 797F 6B50 1E25 338F
"The more corrupt the state, the more numerous the laws" - Tacitus

#17 March 11th 04, 10:58 PM

"Alexander Grigoriev" wrote in message hlink.net...
I've had an MB, which occasionally corrupted bit 0x80000000, but only
during disk I/O!

And the corrupted bit position was unrelated to I/O buffers!

Meaning?

Of course, a standalone memory test didn't find anything. I had to modify the
test to make it run under Windows and also run parallel disk I/O threads.

What happened to that memory test? Last time I heard about it was when c't complained about you not supporting it anymore.

In that mode, the failure was detected in a minute. Had to dump the MB.
Replacing memory and CPU didn't help.

"Arno Wagner" wrote in message ...
Crashes are not your worst enemy. Undetected data corruption is.

I once debugged a fileserver that flipped one bit on average per
2GB read or written. This thing had been used in this condition for
several months by several people on a daily basis. Then one person
noticed that he sometimes got a corrupted archive (a large file)
when reading it, and sometimes not. There were likely quite
a few changed files on disk at that time. If you have files that
react badly to changed bits, that is a disaster.

The solution was just to set the memory timing more conservatively.
I made it two steps slower, without noticeable impact on performance.

#18 March 12th 04, 12:49 AM

If you've overclocked the RAM, yes, there is a chance of getting timing
errors in the data which will lead to data corruption.

--
DaveW

"Mark M" wrote in message
...
I use a partition copier which boots off a floppy disk before any
other OS is launched.

If I copy a partition from one hard drive to another, then is there
any risk of data corruption if the BIOS has been changed to
aggressively speed up the memory settings?

For example the BIOS might set the memory to CAS=2 rather than
CAS=3. Or other memory timing intervals might also be set to be
shorter than is normal.

I am thinking that maybe the IDE cable and drive controllers handle
data fairly independently of the memory on the motherboard. So
maybe data just flows up and down the IDE cable and maybe the
motherboard is not involved except for sync pulses.

There are three scenarios I am thinking about:

(1) Copying a partition from one hard drive on one IDE cable to
another hard drive on a different IDE cable.

(2) Copying a partition from one hard drive to another which is on
the same IDE cable.

(3) Copying one partition to another on the same hard drive.

How much effect would "over-set" memory have on these situations?

Do the answers to any of the above three scenarios change if the
copying of large amounts of data files is done from within WinXP?
Personally, I would guess that it is more likely that motherboard
memory comes into play if Windows is involved.
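
Whichever of the three scenarios applies, one way to check a copy afterwards is to hash the source and the destination and compare. A minimal Python sketch follows; the function names and any paths passed in are hypothetical. It assumes both sides can be read as ordinary files or block devices of the same length. Repeating the hash matters with flaky RAM, because a digest that changes between rounds points at the memory path rather than the disk. Note that repeat reads may come from the OS cache instead of the drive.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """SHA-256 of a file or block device, read in fixed-size chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def copy_matches(src, dst, rounds=2):
    """Hash source and destination several times.

    Returns True only if every digest from every round is identical.
    """
    digests = set()
    for _ in range(rounds):
        digests.add(sha256_of(src))
        digests.add(sha256_of(dst))
    return len(digests) == 1
```

A matching result on an overclocked machine is only weak evidence, since the same unreliable memory did the checking. Re-running the comparison with conservative timings is the more convincing test.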

#19 March 12th 04, 01:28 AM

On Thu, 11 Mar 2004 00:40:47 GMT, Mark M
wrote:

I use a partition copier which boots off a floppy disk before any
other OS is launched.

If I copy a partition from one hard drive to another, then is there
any risk of data corruption if the BIOS has been changed to
aggressively speed up the memory settings?

Yes, if the machine was OC-ed too much to be really 100% rock solid &
tested to be TL stable at those settings ...

-- Regards, SPAJKY ®
& visit my site @ http://www.spajky.vze.com
"Tualatin OC-ed / BX-Slot1 / inaudible setup!"
E-mail AntiSpam: remove ##

#20 March 12th 04, 05:30 AM

"Folkert Rienstra" wrote in message
...

"Alexander Grigoriev" wrote in message
hlink.net...
I've had an MB, which occasionally corrupted bit 0x80000000, but only
during disk I/O!

And the corrupted bit position was unrelated to I/O buffers!

Meaning?

That it was not corrupted in transit from/to disk. For example, it hit memory
allocated from the kernel non-paged pool. One time I caught the crash with a
debugger, and it was a corrupted KEVENT structure.

Of course, a standalone memory test didn't find anything. I had to modify the
test to make it run under Windows and also run parallel disk I/O threads.

What happened to that memory test? Last time I heard about it was when c't
complained about you not supporting it anymore.

Version 2 is available on
http://home.earthlink.net/~alegr/download/memtest.htm

Previous location at www.aha.ru is offline, AFAIK.
