Old October 10th 04, 06:32 AM
kony
external usenet poster
 
A7N8X Motherboard Low Temperature Sensitivity, CMOS Checksum Error

Followups-To alt.comp.hardware

A7N8X Motherboard Low Temperature Sensitivity, CMOS Checksum
Error



SHORT VERSION:

If room ambient temp drops below 25C, system is so
unstable it can't even complete a POST. If system is
running when room temp drops, it starts to act erratically
(odd Explorer pauses and Prime95 errors). Powering off, then
back on a dozen seconds later, fails, resulting in only
"CMOS Checksum Error" and an automatic boot to floppy,
running awdflash.

Once ambient temp rises to about 32C, system always POSTs
and runs fine. In between these temps, failures to get
beyond "CMOS Checksum Error" and errors in Windows go up in
frequency as temp drops. Multiple troubleshooting attempts
have been made (Clear CMOS, BIOS flash, swap hardware,
remount board in case, strip down system, etc); the problem
appears isolated to the motherboard itself. Different BIOS
versions, BIOS defaults, etc, have been tried. System is not
overclocked.

What are the potential cause(s) and the best methods to
check these?




LONG VERSION:


A7N8X Deluxe Motherboard rev 2.00
Socket A, nForce2-400
Current BIOS 1008
Athlon XP2400
512MB Kingston PC3200
Fortron 400W Power Supply
ATI AIW 128 Pro
1 HDD, CDROM, Floppy, (typical non-power-user PC)

Approximately 1 year old, system remained unchanged for that
period of time and had (guessing) about 800 hours of
on-time, it was not used very much, and AFAIK, nothing
demanding, it had an easy life so far. All settings were
conservative, default values, no overclocking and minimal
BIOS changes.

Board appears to be very sensitive to temperature, but not
too hot, rather too cold.
If case thermometer (not motherboard integrated temp sensor
but a separate digital probe) reads below approx. 27C, then
powering on system from soft-off state results in system
displaying the following message:

"CMOS Checksum Error"

Then system proceeds to do an Award BIOS recovery by booting
to awdflash if appropriate floppy is already in the system.
There is no option to do anything else, the BIOS setup is
not accessible and the attempt to boot floppy is automatic,
not a normal "boot to floppy" event as would occur with any
normally-working system that has a boot floppy in it.
Clearing CMOS and/or loading setup defaults does not resolve
this.

If the ambient temp is right at 25C or slightly higher,
multiple attempts at powering off, then on, will result in
system posting with setup defaults for FSB & multiplier,
successfully doing so at a rate of roughly 1 in 5 tries,
more often as room temp rises, less as temp falls. Even if
system is manually set to same speeds or very low speed (6X
multiplier for CPU and 100MHz FSB and memory), after saving
these changes (or not) the system will not reliably POST
again; it takes several tries to get system to POST, again
displaying "CMOS Checksum Error" each time. Unplugging
power supply from AC had no effect, nor did clearing CMOS.
I've not pinned down the EXACT temp, since it gets
progressively worse and the range is fairly tight, but it
roughly corresponds to 25C-32C being fail-pass thresholds,
certainly within 10C temp rise it goes from unable to POST
to working fine. This has been deliberately reproduced
(later) by cranking up an air conditioner, it is clearly
low-temperature related, but first the steps prior to this
conclusion...
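
One way to pin down the fail-pass range more exactly is to log each power-on attempt together with the case thermometer reading and tally the POST success rate per temperature bin. The following is only a minimal sketch of that bookkeeping. The function name `pass_rate_by_bin` is hypothetical, and the readings in `example_trials` are made up for illustration; they are not measurements from this board.

```python
from collections import defaultdict


def pass_rate_by_bin(trials, bin_width=1.0):
    """Tally POST attempts per temperature bin.

    trials: iterable of (ambient_temp_c, posted) pairs, where posted is
    True when the board got past "CMOS Checksum Error".
    Returns {bin_start_c: (passes, attempts, pass_rate)}, coldest bin first.
    """
    counts = defaultdict(lambda: [0, 0])  # bin_start -> [passes, attempts]
    for temp_c, posted in trials:
        bin_start = (temp_c // bin_width) * bin_width
        counts[bin_start][1] += 1
        if posted:
            counts[bin_start][0] += 1
    return {b: (p, n, p / n) for b, (p, n) in sorted(counts.items())}


# Made-up readings for illustration only, NOT measurements from this board.
example_trials = [
    (24.5, False), (24.8, False),
    (25.3, False), (25.6, True), (25.9, False),
    (28.4, True), (28.7, False),
    (32.1, True), (32.5, True),
]

for bin_start, (passes, attempts, rate) in pass_rate_by_bin(example_trials).items():
    print(f"{bin_start:5.1f}C  {passes}/{attempts} POSTs  ({rate:.0%})")
```

With enough attempts per bin, the temperature at which the rate climbs from roughly 1-in-5 to always passing should show up directly in the table.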

It does sometimes POST after saving the changes, but if then
powered off it may not POST the next time... seems to still
be marginal regardless of the BIOS settings, since even
loading setup defaults and clearing CMOS didn't change the
roughly 1-in-5 success rate. Every time it fails, the
video does come up and it does attempt to boot to floppy, it
never just acts dead, always has video display. Every
common troubleshooting procedure I could think of was tried
to no avail. Certainly more than mentioned here but due to
the length of the post I'll try to list what seems most
relevant.

Initially suspecting BIOS corruption, I'd flashed the board
with same BIOS version, which at the time seemed to work,
but later it was discovered that the difference was instead
that the ambient temperature was higher than previously,
because soon enough the temp had dropped and system again
failed to do anything more than "CMOS Checksum Error". The
next time I had a chance I flashed the newest BIOS
version, with no change. I'd been suspecting the often
rumored "nvidia BIOS corruption" problem, which seems to
occur from exiting the BIOS too quickly when saving
settings, but this was not the case with this board.

Later it was noted that doing NOTHING to system other than
leaving it sit until ambient temp rose, would return system
to 100% stable state. Even overclocking quite a bit it
passed several different stress tests at 32C room temp, but
once temp falls again, still not stable even at 6 x 100. I
could understand if these were arctic conditions, but such
a drastic change within a span of 10C seems quite unusual.
Indeed, several other systems in same room do work fine at
same temp. Also notable is that if the temp is barely high
enough to get it to post and boot, running a stress test
like Prime95 results in errors within a minute or so, yet
with ambient temp 10C higher the system not only passes same
Prime95 test for 24 hours, but can even pass it running at
50% higher FSB, Memory clock, and CPU frequency.

Trying to isolate the problem I'd changed power supplies 3
times with known/proven good 400W+ units, unplugged
nonessential cards, swapped video and memory, ran in minimal
configuration and checked every mechanical connection as
well as possible (including pulling/inspecting/reinstalling
the EEPROM and jumpers), all with known good/working parts.
It seems that the motherboard itself is simply intolerant of
quite mild temperature drop. Normally I'd just replace it
but this is quite puzzling, unique for such a small temp
span, and I'd like to get to the bottom of it. There are no
visible problems with the board: capacitors look fine, with no
visible cracking or other physical abnormalities, though I
don't have the means to check this with a microscope,
especially since a motherboard is a bit wider than most
'scope's reach. Since this is a very popular motherboard
and I've not heard of anyone else having this problem (or
perhaps they just didn't isolate the cause as low temp?) I
wonder if this is an isolated flaw, but the closest
examination I could make showed nothing unusual and it does
work fine, never this problem (or any other that I'm aware
of) once room temp rises by 5-10C. The system case is very
well ventilated, ambient room temp never causes interior air
temp to go up much except immediately adjacent to heatsink,
as expected.

I thought about the battery but voltage on it reads OK and
it shouldn't explain the instability after booting and
running Windows. I don't recall if I ended up putting a
different battery in it or not but will do so just for the
heck of it.

One possibility I'm wondering about is whether the ESR of one
or more of the capacitors is rising as the temp falls,
if one or more are marginal and this is the cause. It might
be a bit difficult to check this easily, though. I'd thought
about possibly touching a small light bulb to each in turn,
individually warming them to see if that made any
difference, but that could take quite a long time,
especially if multiple caps are involved, since it could be
necessary to wait till each cooled before trying to isolate
the next cap. It also seems difficult to determine their core temp
without getting the outside can quite hot, as any
non-destructive temp reading would be of the outer can. It
seems a rather crude way of warming them too but I'm drawing
a blank as to how to individually warm a capacitor without
also warming the surrounding area, or at least minimizing
that as much as reasonably possible. I could instead touch
the leads with a soldering iron but would prefer to leave the
solder alone if possible.
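
To put rough numbers on how long the one-cap-at-a-time warming test could take: since the board already POSTs roughly 1 in 5 tries near 25C, a single successful POST after warming a cap proves little, so each cap needs several consecutive passes. The sketch below is only an estimate under stated assumptions. It treats power-on attempts as independent (which may not hold), and the cap count and per-step minutes are hypothetical placeholders, not figures for this board. Both function names are made up for illustration.

```python
import math


def consecutive_passes_needed(baseline_pass_rate, max_fluke_chance):
    """Smallest k such that baseline_pass_rate**k <= max_fluke_chance.

    Assumes each power-on attempt is independent of the others.
    """
    if not (0 < baseline_pass_rate < 1 and 0 < max_fluke_chance < 1):
        raise ValueError("both arguments must be strictly between 0 and 1")
    return math.ceil(math.log(max_fluke_chance) / math.log(baseline_pass_rate))


def total_minutes(num_caps, warm_min, attempt_min, cool_min, attempts_per_cap):
    """Time to warm, test, and cool every capacitor one at a time."""
    return num_caps * (warm_min + attempts_per_cap * attempt_min + cool_min)


# Baseline of 1-in-5 comes from the post; the 1% fluke limit is an assumption.
k = consecutive_passes_needed(0.2, 0.01)

# Hypothetical placeholders: 30 caps, 3 min warming, 1 min per power-on
# attempt, 10 min cooling before moving to the next cap.
minutes = total_minutes(30, 3, 1, 10, k)
print(f"{k} consecutive POSTs per cap, about {minutes / 60:.1f} hours total")
```

Under those placeholder numbers it comes to 3 consecutive POSTs per cap and about 8 hours in total, which supports the worry that this approach is slow.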

I suppose I could take the opposite approach and warm up the
board then use freeze spray on each cap, but that doesn't
seem a very good approach either, since it might easily
(probably would) lower the cap temp too much, introducing
further failures that aren't present at 27C, not until much
colder, and it again seems difficult to thoroughly chill the
core of individual caps without changing surrounding area
temp by this small 10C thermal margin. Another possibility
I'd considered is temporarily placing a tantalum cap in
parallel with each suspect cap (as many as possible), since
tantalums should be much more tolerant of low temp (IIRC),
but this also seems to be a lengthy, tedious process that
would best be avoided if anyone has a better idea.


Even if I don't solve this problem I wanted to at least get
this bit of info out there: at the very least this one
board is affected by a relatively small temp change. Due
to the type of problem I wonder if it's more frequent; a
CMOS Checksum Error is not all that uncommon and "some" of
the occurrences of a Checksum Error might be misdiagnosed...
and some boards not even getting far enough to post "CMOS
Checksum Error" might do so if they were a little warmer.