Compaq Proliant 5500 Power Supply question & spontaneous reboot problem..

#11 August 24th 05, 06:28 PM

In article , Nut Cracker wrote:
memtest86 ... you boot from the floppy, fire up the program, and come back
in 2 or 3 days when it's finished its testing loops.

Egads.. I guess my mail will be out of commission for a while (I use this box
as my mail server+db+apache/php, etc) while this test runs..

As a side note, this is a great test to run on my parents' new Dell PC we just
got them that's sitting in my garage waiting to be configured.. (8-

This way I'll be more comfortable knowing it's in good working order before
sending it off to them in the mail..

Cool.. I'll see about starting it tonight and let you know what happens..

-- Rick

#12 August 27th 05, 05:50 AM

In article , Rick F. wrote:
In article , Nut Cracker wrote:
memtest86 ... you boot from the floppy, fire up the program, and come back
in 2 or 3 days when it's finished its testing loops.

Cool.. I'll see about starting it tonight and let you know what happens..

Ok.. I let it run for about 2 1/2 days and it got 19 passes in the utility
with NO errors detected. None.. Zero.. Nada.. Oh well..

Now, one thing that I'm wondering about that I really hadn't given much
thought to at the time.. I had done a bunch of e2fsck's on my root partition
(/dev/ida/c0d0p4) and a number of problems were found at the time and
it didn't seem to matter how many times I rebooted, ran e2fsck, rebooted
again, etc. I always seemed to get the same errors.. I *was* thinking
that I might have some bad blocks on the root partition and this evening
after rebooting from the memory tester, I booted into emergency mode
and ran e2fsck -fc /dev/ida/c0d0p4 and it didn't find anything.. Go
figure.. Anyway, I've got it back up and running and did an fsck on
all partitions but nothing turned up.. I'll see how long it runs before
it reboots again..

-- Rick

#13 August 27th 05, 05:11 PM

Rick F. wrote:

Now, one thing that I'm wondering about that I really hadn't given much
thought to at the time.. I had done a bunch of e2fsck's on my root partition
(/dev/ida/c0d0p4) and a number of problems were found at the time and
it didn't seem to matter how many times I rebooted, ran e2fsck, rebooted
again, etc. I always seemed to get the same errors.. I *was* thinking
that I might have some bad blocks on the root partition and this evening
after rebooting from the memory tester, I booted into emergency mode
and ran e2fsck -fc /dev/ida/c0d0p4 and it didn't find anything.. Go
figure.. Anyway, I've got it back up and running and did an fsck on
all partitions but nothing turned up.. I'll see how long it runs before
it reboots again..

I don't believe that Linux will ever do a spontaneous reboot with
no error message in response to a disk error. We are looking for
something that kicks the legs out from under the OS before it can
react - which tends to point to CPU/memory/motherboard/OS.

I would try the following next:

Get the latest Knoppix CD, boot it up, open up as many programs as
you can, and let it sit. If you get no reboots, you know that it
isn't a CPU/memory/motherboard problem - that would hit Knoppix.
If you still get reboots, you know that it isn't the hard disk or
Fedora that is causing the reboots.

Try running on just the left power supply for a few days, then
just the right. If you still get reboots in both cases, it
isn't a power supply problem.

Try running on half of your RAM for a few days, then on the
other half. If you still get reboots in both cases, it isn't
a RAM problem.

Try running on half of your processors for a few days, then on
the other half. If you still get reboots in both cases, it
isn't a CPU problem.

As a last resort, you can look for a really cheap 5500 on eBay
(no memory, no disks, single low-MHz CPU units are really quite
cheap) - ideally with local pickup - and start transferring
parts over one-by-one with testing at each stage.

Don't forget to post full details when you solve this so the next
fellow will know what to look for.
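
The elimination steps above can be tracked with a few lines of Python. This is only a sketch: the suspect names and the remaining_suspects function are made up for illustration, and each rule simply restates one of the tests listed above.

```python
# Hypothetical bookkeeping for the swap tests; names are illustrative only.
ALL_SUSPECTS = frozenset(
    {"cpu", "ram", "motherboard", "power supply", "hard disk", "fedora"}
)


def remaining_suspects(knoppix_rebooted=None, psu_both_rebooted=None,
                       ram_both_rebooted=None, cpu_both_rebooted=None):
    """Return the suspects not yet cleared; None means "test not run yet"."""
    suspects = set(ALL_SUSPECTS)

    # Knoppix runs from CD, so the hard disk and Fedora are out of the loop.
    if knoppix_rebooted is True:
        suspects -= {"hard disk", "fedora"}
    elif knoppix_rebooted is False:
        suspects -= {"cpu", "ram", "motherboard"}

    # A part is cleared only if reboots happen on BOTH halves of the swap.
    if psu_both_rebooted:
        suspects.discard("power supply")
    if ram_both_rebooted:
        suspects.discard("ram")
    if cpu_both_rebooted:
        suspects.discard("cpu")
    return suspects


# Example: reboots under Knoppix and on either power supply.
print(sorted(remaining_suspects(knoppix_rebooted=True, psu_both_rebooted=True)))
```

A swap test where only one half reboots is not modelled here; in that case the half that rebooted holds the prime suspect.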

#14 August 30th 05, 12:54 AM

In article , Guy Macon wrote:

I don't believe that Linux will ever do a spontaneous reboot with
no error message in response to a disk error. We are looking for
something that kicks the legs out from under the OS before it can
react - which tends to point to CPU/memory/motherboard/OS.

Hmm.. An interesting tidbit of info.. On Saturday I shut down any
unnecessary daemons so the OS itself is pretty much the only thing
running and the machine has now been up for :

uptime
16:43:14 up 2 days, 13:17, 3 users, load average: 0.08, 0.02, 0.01
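
For anyone wanting to turn an uptime line like that into the time of the last boot, here is a rough Python sketch. The boot_time helper is hypothetical, it only understands the "up N days, HH:MM" and "up HH:MM" forms, and the calendar date is an assumption since uptime prints only the time of day.

```python
import re
from datetime import datetime, timedelta


def boot_time(now, uptime_line):
    """Estimate when the machine booted from one line of `uptime` output."""
    match = re.search(r"up\s+(?:(\d+)\s+days?,\s+)?(\d+):(\d+)", uptime_line)
    if match is None:
        raise ValueError("unrecognised uptime format: %r" % uptime_line)
    # The day count is optional, so a missing group counts as zero.
    days, hours, minutes = (int(g) if g else 0 for g in match.groups())
    return now - timedelta(days=days, hours=hours, minutes=minutes)


line = "16:43:14 up 2 days, 13:17, 3 users, load average: 0.08, 0.02, 0.01"
# Assumed date for illustration; only the 16:43:14 comes from the output.
now = datetime(2005, 8, 29, 16, 43, 14)
print(boot_time(now, line))
```

Subtracting 2 days 13:17 from 16:43 puts the last boot at about 03:26 in the morning, two days earlier.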

IF this continues (no reboots) for the next few days, that will tell
me one (or more) of the following may be the case :

o Little system activity = no stress to tickle the underlying problem

o One of the daemons I killed is one of the culprits

o all of the above

For the record, the daemons I disabled are :

1) Dovecot (imap daemon)
2) Apache/PHP setup for virtual hosting
3) MySQL 4.1.11
4) exim 4.51 modified to feed all incoming mail to sa-exim-dspam for
realtime spam filtering/rejection -- uses dspam as the backend
5) ncftpd
6) Samba

*MOST* of the time when the machine takes a hit, I'm not using it (I'm the
only user of it aside from hackers) and it's in the wee hours of the morning.
I guess I'd be tempted to point to Apache IF it's determined that one of the
daemons is at fault.. The only daemon I've not shut down/disabled is my sshd
which is the only way into the box remotely.

Feel free to tell me I'm blowing smoke here or not.. I'd rather not re-install
another version of OS if at all possible (e.g. Knoppix as suggested) as it's a
pain in the $##$# to get things customized again.. I'll let this server sit
around like it is now for a few more days and if no reboots yet then I may
decide to re-enable a few of the daemons just to see what happens.

-- Rick

#15 August 30th 05, 09:07 PM

Rick F. wrote:

IF this continues (no reboots) for the next few days, that will tell
me one (or more) of the following may be the case :

o Little system activity = no stress to tickle the underlying problem

I am guessing no, but it's easy to test. Run a program that exercises
the CPU a lot (Seti@home, for example), one that uses up lots of memory
(just open a bunch of copies of any convenient file), and one that exercises
the disk a lot (are there any Linux defragmenters that are still around?).
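
If no ready-made programs are handy, the CPU, memory and disk exercise could be roughed out in Python along these lines. This is a minimal sketch, not a real burn-in tool: the function names and sizes are made up, each call loads only one CPU, and the disk read-back may be served from the page cache rather than the platters.

```python
import hashlib
import os
import tempfile
import time


def burn_cpu(seconds):
    """Hash in a tight loop for the given time; returns the rounds done."""
    deadline = time.monotonic() + seconds
    data = b"x" * 4096
    rounds = 0
    while time.monotonic() < deadline:
        data = hashlib.sha256(data).digest()
        rounds += 1
    return rounds


def fill_memory(megabytes):
    """Allocate a block, fill it with a pattern, and check it reads back."""
    block = bytearray(b"\xa5" * (megabytes * 1024 * 1024))
    return block.count(0xA5) == len(block)


def exercise_disk(megabytes, directory=None):
    """Write a scratch file, read it back, and compare checksums."""
    chunk = os.urandom(1024 * 1024)
    written = hashlib.sha256()
    read_back = hashlib.sha256()
    with tempfile.NamedTemporaryFile(dir=directory) as scratch:
        for _ in range(megabytes):
            scratch.write(chunk)
            written.update(chunk)
        scratch.flush()
        os.fsync(scratch.fileno())  # push the data out to the device
        scratch.seek(0)
        while True:
            data = scratch.read(1024 * 1024)
            if not data:
                break
            read_back.update(data)
    return written.hexdigest() == read_back.hexdigest()
```

Calling the three functions in a loop with larger sizes, one copy of the script per processor, and leaving it running for a few days would show whether load alone brings on a reboot.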

o One of the daemons I killed is one of the culprits

That would be my guess.

I'd rather not re-install another version of OS if at all possible
(e.g. Knoppix as suggested)

(Yoda voice) Learn you must - re-install you must not! Powerful
is the slack^h^h^h^h^h^h force in Knoppix - boot it does from CD
without changes to hard disk. Unplug the hard disk you can!
Trust the force, young Rick; boot Knoppix and see if crashes it
does. Safe your Fedora installation will be.

--
Guy Macon http://www.guymacon.com/

#16 August 31st 05, 05:26 AM

"Guy Macon" http://www.guymacon.com/ wrote in message
...

Rick F. wrote:

Now, one thing that I'm wondering about that I really hadn't given much
thought to at the time.. I had done a bunch of e2fsck's on my root
partition
(/dev/ida/c0d0p4) and a number of problems were found at the time and
it didn't seem to matter how many times I rebooted, ran e2fsck, rebooted
again, etc. I always seemed to get the same errors.. I *was* thinking
that I might have some bad blocks on the root partition and this evening
after rebooting from the memory tester, I booted into emergency mode
and ran e2fsck -fc /dev/ida/c0d0p4 and it didn't find anything.. Go
figure.. Anyway, I've got it back up and running and did an fsck on
all partitions but nothing turned up.. I'll see how long it runs before
it reboots again..

I don't believe that Linux will ever do a spontaneous reboot with
no error message in response to a disk error. We are looking for
something that kicks the legs out from under the OS before it can
react - which tends to point to CPU/memory/motherboard/OS.

I would try the following next:

Get the latest Knoppix CD, boot it up, open up as many programs as
you can, and let it sit. If you get no reboots, you know that it
isn't a CPU/memory/motherboard problem - that would hit Knoppix.
If you still get reboots, you know that it isn't the hard disk or
Fedora that is causing the reboots.

Try running on just the left power supply for a few days, then
just the right. If you still get reboots in both cases, it
isn't a power supply problem.

Try running on half of your RAM for a few days, then on the
other half. If you still get reboots in both cases, it isn't
a RAM problem.

Try running on half of your processors for a few days, then on
the other half. If you still get reboots in both cases, it
isn't a CPU problem.

As a last resort, you can look for a really cheap 5500 on eBay
(no memory, no disks, single low-MHz CPU units are really quite
cheap) - ideally with local pickup - and start transferring
parts over one-by-one with testing at each stage.

Don't forget to post full details when you solve this so the next
fellow will know what to look for.

Played part of this hand once on a server that would reboot every few
hours. Day crew went nuts over it. Sys admin at first thought the original
setup was fubar. They checked logs and events, everything, and all they found
was that load didn't matter, nor did apps, users, OS, etc. They played with it for
several weeks without an answer, swore it was a software problem. So they
took it off line and passed it to us night guys and told us to play with it
and figure out what the problem was. Sr. tech had us strip it to bare bones
and start over. By the third night we had the answer, video card had some
glitch QA had missed and at random intervals it would kill the system. Only
found it by doing it the stupid way, swap a component and run it until it
failed or ran 24 hours without a hiccup. If all else fails that might be
the route to take.

KC

#17 August 31st 05, 06:49 AM

In article , Kevin Childers wrote:

Played part of this hand once on a server that would reboot every few
hours. Day crew went nuts over it. Sys admin at first thought the original
setup was fubar. They checked logs and events, everything, and all they found
was that load didn't matter, nor did apps, users, OS, etc. They played with it for
several weeks without an answer, swore it was a software problem. So they
took it off line and passed it to us night guys and told us to play with it
and figure out what the problem was. Sr. tech had us strip it to bare bones
and start over. By the third night we had the answer, video card had some
glitch QA had missed and at random intervals it would kill the system. Only
found it by doing it the stupid way, swap a component and run it until it
failed or ran 24 hours without a hiccup. If all else fails that might be
the route to take.

Hmm.. Latest saga on the server.. I went to try to ssh to it this afternoon
only to get a timeout from my office.. Doh.. I just fired up the screen only
to find the machine had rebooted and got stuck at the following message :

1777-Slot 6 Drive Array - Proliant Storage Enclosure Problem Detected
SCSI Port 1 : Interrupt Signal Inoperative - Check SCSI Cables.

Ok.. So I rip apart the machine and remove the SCSI enclosure looking for
any signs of anything.. Nothing looks amiss. I relocated the SmartArray 3200
card from slot 6 to slot 7, reseat, inspect, etc. No change.. Perhaps this is
part of my initial problem that just started to get a hard-manifestation?

Comments now? Interestingly enough, it prompts for whether I want to continue
or run the setup diagnostics.. I tell it to continue and it boots up fine
(except it initially complained about the disks not being shut down properly,
solved by a nice e2fsck). I'm writing this on it as I type.. Any ideas on
how to proceed? Could this mean my SmartArray 3200 is on its way out?

Perhaps it's getting to be time to shop over at E*bay for some spares..

-- Rick

#18 August 31st 05, 09:14 AM

Rick F. wrote:

Hmm.. Latest saga on the server.. I went to try to ssh to it this afternoon
only to get a timeout from my office.. Doh.. I just fired up the screen only
to find the machine had rebooted and got stuck at the following message :

1777-Slot 6 Drive Array - Proliant Storage Enclosure Problem Detected
SCSI Port 1 : Interrupt Signal Inoperative - Check SCSI Cables.

Ok.. So I rip apart the machine and remove the SCSI enclosure looking for
any signs of anything.. Nothing looks amiss. I relocated the SmartArray 3200
card from slot 6 to slot 7, reseat, inspect, etc. No change.. Perhaps this is
part of my initial problem that just started to get a hard-manifestation?

Comments now? Interestingly enough, it prompts for whether I want to continue
or run the setup diagnostics.. I tell it to continue and it boots up fine
(except it initially complained about the disks not being shut down properly,
solved by a nice e2fsck). I'm writing this on it as I type.. Any ideas on
how to proceed? Could this mean my SmartArray 3200 is on its way out?

Perhaps it's getting to be time to shop over at E*bay for some spares..

That's a ribbon cable, right? I have seen them go bad from people
pulling on the cable to unplug the connector - plus it will be cheap
and light (low shipping cost) to get another.

Keep a lookout for a server with no RAM, few/slow CPUs and no drives
that is available for local pickup. They often go for less than a
card in them would sell for...

#19 August 31st 05, 04:14 PM

In article , Guy Macon wrote:

That's a ribbon cable, right? I have seen them go bad from people
pulling on the cable to unplug the connector - plus it will be cheap
and light (low shipping cost) to get another.

Hmm.. Is this something I can p/u at my local Fry's or equivalent
sort of store -- I *think* it's one of those 68 pin ribbon cables
or something around there (haven't counted the contacts).. I just
wonder if it was marginally bad and whether or not it's my culprit
for the reboots -- particularly if I'm getting occasional odd
behavior with one or more signals being messed up on the cable..

Keep a lookout for a server with no RAM, few/slow CPUs and no drives
that is available for local pickup. They often go for less than a
card in them would sell for...

Yeah.. I'll keep my eyes open.. I picked up this server from a guy on
my local Craigslist.. Thanks!

-- Rick

#20 September 2nd 05, 10:01 PM

In article , Rick F. wrote:
In article , Guy Macon wrote:

That's a ribbon cable, right? I have seen them go bad from people
pulling on the cable to unplug the connector - plus it will be cheap
and light (low shipping cost) to get another.

Hmm.. Is this something I can p/u at my local Fry's or equivalent
sort of store -- I *think* it's one of those 68 pin ribbon cables
or something around there (haven't counted the contacts).. I just
wonder if it was marginally bad and whether or not it's my culprit
for the reboots -- particularly if I'm getting occasional odd
behavior with one or more signals being messed up on the cable..

Ok.. I picked up a new SCSI cable down at my local Fry's this morning
and just plugged it in and no go -- same problem as before.. I guess
I'll be keeping my eyes open for a used SmartArray 3200 on E*bay or
whatnot..

Is it safe to assume that the cages don't go bad ever or very infrequently?
About the only thing I've yet to try is to unplug ALL of my SCSI drives
and reseat them in the cage.. I'm leaning towards a bad controller..

-- Rick
