Not a good week to be a hard drive

On top of everything else going on this week, this past weekend I got an email from one of our servers. Let’s call her File Server #2.

The system has detected the following event:

SNMP Trap: 3036

System URL: http://CCRFS2MFE:2301/

Date time: 02/01/2008 05:08:12 PM

Computer: CCRFS2MFE

Source: Storage Agents

Type: Warning

Category: (4)

Description:

A ‘Physical Drive Status Change’ trap signifies that the agent has detected a change in the status of a drive array physical drive.

Details:

IDA Physical Drive Status ‘PREDICTIVE FAILURE’

Error Code 0

Drive Bay # 3

Bus # 1

Controller Slot # 1

So I get into the server and notice that FS2, Controller 1, Drive 3 has gone into PFA status. PFA means “predictive failure alert”: a condition has occurred that indicates a failure is likely.
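If you get a lot of these emails, pulling out the bits that matter (which machine, which controller, which bay) is easy to script. Here is a rough Python sketch, assuming the alert body looks like the one above. The function name `parse_alert`, the dictionary keys, and the trimmed-down sample text are my own illustration, not anything the HP agent provides.

```python
import re

# Trimmed copy of the alert body from the email above (straight quotes for simplicity).
ALERT = """\
Computer: CCRFS2MFE
Type: Warning
IDA Physical Drive Status 'PREDICTIVE FAILURE'
Drive Bay # 3
Bus # 1
Controller Slot # 1
"""

def parse_alert(text):
    """Collect the machine name, status, and drive location from an alert body."""
    info = {}
    for raw in text.splitlines():
        line = raw.strip()
        # "Computer: NAME" and "Type: Warning" style fields
        m = re.match(r"^(Computer|Type):\s*(.+)$", line)
        if m:
            info[m.group(1).lower()] = m.group(2)
            continue
        # "Drive Bay # 3", "Bus # 1", "Controller Slot # 1"
        m = re.match(r"^(Drive Bay|Bus|Controller Slot) # (\d+)$", line)
        if m:
            info[m.group(1).lower().replace(" ", "_")] = int(m.group(2))
            continue
        # Status line; the quote characters around the status are optional
        m = re.match(r"^IDA Physical Drive Status\s+\W?([A-Z][A-Z ]*[A-Z])\W?$", line)
        if m:
            info["status"] = m.group(1)
    return info

alert = parse_alert(ALERT)
print(alert["computer"], "controller", alert["controller_slot"],
      "bay", alert["drive_bay"], "->", alert["status"])
```

Lines it doesn't recognize are simply skipped, so the rest of the email can be pasted in as-is.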

So I pull up the warranty information and, sure enough, it’s out of warranty, so I ordered a replacement drive. It came in today. Before removing the bad drive and installing the new one, I decided to back up the logical drive to File Server #3. All that FS3 really does is back up our music system, so I removed that backup and began replacing it with the backup of FS2. While all that was going on, I got another email:

This message was generated by the Adaptec Storage Manager Agent.

Please do not reply to this message.

Event Description: Logical device is degraded: controller 1, logical device 2 (“Secondary”)

Event Type: Warning

Event Source: CCRFS3MFE

Date: 02/07/2008

Time: 03:23:56 PM CST

What? Two failures at once? Not really likely, so I check it out. Sure enough, FS3 decided to take a dump too. So again I check the warranty on this machine and, sure enough, it’s out of warranty. This time I call up HP on the super secret Clear Channel customer support number, give them my password, and talk to them. They acknowledged that the system was out of warranty, but they decided to send me a new HD anyway. I’m grateful that’s getting addressed. Now I just have to wait for the drive to get here, fix FS3, back up FS2 to FS3, fix FS2, and then, on top of all that, FS1 needs to be restructured.

Looks like I’m going to be planning an all-nighter repairing all these machines.

I haven’t seen this much trouble since the time one of the file servers at the bank decided to take a dump on the external hard drive rack. I usually keep one hot spare in the array for just such an occasion: one drive dies, the hot spare takes over. But that time, two drives crashed at once. Thank God for RAID.