I guess I've pissed off someone but last week, life took an OMGHUEGSHIT!!11!! on me. I'm almost afraid to ask what else can go wrong because I just know something will. I'm sorry for the length of this, but Murphy's Law rode me like a cheap date this time.
[Day 1]
I walk into work Monday and everyone is panicking. Apparently, our main production server crashed. No one can get to anything. I go into the server room and sure enough, the system (a Dell PowerEdge 4600 server) is powered down and I cannot get it to come back up. Damn.... Hell of a way to start the week, eh? It gets better.
I call tech support and we both agree that it's highly unlikely that all four of the power supplies would have died at the same time. We determine that the Power Distribution board in the system has probably failed. The next step after that would be to change the system board, but we should start with this first. No biggie, right? This system is under warranty and I'll have a new one in less than 4 hours. WRONG! The warranty expired in September and the fucking bean counters didn't renew it. Wonderful.
I get authorization to buy a replacement anyway ($300) and go ahead and order it. I also make it very clear that I need this part ASAP. Overnight is the best I can get. Ok, one day of downtime, a PITA but I can live with that.
My supervisor then says that he thinks it would probably be a good idea to get a new system board just in case. That way there'll be no delay if it turns out that the PD board isn't what's wrong. I call back to order a new system board. It's $1400!! I put them on hold and contact my supervisor to confirm that I can order this. Once he gets back to me, he gives me the ok. I order it with the same urgency on shipping.
BTW, none of the people I spoke to in Dell's Parts center spoke very good engrish. I just know I was talking to someone in Calcutta. Think of any Indian accent you've heard in a movie or on T.V. and that's what they sounded like. Bad sign.
In the meantime, I figured I'd try to pull some files from Friday's backup. I back up to external FireWire drives so it's simple to do. The drive that has the Full Backups... Guess what? Corrupt Master File Table. The drive is inaccessible. This is just getting better and better. I have a few utilities that might be able to repair this but it's gonna take time.
[/Day 1]
[Day 2]
I'm pulling into my office building and see a FedEx truck outside our front door. "Great!! I can have this thing back up in an hour or so..." Nope, nothing for me. FedEx Next Day means "In your hands before 10 a.m.". It's only 8. Oh well.
I get in my office, check the utils that have been running all night (or so I thought), and they had bombed out shortly after I left because of the corrupt Master File Table (MFT). Crappity crap. I try some other utils and leave 'em running. Not much else I can do at this point until I can get those parts.
9 a.m.... nothing
10 a.m... nothing
11 a.m... nothing
12 p.m... The system board arrives. Huh? I ordered that 2 hours after the Power Distribution board and that was just a contingency. This isn't going well. I check the order status of the PD board and it shows up as not being delivered until Jan 30th. That's NEXT Monday. WTF? Did the parts rep not understand that it was critical that I get this ASAP? I paid for overnight, not overweek.
I tell my boss about the situation and he tells me to go ahead and swap out the system boards. We might get lucky. An hour later, I have the system back together and connected. Still no luck powering it up. Shit. Guess it wasn't the system board.
By now, Production Managers are wondering what the prognosis is. Can I get stuff off of the backup? We're ahead of schedule but we DO have a deliverable due on Monday. I check on my recovery util and it's not looking good. Oh gee; more good news. When it rains it pours.
Well, there is VERY little I can do until I get that PD board. I temporarily set up my remaining server as a DHCP/DNS server (the one that died did that), and tell everyone that there's nothing I can do. They may as well have a day off, courtesy of me and Dell.
[/Day 2]
[Day 3]
This day started off with a bang before I even GOT to the office. I leave my house at 7 a.m. and at the first light I come to, there's been an accident. I came upon it maybe 3 minutes after it happened. 15 minutes later, I'm rolling again. At the VERY NEXT LIGHT, the car in front of me BREAKS DOWN and I have to get out and help this girl push her car off of the road. A few miles later, I'm passing an industrial park and they have the road blocked off, letting some huge flatbed trucks with big pieces of machinery pull out. Then I get stuck waiting for a train to pass. I'm finally pulling around to the entrance of my parking lot, and a construction crew has the road blocked with a cement truck (at this stage, if I parked where I was, I would only be 20-30 feet away from where I normally park...). I finally get into my office and turn on my desk lamp; the light bulb blows out. It is 8:35 a.m. and I live 16 miles away from my office.
My supervisor gets on the phone with our Dell rep and gets our warranty reinstated; not for the purpose of getting the parts for free, but simply so we can GET the fucking parts. Warranty customers take precedence over non-warranty customers and they were gonna hold this PD board until Friday, just in case a warranty customer needed it. Zuh? A $300 part is costing my company tens of thousands of dollars a day. Anyway, I should now have the PD board within 4 hours; it's 12:30 p.m.
And there is no luck pulling anything off of the backup drive. If I can't get the system back up and/or the files on it are gone/damaged, it's possible that my company may die along with this system (small company; <100 employees nationwide). I keep reassuring the staff that I am doing EVERYTHING I can. I won't know anything until I can get this server up.
The PD board shows up at 3:30. I get it replaced and reassemble the system. The system powers up, but the hard drives aren't powering up. Great. The PD board killed my drives and I just lost millions of dollars' worth of information. I call Tech Support again. Fortunately, we determine that the SCSI backplane is what is dead, not the drives. I'll have one within 4 hours; it's now 5 p.m.
7 p.m. - The replacement backplane arrives. I swap it out and get everything put back together. I get it all plugged back in and BAM! Everything comes up as if nothing had happened. All of the data is there and everything looks to be in order. I immediately reformat/chkdsk the failed external drive, and run a full backup on the other drive I use for full backups. I send out a company-wide email telling everyone that we are back in business and I'm sorry for the inconvenience.
[/Day 3]
I got home that night at 9:30 p.m.