Average time to first failure, or, What does Oops ... #500 mean?

Is it just me, or has Gmail been having problems recently? Over the past few days I seem to be getting this message much more often. In fact, until a few days ago I only ever saw an error message if I lost my connection.

I wonder if they've been trying something new? If so, hurry up and iron out the wrinkles. What has happened to your QA? Or perhaps it's infrastructure; it must be a barrel of laughs keeping all those servers working.

At ApacheCon US in New Orleans last November I heard an employee of a Big Name make the simple but not so obvious point that if a hard drive has an average time to first failure of 3000 hours and you have 3000 of them, then you can expect one an hour to pop its clogs. If you have 300,000, that's one every 36 seconds.
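
For anyone who wants to check the sums, here is a back-of-the-envelope sketch in Python. It assumes failures are spread evenly over time and that drives fail independently of each other, which is my simplification rather than anything the speaker claimed. The function name is made up for illustration.

```python
def seconds_between_failures(mean_hours_to_failure: float, drive_count: int) -> float:
    """Expected gap, in seconds, between drive failures across a whole fleet.

    Assumes a constant failure rate and independent drives (a simplification).
    """
    if mean_hours_to_failure <= 0 or drive_count <= 0:
        raise ValueError("both arguments must be positive")
    # Each drive fails on average once per mean_hours_to_failure hours,
    # so the fleet as a whole fails drive_count times as often.
    failures_per_hour = drive_count / mean_hours_to_failure
    return 3600.0 / failures_per_hour


if __name__ == "__main__":
    # The two fleet sizes quoted in the talk, with a 3000 hour average.
    for drives in (3_000, 300_000):
        gap = seconds_between_failures(3_000, drives)
        print(f"{drives:>7,} drives: one failure every {gap:,.0f} seconds")
```

Run as is, it reports 3,600 seconds (one an hour) for 3000 drives and 36 seconds for 300,000, which matches the figures from the talk.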

"So," we acolytes asked, in awe, "do you have a guy on rollerblades with a messenger bag coasting the aisles of the datacentre, with his iPod on, replacing these drives?"

"No," our guru responded, "when a certain proportion of the drives in a SAN cabinet fail they switch the box off, and when a certain proportion of the cabinets in a rack fail they toss the whole rack."

Blimey. So if I need a new HDD I guess I need to go skip (dumpster) diving round the back of that datacentre. I might pick up a whole SAN cabinet.

Anyhow. I'm having to use the vanilla HTML version more and more, which is a shame because the Ajax one is really pretty good and genuinely useful, not something you can often say honestly about web applications in general or Ajax in particular.

On the other hand you get what you pay for, and I can always ask for my money back ;-)

