Fail safes are weaknesses

I’ve been designing systems for some time, and although I admit I’m not a designer of ‘big’ systems architectures, I do have my moments. I also consult with the ‘big boys’: several years ago an IBM consultant helped me with some details of the redesign that led to our current web cluster.

However, I’ve now reached a conclusion: simple systems tend to work better and have higher uptime than complex ones with multiple redundancy routes and failovers. (This is the point at which I extrapolate from my own experience to make a huge generalisation, but bear with me.)

Here are two examples:-

1. The Open University recently implemented a SAN following harassment/encouragement from people like me about the OU Exchange server, which kept falling over because it was a single server. They invested a huge sum in the SAN architecture and run multiple services off it, including mail and file services. Since implementing it they’ve had lots of problems and much downtime. They’ve also had issues with the availability of the service over the network (the result of another enterprise system forcing patches over the network), and (this is where I start giggling) they’ve run out of storage space, through not calculating the amount of stuff accurately and/or running out of funds to buy more. To alleviate the problems locally we’ve moved large amounts of storage to a local machine (a server with a big, low-cost, multi-terabyte disk), which we back up to tape. This works, and has about as much capacity as the whole SAN system at a fraction of the cost.

2. Following advice from an IBM consultant, and also my own ideas about server clustering, we implemented a network cluster of IBM 1U x306 systems connected to a single DB. These were problematic from the beginning, but once we had ironed out the bugs they worked OK for a while. Sometimes, though, synchronisation became a problem. We also found that people would add server things (e.g. a scheduled task) to a single host without mirroring them, so that if that host failed the automated tasks failed to run. And we found that a host in the cluster would sometimes fail to serve pages without actually falling over, so the ‘failover’ on the network would not happen and people would get served nothing. We eventually decided to remove all but one of the cluster group, in effect leaving a single web front-end server and a single DB (both backed up to tape). This is low cost and simple, and has (touch wood) not failed since we removed the other hosts.
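For what it’s worth, the ‘up but serving nothing’ failure in the second example is the sort of thing a page-level health check is meant to catch. The following is only a rough Python sketch of the idea, not what our cluster actually did: the function names, the URL you would pass in and the "OK" marker text are all made up for illustration, and I’m assuming the front end exposes some page whose content can be checked.

```python
from typing import Optional
import urllib.error
import urllib.request


def classify(reachable: bool, status: Optional[int], body: str,
             marker: str = "OK") -> str:
    """Turn the raw result of a probe into one of three states.

    'marker' is a hypothetical string the health page is assumed to contain.
    """
    if not reachable:
        return "down"             # the case a network-level failover normally catches
    if status != 200 or marker not in body:
        return "up-not-serving"   # host answers, but is not serving real pages
    return "healthy"


def probe(url: str, timeout: float = 3.0, marker: str = "OK") -> str:
    """Fetch a page from one host and classify the host's state."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read(4096).decode("utf-8", errors="replace")
            return classify(True, resp.status, body, marker)
    except urllib.error.HTTPError as err:
        # The host replied, but with an error status (e.g. 500).
        return classify(True, err.code, "", marker)
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure or timeout: treated here as 'down'.
        return classify(False, None, "", marker)
```

Anything other than "healthy" would be a reason to take that host out of rotation. Of course, this is one more moving part to maintain, which is rather the point of this post.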

So the latest buzzword is server virtualisation. I’ve seen it in action and it’s very impressive. Will it work for us? I don’t know yet, but I sometimes get the feeling that low cost, simple and straightforward is better than trying to catch the tide of virtualisation/mirroring/consolidation/thin client/SAN/NAT/SATA/RAID… Time to talk to my IBM man again.


About willwoods
I'm Head of Learning and Teaching Technologies in the Institute of Educational Technology at the Open University.
