Redundant – but it still fails!
By Julius Grafton
Radiohead’s first set at Coachella was crippled by audio failures from the weakest link, their touring FOH console. While all professional console systems have dual power supplies and battery backup, few console systems are themselves duplicated.
But lighting guys decided a long time ago to run dual control systems when some popular computerized consoles were deemed less reliable. It was common to see two consoles side by side – and this was across several major brands.
Norwest have made a name for themselves as the ‘go-to’ firm for massive event audio at international events like Olympic Games opening and closing ceremonies, ever since the Sydney Olympics in 2000. They get contracted because nothing goes wrong, as their redundancy planning has proved bulletproof. Maybe they swear by the ‘Six P’s’: pedantic planning prevents piss poor performance.
So I was somewhat amazed to read about British Airways going offline – literally – in late May. Pretty well everything from their servers downstream fell over on one of the busiest holiday weekends of the year.
The knock-on effect was disastrous, with flights on tarmacs neither coming nor going, tens of thousands of bags trapped in the system, and enormous queues of people being told nothing by staff who knew nothing.
Worse still, once the situation became established, staff in the various lounges at Heathrow and Gatwick removed all the grog. Perhaps fearing an uprising?
In the aftermath I collected a couple of corporate disaster stories.
“My large company had its main datacenter go down recently. Basically there was a power outage to the site, which is very rare. The battery backup in the datacenter worked fine; however, it was night, so the solar panels (a very large array) didn’t produce electricity, and the four enormous 18-cylinder diesel generators had an unforeseen malfunction: their startup batteries had too low a voltage. The UPS batteries were drained before electricians could get to the site and restore power, and all the IT equipment shut down. Fortunately restarting everything went OK, aside from a few servers that didn’t like it.”
“In early 2011 our firm had an IT outage. We had an IBM mid-range computer, a small mainframe with mirrored discs. This had a battery-powered uninterruptible power supply (UPS). There was also a stand-by 150 kVA diesel-powered generating set, which took 2 minutes to automatically power up and chop in, to relieve the UPS.
“The wiring to the IBM box was via a 60 amp supply from our incomer. Separately, our various rings supplied sockets for fused 13 amp plugs. We had a separate red ring via the battery UPS, for mission-critical 13 amp plugged items, including comms to remote depots and some screens. These sockets were red, to avoid people plugging in items with heavy loads which would exceed the UPS capacity, electric fires for instance.
“A contractor working with us had been issued with a brand new PC which he had correctly plugged into a white socket. This non-IBM PC failed in an exciting ‘flash bang’ manner. The chap was not hurt at all, but he was very surprised, not to say shocked, though not electrically. Those in the open plan office likewise.
“The IBM box, running 150 screens over some 15 locations, came to an immediate halt and was down for 6 hours, after which it was re-booted. Apparently, some sort of electrical shock wave (I am not an electrical engineer) had emanated from the flash bang of the new PC, which was probably a lithium-ion battery failure. This shock wave had apparently passed through the UPS, which was unharmed, and knocked out one of the two independent power supplies within the IBM box.”
I worry about my media empire too. We use Xero cloud accounting software which, significantly, does not allow a data backup. They have all kinds of reasons for this. It means we print hard copies of all our monthly accounts so we can still collect money if Xero goes to zero.
Likewise, Dropbox holds our archives and sure, we have local synchronized files, but Dropbox also deletes these if they are deleted in the cloud. It’s a weak link, and one that I have backups to cover.
We once had a RAID (redundant array of independent disks) system in our server that went down because the power supply failed. That in turn trashed all those redundant disks. We lost a lot of data that wasn’t on our accounts backup – significantly my entire dating email cache at the time, which was a disaster.
Finally, our (now closed) TV unit had a multi-terabyte server connected by a large ADSL link to an identical unit several suburbs away. I’ve always lived by the maxim that data doesn’t exist unless it exists twice.