BOFH: Power corrupts, uninterrupted power corrupts absolutely

Year 2015 - Episode 13

"THE POWER'S OUT!" the Boss shouts, blundering into Mission Control like a robotic vacuum in super-random turbo mode. "THE TRANSFORMER DOWN THE ROAD HAS EXPLO... hey, why are your lights still on?"

"They're on the UPS. Aaaaaaaaaaaand... wait for it..." I say, after a slight flicker; "... on the generator, too."

"Why is your room on the UPS and generator?"

"So that we can shut systems down and bring them up in an orderly manner," the PFY says.

Which is a lie.

"We need to ensure that the server room is able to be recovered in the most optimal manner when the power comes back on," he adds.

Another lie.

"And of course sometimes we're here overnight till the power is restored."

Absolute crap. The only time either of us would be here overnight is if we'd spent our cab fare on booze after the tubes had stopped running.

In actual fact, we long ago realised that any power outage lasting longer than 5 minutes may as well be a half-day outage, because everyone decamps to the pub across the road and is reluctant to return. As a result, a few changes had been made in the last couple of months to "optimise the use of our resources". All access switches are now on mains power and go out at the same time as the users' desktops – along with the phones, wireless access points and any other PoE devices.

The server room UPS and generator now only supply crucial services – Mission Control, the core router and firewall, the webserver (so staff smart enough to check our website with their cellphones think we're still 20/20), our ET Legacy server, Mission Control air conditioning and – in winter – the PFY's 3-bar heater, which has an efficiency in the low teens.

The server room isn't completely unprotected though – we slapped in a small mains-fed UPS with a battery life that relies heavily on our 1-minute outage autoshutdown scripts kicking in.

"Why's it so quiet?" the Boss asks, gesturing at the wall between us and the server room.

"We've shut down most of the servers to protect them from any stray capacitance or line spikes from the power outage," the PFY says.

Honestly, if he could fabricate any better he'd be a 3D printer. Although the media would vary in consistency and would need to be baked afterwards, obviously. In ACTUAL fact, the auto-shutdown scripts I mentioned earlier probably need tweaking based on the age of the dodgily sourced East European batteries – calculated in lost capacity per month.

"Then why do we bother even HAVING a UPS and generator, if it can't keep us up and running?" the Boss asks.

"In the old days, when we had around 40 servers with a total power requirement of about 30kVA – including Mission Control – the 100kVA generator fed the two redundant 30kVA UPS units with sufficient capacity left over to power one of the lifts," I reply.

"These days," the PFY says, getting in on the act, "with the number of servers we have, with their development, test, pre-deploy and production instances – even WITH virtualised platforms – there's barely enough capacity left over to power the La Marzocco GB/5."

"What's that?"

"It's a CMS," I say, not muddying the water by telling the Boss that in this instance, the C stands for coffee. "So because we're borderline like this, we've pulled power back to the absolutely crucial services – that way we get a chance to shut things down in a stable fashion, while maintaining a work platform from which to launch recovery services."

Complete bollocks.

"So our UPS and generator don't work?"

"They would work perfectly well until they're needed, at which point the two – now non-redundant – 30kVA UPS units trip into bypass, being unable to deliver the 40-odd kVA that's demanded of each of them. The power goes out because the generator has a 30-second start delay built in and hasn't autostarted yet. About 25 seconds later, the generator autostarts, runs up until the power quality breaker is energised and determines that the voltage has been consistently stable for 10 seconds, at which time it trips the breaker to cut over from the dead mains to our generator supply. The breaker trips, the generator internal contactor shits itself – because the start current demand from everything downstream of the UPS is about twice the capacity of the generator – and the contactor opens, but not before sending about 1/2 a wave of around 160V down to every piece of downstream equipment, with a big spike when it chops off. With the contactor open, the power quality breaker de-energises open so that when the mains comes back on, the power to the equipment will too. Only the generator is still running at no load, and the contactor – which is a combination electromagnet and bimetallic thermal shutoff – will reset itself within 8 to 10 seconds. The power quality breaker sees a clean supply, cuts over, the contactor trips, the power goes off with another spike, the contactor resets and the process repeats."

"The equipment which isn't ruined by the spikes," the PFY says slowly, "is usually the stuff which is 'protected' by crap rack breakers which can't handle the start current either. But it's always the cheapest stuff which is in those racks: the good equipment is always in dual-fed racks with D-curve breakers which are more than happy to pass spikes, even though they claim to have surge protection. The bad news is that with those crap racks out, the contactor trip voltage now rises higher – sometimes as high as 220V – so the spike voltage increases with every reset."

"What can we do?" the Boss gasps. "Can we get a bigger generator and bigger UPS units?"

"The generator for this place is on the roof and it came with the building. About the only thing that could get that off the roof is an earthquake – and then it would only travel vertically."

"So what do we do?"

"We rate the services we consider most critical and keep them running. The rest we shut down."

"How do we choose?"

"I guess we discuss it. Over a pint. At the pub. Now?"

"At the pub?"

"Well, nothing's happening here, is it? And with a dead transformer we're talking about at least a day to reinstate. So let's have a couple of quiet ones and discuss priorities."

"I..." the Boss says, thinking about it but seeing no alternatives. "I guess so."

Now we just need to ensure we get several pints into him before he realises that (a) we're not paying for our drinks, (b) the pub is the only building on the block apart from Mission Control with power, and (c) there's a thick cable which seems to join our two buildings.