We all know about the National Australia Bank snafu that has rendered thousands of people unable to get to their funds over the past week. We can have nothing but sympathy for their customers’ position, obviously. Banking is one of the fundamental planks of our societal infrastructure, and when it fails, its repercussions are often disastrous for people.
Last night came the “revelation” that the outage was caused by human error. After 15 years in the IT industry, I was not surprised. It was always going to be human error of one kind or another: either the architecture of the system that failed was inadequate right from the start, or the processes were flawed, or someone had Pressed The Wrong Button. Unfortunately, it turns out it was the last of these. I feel for the guy.
As a senior tech and architect, I am well acquainted with the risks of being in the IT game. People fail to appreciate the position we put ourselves in every day when we go to work. IT systems are by their very nature an intricate, complicated bitch of a house of cards, and we know that the balance of probability is that at some time in our careers we’re going to bump one and knock it down. This knowledge in some ways renders getting out of bed of a morning clinically insane. You never know when it is going to be your turn. One little mistake can take out a company for several days. When systems do break and break but good, they take Time to get going again. We have to wait on vendors, or rebuild systems from scratch, or restore terabytes of data from tape. It’s not always as simple as “Have a backup system”, as much as we like to think so.
Because computer technology pervades modern life, we as its keepers have the potential to take companies offline and adversely affect hundreds or even thousands of people. And there is just nowhere to hide.
This is compounded by the fact that mostly the systems we look after run on products bought, strictly as-is, with no liability express or implied, from software vendors whose myriad developers may or may not have been up to the task allotted to them. This extends down into the hardware these systems run on, because even it has embedded code, written by humans.
We rely upon these faceless people to get not only their code right, but also the documentation – the advice and procedures – that they publish. If it is not right, we will be steering the ship straight at the rocks full speed ahead.
So we tiptoe through, we check everything six times, and we test religiously, if we know what’s good for us. Things take a long time to get done because we’re making sure that whatever action is proposed is going to be right, because if it’s wrong, the Big Boss tends to come calling.
As you move along in the industry, you can move into architectural/design positions, where you have the scope to screw up larger and larger systems and cause larger and larger outages. You would think there would be some commensurate prestige for the added responsibility, but the truth of the matter is that the field is so bloody esoteric that people just don’t get what goes on behind the scenes. Compounding the thanklessness of the job is the fact that, whilst any outage or even slight degradation in service results in outraged screams from users, huge improvements in performance or functionality are quickly forgotten and taken for granted as the new minimum requirement of service, if indeed they’re noticed at all.
It raises the question: why do we do it, and how do we cope? Sometimes we wonder, we really do. But we do it because we’re good at it and deep down we don’t trust anybody else to do it. We cope by getting Good, and by getting acquainted with every support resource in existence for the technology we support. It’s get good or get out.
So next time you want to have a go at an IT guy for not getting things fixed fast enough or for things being down in the first place, remember that he is constrained by the quality of the systems his managers let him buy and build, by the quality of the products and support resources supplied by the vendors he is told to use, by how fast data can get from place to place, and by a dozen other things. He’s going as fast as he is physically able to.
If you want to get things moving as fast as they possibly can, bring coffee and pizza. And then let him work.