At lunch the other day Hal was talking to me about responding to incidents. In sysadmin world, an incident means that something went wrong or was hacked or otherwise misbehaved such that it needed to be fixed unexpectedly. At Berkman, incidents are pretty frequent. Sometimes they happen during the day, sometimes they happen in the evenings or mornings, occasionally they happen late, late at night.
Hal said that it is very important, especially at Berkman, to take full responsibility for what has broken, fix it promptly, and apologize for what went wrong. We’re running all sorts of not-really-tested applications, we’re understaffed for the amount of stuff we’re doing, and we move too quickly to be able to really test and evaluate and audit every nut and bolt of our setup, so it is inevitable that things will break. People are tolerant of that, as long as the person in charge *takes charge*, admits to the error, fixes it, and shares any lessons learned.
I have no problem with this philosophy; I believe it is a good one. Often, things break for various reasons, sometimes completely out of your control, but when you’re the person in charge of the systems, the buck stops with you and, whatever happens, you’re ultimately responsible for it. That’s me. I’m the guy responsible for everything. It’s a role that I’ve been easing into over the past few months. But I’m still not comfortable with it.
My problem is a simple one, and it gets to my control-freak nature: I want to know everything that’s going on everywhere. I want to know what’s doing what where and why and when and how. This is the proper approach, I’m told, at least in an ideal world. Everything affects everything else, and one little misconfiguration can be the hole the hacker needs to break in, or the proverbial feather that brings everything crashing down. And because we are running a variety of old systems that have grown organically over several years, I’m still not comfortable with what we’ve got going, and so I’m not comfortable taking the responsibility.
When will I be comfortable? When everything has been freshly installed somewhere new, according to the procedures I outlined, with me directly involved in the process. And that’s not to say that my procedures and approaches are at all better than what we have. All it means is that I’ll know exactly what each machine is doing, and, as much as possible, each machine will be configured identically. Creative destruction is what I’m doing, the inevitable churn, and I think in the end we’re going to be in a better place.
And so I’ve been cleaning house ever since I got here, and gee has it been exciting. I’ve learned tons of new things, I’ve messed up tons of systems, and I’ve created tons of great new stuff as well. And we’re getting there. Slowly, with much effort and quite a few missteps and all kinds of unforeseen circumstances, we’re getting there. And what I’ve been learning about more and more recently is how important it is to document processes and setups and to test, test, test before deploying anything. That means taking time to create a similar setup on a staging server, run the updates, see what broke, roll it back, fix it, run it again, et cetera until you’re 100% confident, then send out the email scheduling the switch, then do it at the appointed time, and then test it and, if necessary, back out the changes and leave it for another day.
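The stage-test-rollback cycle above can be sketched in a few lines of shell. This is just an illustrative sketch, not our actual scripts: the three command slots (update, smoke test, rollback) are placeholders you’d fill in with real ssh or package-manager commands for your own staging box.

```shell
#!/bin/sh
# Hypothetical sketch of the staging cycle: apply an update, smoke-test it,
# and roll back if the test fails. All names here are placeholders.

try_on_staging() {
    update_cmd="$1"     # e.g. ssh staging.example.org 'apt-get -y upgrade someapp'
    smoke_test="$2"     # e.g. ssh staging.example.org '/usr/local/bin/check-someapp'
    rollback_cmd="$3"   # e.g. ssh staging.example.org 'tar -xzf /var/backups/someapp.tar.gz -C /'

    # Apply the update on the staging machine first, never on production.
    $update_cmd

    # If the smoke test passes, schedule the production switch;
    # otherwise roll back and leave it for another day.
    if $smoke_test; then
        echo "staging OK: schedule the production switch"
    else
        $rollback_cmd
        echo "staging broke: rolled back, leave it for another day"
    fi
}

try_on_staging true true true
try_on_staging true false true
```

Running it with `true`/`false` stand-ins exercises both branches, which is the point: you repeat the cycle on staging until the happy path is the only path you ever see, and only then touch production.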
It’s a plodding approach that goes against so much of what I stand for, and clashes with everyone else who wants everything done better and faster, but it is the only approach that really makes sense. More and more I understand why central IT at major organizations is so inflexible and moves at such a glacial pace. And I’m not going to say that I agree with this all the time — I think the bureaucracy that surrounds many small decisions is incredibly overblown and wasteful — but I am starting to really understand how they get the way they do.
The most interesting take-away from everything I’ve learned so far, I believe, can be summed up in four words: *Rich Graves is God*. Rich, you may remember, is that quirky guy who came into Brandeis to implement some new directory stuff and ended up pretty much building UNet, the software and server infrastructure that drives the university network, from the ground up. Probably a hundred times, and I am not exaggerating, I have seen something at Harvard working one way or another, and I’ve either known how they could make it better, because it was better at Brandeis, or I haven’t known, but by looking at the Brandeis documentation and Rich’s bboard posts, I’ve discovered how it *should* be done. Email. Spam filtering. Computer registration. Web space. User file storage. All of these things are better at Brandeis than at Harvard. And on the few occasions when I’ve suggested changes in line with how Brandeis did things, I’ve been met with only silence (see note).
By obsessively documenting and explaining and responding on message boards, Rich created an electronic paper trail that guides me today, even after both he and I have left Brandeis. He has been my greatest teacher in the field of system administration, and for that I say: thank you, Rich. And since I didn’t mean to turn this little rant into fan worship, I guess I should end now.
I honestly didn’t see myself becoming a sysadmin after college. I probably wasn’t qualified for the job that I was given. But as much as I’ll bitch about my work on the days when things are particularly bad, I really love that this is where I ended up. And that’s why I wake up every day excited to go into work and face the next challenge that the computer fates have seen fit to fling my way.
*Note:* I have received a note from a gentleman from the Harvard department that runs CAMail in which he told me they would be considering a modification to their virus filtering based on the suggestion I gave for how Brandeis handles a particular problem. Awesome!