Picture for a moment the troubleshooting guides in the back of appliance manuals...
| Problem: | Solution: |
|---|---|
| Food not cold | |
If fixing a refrigerator were this easy, appliance repair personnel would have a hard time finding work. In reality, this simple guide is just a model: a list of the most common things that can go wrong with a system and what people have done in the past to fix them.
Many IT shops attempt to create troubleshooting guides like this, in the form of run books or other documentation. I don't think that's bad; in fact, I think documentation in this form is some of the most helpful documentation a new incoming administrator can have.
However, especially with the trend towards outsourcing, I've noticed the idea cropping up that these things can be captured in greater detail. We all start to think that if we can just create a detailed enough run book for a particular site, then it would be possible to hand that book to just about anyone and have them successfully run the site.
The reality, of course, is that server applications are far too complicated to capture every possible outcome in a troubleshooting matrix. There are too many branch points, too many possibilities, and too many pieces of information that have to come together to form a complete picture. This is obvious to most people, but there is still the notion that enough useful scenarios can be captured. Using a guide like this is what I call troubleshooting by rote. Someone has developed the steps, the person responsible has practiced them over and over again, and can execute them when necessary. How often does the refrigerator guide actually solve the problem for you?
This is how people often try to train their problem-solving personnel, but there is a significant problem with this approach.
This rote approach is entirely different from the way a natural troubleshooter operates. A good troubleshooter builds and discards mental models as necessary for the moment. Every idea about where the problem might lie becomes a hypothesis to be tested by experiment. Sometimes entirely new experiments have to be developed to prove or disprove a theory. Each new piece of information is fed into the model to make it more accurate. It's an application of the scientific method.
This poses two problems for those documenting their troubleshooting procedures. First, the documentation that a good problem solver needs is generally not just the simple run book or troubleshooting matrix. They need to know the components of the system and how they work together. By necessity the static matrix needs to make a lot of assumptions about where the problem might lie; a master problem solver is a master problem solver precisely because they do not make assumptions (or if they do, they are quickly willing to discard them as just that). So while having a run book can be useful, it is only really useful to a particular audience. There is still a need for passing on more detailed information about the system.
The second problem is that this more detailed level of problem solving is never going to be taught to someone by feeding them a matrix of problems encountered previously. Learning everything about a system only gets someone so far. The primary tools when troubleshooting technical problems are logic, reason, sometimes some math, and a methodical approach. Every situation is different; teaching people that they can all be handled by a single set of steps denies the intuitive nature of troubleshooting.
That, in the end, is what makes a troubleshooter a great troubleshooter. It can seem like the person working the problem has almost an intuitive sense of the system, and indeed that is very nearly what is happening. At some subconscious level, the person solving the problem is generating a feel for the problem. It is tempting to have someone this familiar with a system create rules, such as "If memory usage rises more than 20% in 30 minutes, then do procedure X from the run book." This misses the essence of the problem-solving activity. Often a good problem solver can't express what turned them towards a particular solution. I think they are afraid to admit to others (and maybe themselves at times) that something just felt wrong.
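To make the memory rule quoted above concrete, here is a minimal sketch of what it looks like once written down as code. Everything in it is an assumption made for illustration: the function name `memory_rule_fires`, the `(minutes, megabytes)` sample format, and the idea that samples arrive sorted by time are all hypothetical and do not come from any particular monitoring tool.

```python
from typing import List, Tuple

# Hypothetical sample format: (minutes since start, memory used in MB).
Sample = Tuple[float, float]


def memory_rule_fires(samples: List[Sample],
                      window_minutes: float = 30.0,
                      rise_fraction: float = 0.20) -> bool:
    """Return True if memory rose by more than rise_fraction
    between any two samples at most window_minutes apart.

    Assumes samples are sorted by time (an assumption of this sketch).
    """
    for i, (t_start, mem_start) in enumerate(samples):
        if mem_start <= 0:
            continue  # a relative rise is undefined from a zero baseline
        for t_end, mem_end in samples[i + 1:]:
            if t_end - t_start > window_minutes:
                break  # later samples are even further away in time
            if (mem_end - mem_start) / mem_start > rise_fraction:
                return True  # the rule says: run procedure X
    return False


# Example readings (made up for illustration).
steady_climb = [(0, 1000), (10, 1100), (25, 1250)]
slow_growth = [(0, 1000), (40, 1300)]

print(memory_rule_fires(steady_climb))
print(memory_rule_fires(slow_growth))
```

The function returns the same answer whatever the cause of the rise. It has no notion of what else is happening on the system, which is the kind of context a good troubleshooter weighs without being able to write it down.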
It is this intuitive sense of what is going wrong that makes the best troubleshooters so effective. The key to quick problem resolution is to get the best, most relevant data into the hands of your best problem solver as soon as possible, and let them go to work. One good troubleshooter, in tune with the system and thinking about the problem, is generally much better than a room full of people executing troubleshooting steps by rote. To teach it to others, sit some people down next to him or her as they work, and let them take the reins at times.
Teaching and developing this intuition in people requires recognizing it and giving people the opportunities to practice. This is the essence of why troubleshooting is difficult to teach to other people. Each person must find their own style, and listen to their own inner voice about which pieces of data are relevant and which are less important. Only through experience can this develop, and all the documentation in the world will not turn someone who hasn't practiced this into a top-notch problem solver. They need to study next to someone who's good at problem solving, get lots of practice, and be given a chance to grow.