How to Troubleshoot Complex Systems
This is the first installment of troubleshooting techniques at SQL University. I am your host, Chuck Lathrope (@SQLGuyChuck), and I will be guiding you through some techniques I use every day. In coming articles, we’ll get into real-world examples from my work life as a DBA.
For definitions, I often head to Wikipedia. I found the article for troubleshooting to be quite good and suggest you go read it without getting distracted by the many external links, then come back here (very important). One term I use a lot is customer, which I define as someone who is looking to you to solve their need, whether this is an external paying customer or an internal company employee or group. Though I also use the singular you a lot, I rarely mean for you to resolve issues by yourself. Think of “you” as being your team; hopefully you’re lucky enough to have one.
To become an expert at troubleshooting, you need practice and you need to know when to call in the experts. I have been using computers since the Commodore 64 came out and have worked as an IT professional since 1990, but I still need to call for help. You can’t be an expert in everything, so don’t be afraid to call on others. Some experts I use are co-workers, Microsoft CSS, MSDN support forums, dedicated SQL site forums, and even Twitter.
You can practice troubleshooting on any problem, and I will guide you through the steps I have learned over a couple of decades. But before we get into the technical details, you should understand the soft details. What are soft details, you ask? Simply put, you need to go through a cost-benefit analysis of your problem and your ability to resolve it. Who is affected by the problem, how urgently do they need it resolved, what are your skills in the problem area, and what will it cost to start working toward a resolution? Quantify each possible resolution by how much time it will take you and how much time your customer is willing to wait. In short, does your customer have the time to wait for you to find and implement a fix, or do you need to call in the experts? You need to make this decision as quickly as possible, since time typically equals money lost. So if a problem has you stumped, you may decide to call the experts right away, but if you have time, you could investigate further first and call them only if you still don’t see a timely solution.
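To make that decision concrete, here is a minimal sketch in Python of the time comparison described above. The function name, the three outcome strings, and the idea of an expert lead time are my own illustrative assumptions, not a formal method; plug in whatever estimates fit your situation.

```python
def escalation_decision(hours_to_fix: float,
                        hours_customer_can_wait: float,
                        expert_lead_time_hours: float = 0.0) -> str:
    """Compare your own time estimate with how long the customer can wait.

    All three inputs are rough estimates. expert_lead_time_hours is a
    hypothetical parameter: how long it takes to get outside help engaged.
    """
    # Your estimate already exceeds what the customer can tolerate.
    if hours_to_fix > hours_customer_can_wait:
        return "call the experts now"

    # You fit inside the window, but with too little slack to bring in
    # outside help later if your own attempt fails.
    slack = hours_customer_can_wait - hours_to_fix
    if slack < expert_lead_time_hours:
        return "start on it, and line up the experts in parallel"

    # There is room to investigate before deciding.
    return "investigate first, escalate if still stuck"


print(escalation_decision(hours_to_fix=8, hours_customer_can_wait=4))
print(escalation_decision(hours_to_fix=2, hours_customer_can_wait=4,
                          expert_lead_time_hours=3))
print(escalation_decision(hours_to_fix=2, hours_customer_can_wait=8,
                          expert_lead_time_hours=1))
```

The numbers are only as good as your estimates, so treat the result as a prompt for a conversation with your customer rather than a verdict.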
There are some main tenets to live by for troubleshooting. These are must-do items for you to be an effective troubleshooter. If you fail to do one of them, double your estimated time to resolution.
SQLGuyChuck’s Tenets of Troubleshooting
- Know your environment – Test it, monitor it, understand it, document it. This is fairly obvious, but also the most crucial. In complex systems, you need to understand and document the system from end to end. In computing environments, whenever you perform a particular task in the same manner with the same equipment, you expect the same result every time. If the system deviates from that expected outcome, during testing or after implementation, you need to kick-start your troubleshooting steps. Many organizations create what are called “Runbooks”, which I recommend: they help with documentation and give you an easy-to-reference guide in case things go wrong.
- Track and document changes – The first thing that should come to mind is, “What changed?” This assumes you tested your system so it reliably produces the expected result. Do you have a change control board? If so, you may be documenting all your changes already, but if you don’t, you need to. In database environments, you may not notice an issue until a week or a month after the change that caused it. Look into automating change documentation (also called logging or change auditing). For example, I have implemented database triggers that log all schema changes to the databases I care about (SQL Server 2005 and later only).
- Listen to customers – They may be the only hint that there is a problem, and they can help eliminate possible problem sources.
- Be wary of the Web – Don’t trust everything you read on the internet, even if someone lists the exact same symptoms and how they solved their problem. Corroborate the solution with industry best practices. A typical example in the DBA world is troubleshooting issues with log files. There is a lot of bad advice out there on this subject, so be careful of the advice you listen to (hint: if Paul Randal said it, you can trust it).
- Research the solution – A fix can cause other problems, so find out what side effects it may have before you apply it.
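To illustrate the “What changed?” tenet, here is a small sketch of automated schema change detection. It is not the trigger-based approach mentioned above. It swaps in a simpler snapshot-and-diff technique, and it uses Python’s built-in sqlite3 module with an in-memory database purely so the example runs anywhere. The table, column, and index names are made up for the demonstration.

```python
import sqlite3
from typing import Dict, List


def schema_snapshot(conn: sqlite3.Connection) -> Dict[str, str]:
    """Map each schema object ("type:name") to the SQL text that defines it."""
    rows = conn.execute(
        "SELECT type || ':' || name, COALESCE(sql, '') FROM sqlite_master"
    ).fetchall()
    return dict(rows)


def diff_snapshots(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """List objects that were added, dropped, or altered between snapshots."""
    changes = []
    for key in sorted(after.keys() - before.keys()):
        changes.append(f"ADDED {key}")
    for key in sorted(before.keys() - after.keys()):
        changes.append(f"DROPPED {key}")
    for key in sorted(before.keys() & after.keys()):
        if before[key] != after[key]:
            changes.append(f"ALTERED {key}")
    return changes


# Demo database with a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
baseline = schema_snapshot(conn)  # the known-good state you documented

# Someone changes the schema without telling you.
conn.execute("ALTER TABLE orders ADD COLUMN shipped_at TEXT")
conn.execute("CREATE INDEX ix_orders_total ON orders (total)")

changes = diff_snapshots(baseline, schema_snapshot(conn))
for line in changes:
    print(line)
```

In practice you would store each snapshot with a timestamp so that a problem noticed weeks later can be matched to the change that preceded it. A snapshot diff only tells you what changed between two points in time, not who changed it or exactly when, which is where trigger-based logging has the advantage.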
If you have successfully implemented all the Tenets of Troubleshooting, you can follow the step-by-step guide below. If you haven’t, do your best to work them in as you go through the steps; otherwise it will just take longer to find a resolution.
SQLGuyChuck’s Step-by-Step Guide to Troubleshooting Complex Systems
- Define the problem – Be detailed. You need quantifiable measurements of the problem, so if you get a generic “it’s broken” from the customer, dig deeper until you know as much as possible about the problem.
- Reproduce – Maybe it was user error. Reproduce the problem or verify it actually did occur. If it was destructive, don’t try to reproduce, but find some evidence that it did occur. This may mean looking at log files (your Runbook should document where these are for your system), asking other customers of the system, or simply trusting the user that it did happen.
- Estimate costs of solutions – How much time do you think it will take for you to come up with a solution? If it’s too long for the customer, bring in the experts. If the customer can wait for you to try to fix it, you may still elect to get the ball rolling for procuring outside help; depending on your company’s procedures and availability of help, it may take a while.
- Break it down – It may be overwhelming to think of a huge complex system in its entirety, so break it down into discrete components or groups of related components. For a database system, you may have a web application, web server, network, middle-tier application, database server, database, security, and the data itself.
- Elimination – Eliminate components or groups of related components as sources of the problem, for example, those components of systems you know to be working just fine. Doing so helps narrow down where a problem is occurring.
- Identify the problem component – Once you have identified the problem component, you may need to break it down into smaller chunks to pinpoint the exact problem. Keep in mind there may be multiple causes of a given issue.
- Research the problem – Use your favorite search engine to find the error message.
- Discuss your solution with others – When you are not 100% confident your solution will solve the problem, or at least not hurt the system, talk it over with others, even if they don’t understand the issue at all. The act of saying it aloud and trying to give someone an understanding of the issue can often help you think it through more clearly, and your listener’s questions may remind you of aspects you skipped over when testing solutions.
- Plan and test your solution – If possible, this is very worthwhile. If the solution has potential to cause damage, test it on a similar non-production system first. Hopefully you have a dev or pre-production environment available.
- Plan and prepare for the worst – Expect things to go well, but cover your butt in case things go bad. In the database world, you have backups, and you have verified that your backup restores work, so you know you can get your critical data back if something goes unforeseeably wrong.
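The “Break it down” and “Elimination” steps above lend themselves to a small script. The sketch below, in Python, assumes you can write a quick health check for each component. The check functions here are hypothetical stubs standing in for real probes (an HTTP request, a ping, a test query), and the component names are taken from the database system example in the list.

```python
from typing import Callable, Dict, List


def find_suspects(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every health check and return the components that were not cleared."""
    suspects = []
    for component, check in checks.items():
        try:
            healthy = check()
        except Exception:
            # A check that raises cannot clear its component, so keep it as a suspect.
            healthy = False
        if not healthy:
            suspects.append(component)
    return suspects


def database_server_check() -> bool:
    # Hypothetical stand-in for a real probe, such as opening a connection.
    raise ConnectionError("login timeout expired")


# Hypothetical stubs: replace each one with a real probe for your system.
checks = {
    "web server": lambda: True,
    "network": lambda: True,
    "middle-tier application": lambda: True,
    "database server": database_server_check,
    "database": lambda: False,
}

suspects = find_suspects(checks)
print(suspects)
```

Every component that passes its check is eliminated, leaving a short list to break down further. Because there may be more than one cause, the function reports every failing component rather than stopping at the first.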
Once you have found and implemented the solution, document the change. I recommend creating a SharePoint list to document all your system changes because it is flexible, convenient, and free for Windows Server users. Also, if it makes sense, create processes to help prevent this kind of problem in the future, or create monitoring to alert you to this problem and link to your documentation on how you fixed it the last time.
I hope you enjoyed this article. Feel free to comment (login required because spammers suck).