How to Troubleshoot Example
In this second installment of troubleshooting techniques at SQL University, I will cover a couple of real-world situations that should give you some ideas for your next troubleshooting scenario. If you haven’t read my first class lecture, please do so now.
I had launched SQL 2008 Enterprise on an active-active-passive cluster configuration just a few days before our NOC team called me about hundreds of error messages in our error reporting. The first thing I did was jump into Management Studio (SSMS) and run one of my favorite locking scripts, from SQL MVP Erland Sommarskog, which showed a -2 blocking spid and many other spids with WRITELOG waits. (I copy the script output to Excel for later review every time I use it.) Search engines don’t do well with a query like “spid -2 blocking,” so that didn’t get me very far, and since pretty much everything had changed with our migration to SQL 2008, we had to figure it out for ourselves. One of our developers found that a remote server was erroring badly and fixed that issue, and the blocking went away. I still didn’t have a clue what a -2 blocking spid meant, but I prayed it wouldn’t come up again until after the SQL PASS conference, where I could ask my friend Bob Ward on the CSS team about this new issue. At the previous PASS conference he had helped me with a -4 blocking spid (tempdb contention), and this year he did not let me down: he looked it up and found it comes from a lost (orphaned) DTC transaction.
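If you hit a -2 blocking spid yourself, one way to confirm it is an orphaned DTC transaction is to pull its unit-of-work (UOW) GUID from the lock DMVs; `KILL` accepts that GUID in place of a spid. A minimal sketch (the GUID shown is a placeholder, not a real value):

```sql
-- Find locks held by the orphaned DTC session (spid -2) and get its
-- unit-of-work GUID; request_owner_guid is only meaningful for DTC
-- transactions.
SELECT DISTINCT request_owner_guid
FROM sys.dm_tran_locks
WHERE request_session_id = -2;

-- For an orphaned DTC transaction, KILL takes the UOW GUID instead of
-- a spid (substitute the GUID returned above).
-- KILL 'D5499C66-E398-45CA-BF7E-DC9C194B48CF';
```

Killing the UOW is a last resort; in our case the fix on the remote server cleared the transaction on its own.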
The WRITELOG waits were still causing us grief, and Bob did mention that writing to the log file would never itself be a blocking action. So now I had to find the cause of the WRITELOG delays. Looking at the SQL Server Error Log, I found a trend: around 4am there was a batch of IO stall messages of the form “SQL Server has encountered 450 occurrence(s) of IO requests taking longer than 15 seconds to complete on file [D:\...]”. This message means exactly what it says: 450 IO requests against that file each took longer than 15 seconds to complete. That is horrible! Typically you want to see <15ms, not 15,000ms.
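You don’t have to wait for the error log to report stalls; SQL Server keeps cumulative per-file IO statistics in a DMV. A sketch of the kind of latency check I mean (the column math and the ~15ms rule of thumb are my own framing, not a specific published script):

```sql
-- Average write latency per database file from cumulative stats.
-- Anything far above ~15ms on a transaction log file deserves a look.
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_writes,
       CASE WHEN vfs.num_of_writes = 0 THEN 0
            ELSE vfs.io_stall_write_ms / vfs.num_of_writes
       END AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id
ORDER BY avg_write_latency_ms DESC;
```

Note these counters accumulate since instance startup, so a 4am stall can be diluted in the averages; sampling the DMV twice and diffing gives a truer picture of a specific window.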
The first step in troubleshooting was complete: we had determined that a -2 blocking spid caused issues and knew where it came from, and we had an IO issue with writing to the log file. We couldn’t really reproduce the issue other than waking up at 4am to watch what was going on, or scheduling Profiler to dump to a file at that time. Thinking back to my tenets of troubleshooting, the first is to know your environment. I had built the servers, configured them on a dedicated SAN, done tons of SQLIO performance testing to make sure everything looked good, and run SQLIOSTRESS to validate the configuration before putting them into production. The second tenet is to document all changes. Well, for us pretty much everything had changed, so that wasn’t going to help. The third tenet, listening to customers, also didn’t help, as they were just complaining that everything was slow, which wasn’t anything we didn’t already know.
Since IO issues were still a problem, we needed to evaluate possible solution paths: call in the experts, take the time to find the heavy IO consumers and fix them, or leave things as they were and hope it was a one-time occurrence. The last option was thrown out right away, as this had happened twice in two days around 4am. We decided to take the time to figure it out ourselves, as we are pretty smart guys and know our system well.
The next thing we did was break the system down into smaller pieces to see if we could weed out any component as a possible culprit: the SQL cluster, the HBAs (fiber-channel cards that let the servers talk to the SAN), the SAN itself, SQL code, and SQL jobs. SQL code was thrown out, or at least moved to the bottom of the pile, because code changes weren’t part of the migration. The SAN and HBAs were well tested, but not with the workload that was hitting them now, so they moved near the bottom of the list. We checked the SQL jobs and found no overlaps among the big IO hitters.
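One way to check for job overlaps is to pull start times and durations out of msdb and eyeball them side by side. A rough sketch, assuming the `msdb.dbo.agent_datetime` helper function that ships with SQL Server Agent (and noting that `run_duration` is encoded as HHMMSS, not seconds):

```sql
-- Job outcome rows (step_id = 0) with decoded start times, so
-- overlapping IO-heavy jobs in a given window stand out.
SELECT j.name AS job_name,
       msdb.dbo.agent_datetime(h.run_date, h.run_time) AS start_time,
       h.run_duration          -- encoded HHMMSS
FROM msdb.dbo.sysjobhistory AS h
JOIN msdb.dbo.sysjobs AS j
  ON j.job_id = h.job_id
WHERE h.step_id = 0
ORDER BY start_time;
```

Sorting by start time made it easy for us to see that no two big hitters shared the 4am window, which is why the jobs dropped off our suspect list.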
Next we turned to our monitoring in SCOM. The reports showed heavy IO happening on one of the disk groups, so I shifted some of the data and log files to another disk group with double the number of spindles; that should have helped if the original group simply didn’t have enough IO capacity. But the next day the same thing happened. The next step was to discuss the system with our storage team lead, and we came to the realization that we were not looking at the entire picture! The one thing we had missed was the use of near-line disks on the SAN (slow SATA disks for long-term storage). We found that two database restores were overlapping, and some jobs were kicked off during that time period as well. Near-line disk does not handle multiple concurrent threads well the way SCSI does. Adding the missing SCOM reporting showed that these disks were indeed getting a lot of work around the 4am time period.
Our solution was simple: shift the restores so they no longer ran at the same time, move databases to faster disks, and set up SCOM alerting for slow IO waits. I planned out the scripts involved in moving the database files, tested the syntax, and made the emergency change during off-hours. There have been no calls since the change, so we found our culprit. The final root cause: by overloading the near-line disks with tons of IO writes, we cache-starved our SAN so that nothing else could use write caching, causing slowness for every server using the SAN.
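For anyone planning a similar move, relocating a file is mostly a metadata change plus an OS-level copy. A hedged sketch with placeholder database, logical file, and path names (yours will differ):

```sql
-- Take the database offline so the file can be moved safely.
ALTER DATABASE SalesDB SET OFFLINE WITH ROLLBACK IMMEDIATE;

-- Point SQL Server at the new location (metadata only; the file is
-- not moved for you).
ALTER DATABASE SalesDB
MODIFY FILE (NAME = SalesDB_log,
             FILENAME = 'F:\FastDiskGroup\SalesDB_log.ldf');

-- Copy the .ldf to the new path at the OS level here, then:
ALTER DATABASE SalesDB SET ONLINE;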
I hope you enjoyed the long read.