Tuning Diagnostic Queries

Most of the time when we talk about tuning, it’s related to improving the performance of a production workload. Maybe (hopefully) we’ve gone as far as tuning more “hidden” code like stored procedures, triggers, and the black boxes that are user-defined functions. But what about the other queries we run against our production servers?

Maybe we use third-party management/monitoring software, or built-in tools like Activity Monitor or Management Data Warehouse (aka Data Collector) that periodically check up on our servers. In these cases, if we identify these systems as a performance problem, we’re pretty much stuck with either reducing the polling frequency, abandoning the use of the system entirely, or pestering the vendor until the problem is fixed. That’s pretty straightforward.

There’s another class of queries we run against our production servers, though: diagnostic queries. If a server is in performance trouble, or we need to go troubleshooting for blocking, or even if we’re proactively searching for issues, we automatically whip out our trusty folder of scripts and start running queries to narrow things down; queries that could potentially make the problem worse, either by adding extra load to the server or by adding unnecessary time to the troubleshooting process.

It’s impossible, of course, to eliminate the need for the database engine to do some processing to return the results of our queries (the observer effect), but how much impact are our queries having on the server? Maybe in the grand scheme of things, not a lot. During a troubleshooting session, however, things are very different: people are anxious for answers, the problem needs to be solved now, and we’re working as hard as we can to not only fix the problem, but above all else, to not make things worse.

Now, I’m going to give an example, and I have to apologize in advance to Paul Randal and/or Glenn Berry because I’m about to single them out. Sorry, guys.

A good while back, Paul posted an excellent diagnostic query to analyze wait stats (note: I never saw Glenn’s original query), and I had stolen borrowed used it to troubleshoot some issues in the environment I work in. Of course, I had played around with it before that, to see whether it was indeed a useful script to keep around. But when I ran it against the production server in peril, I was floored by how long it took to return the results I needed to see. I was also puzzled that after I reset the wait stats and ran the query again a few seconds later, the results… just didn’t look right for what I was expecting. What was going on? It was just a relatively simple SELECT statement from a DMV, which is usually pretty quick. I turned on STATISTICS IO and STATISTICS TIME and ran the query again.
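(For anyone following along, these were the knobs involved; generic commands, not part of Paul’s script itself.)

-- Clear the cumulative wait stats (careful: this resets the server-wide counters)
DBCC SQLPERF ('sys.dm_os_wait_stats', CLEAR);

-- Show per-statement I/O and timing information for the current session
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

Here’s what came back from the wait stats query itself: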

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

(16 row(s) affected)
Table 'Worktable'. Scan count 7, logical reads 4664, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 812 ms,  elapsed time = 3469 ms.

Shock and/or awe. I think we found the problem! The diagnostic query was consuming a tremendous amount of resources just to tell us that the system was having its resources consumed tremendously.

Well, we’ve got to do something about that, haven’t we? Ultimately the query got pulled apart and rebuilt using a pre-computation approach involving table and scalar variables, and this is where it ended up:

SQL Server parse and compile time: 
   CPU time = 16 ms, elapsed time = 22 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.
Table '#58739F6F'. Scan count 0, logical reads 92, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 1, logical reads 192, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 15 ms,  elapsed time = 3 ms.
Table '#58739F6F'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

(16 row(s) affected)
Table '#58739F6F'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 1 ms.

Much better. (You can find the modified script on the Scripts & Code Resources Page. Note that it doesn’t do exactly the same thing as the original.)
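To give a sense of the general shape of that pre-computation approach, here’s a minimal sketch. It assumes the standard sys.dm_os_wait_stats columns and is emphatically not the modified script from the resources page: snapshot the DMV into a table variable once, pre-compute the grand total into a scalar variable, and then do the percentage math against the handful of rows in the snapshot.

-- Illustrative sketch only: capture the DMV once, then work off the snapshot.
DECLARE @waits TABLE
(
    wait_type            nvarchar(60) NOT NULL,
    wait_time_ms         bigint       NOT NULL,
    signal_wait_time_ms  bigint       NOT NULL,
    waiting_tasks_count  bigint       NOT NULL
);

INSERT INTO @waits (wait_type, wait_time_ms, signal_wait_time_ms, waiting_tasks_count)
SELECT wait_type, wait_time_ms, signal_wait_time_ms, waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'BROKER_TO_FLUSH'); -- benign waits; list heavily abbreviated

-- Pre-compute the grand total once into a scalar variable...
DECLARE @total_wait_ms bigint;
SELECT @total_wait_ms = SUM(wait_time_ms) FROM @waits;

-- ...so the ranking and percentage math never touches the DMV again.
SELECT TOP (16)   -- row count here is arbitrary
       wait_type,
       wait_time_ms,
       signal_wait_time_ms,
       CAST(100.0 * wait_time_ms / NULLIF(@total_wait_ms, 0) AS decimal(5, 2)) AS pct_of_total
FROM @waits
ORDER BY wait_time_ms DESC;

The win is that sys.dm_os_wait_stats is read exactly once, and everything downstream operates on a small local snapshot instead of going back to the DMV for each calculation.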

So what can we learn from this?

  • While DMVs themselves are usually fast to query, doing further analysis on the results can be expensive, depending, of course, on what we do with them.
  • Test diagnostic queries for performance just as much as for functionality.
  • Fully test diagnostic queries long before they get thrust into the heat of battle. This is sword-sharpening for DBAs.
  • For the scripts that just… perform poorly… and there’s nothing more we can do about it, add comments to the script body that say as much. Whoever is about to run the script should be made aware that the query will take some time to finish. The warning prevents situations like worrying about whether or not it’s actually working, or it may even alert the person not to run the script at all and to use a different/similar method instead (see the example just below).
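As a trivial, purely illustrative example, even a header like this sets expectations before anyone hits Execute:

/*
    WARNING: this diagnostic query aggregates and sorts DMV output and can
    take several seconds of CPU on a busy server. Don't panic if it doesn't
    return immediately; if the server is already on fire, consider a
    lighter-weight alternative first.
*/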