Feb 20, 2013

Startup Expression Predicates

When we write T-SQL statements, what we're really doing is describing what data to return. It's then up to the internals of SQL Server to decide how to return the data we asked for as efficiently as possible.

Sometimes, there's extra information we know about, but that SQL Server doesn't (automatically). Letting SQL Server in on this seemingly redundant information can change how efficiently the data is accessed and returned.

In this post, we'll walk through a simple parent/child example that exploits a partially denormalized table schema to improve join performance to the child tables. The performance improvement comes from SQL Server producing query plans that contain Startup Expression Predicates, which effectively prevent certain parts of the query plan from executing in some cases.

 

Test Setup

The first thing we need to do is set up the tables. We'll need a ProductTypes table, a parent table (Products) and two child tables (ItemProducts and ServiceProducts).

CREATE TABLE [dbo].[ProductTypes]
(
	Id tinyint NOT NULL PRIMARY KEY,
	Description varchar(50) NOT NULL
);

CREATE TABLE [dbo].[Products]
(
	Id int NOT NULL PRIMARY KEY,
	ProductTypeId tinyint NOT NULL
		FOREIGN KEY REFERENCES [dbo].[ProductTypes](Id)
);

CREATE TABLE [dbo].[ItemProducts]
(
	ProductId int NOT NULL PRIMARY KEY
		FOREIGN KEY REFERENCES [dbo].[Products](Id),

	ItemColumn int NOT NULL
);

CREATE TABLE [dbo].[ServiceProducts]
(
	ProductId int NOT NULL PRIMARY KEY
		FOREIGN KEY REFERENCES [dbo].[Products](Id),

	ServiceColumn int NOT NULL
);

In this type of design, there will only ever be a single row in one of the child tables for each row in the parent table. This is typically handled by some form of business logic (stored procedures or views) and enforced by constraints (a sketch of which follows), but I want to keep this example simple, so I'm mentioning it only for completeness, and to show what the data is going to "look" like.
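For the curious, here's a rough sketch of how the constraint-based enforcement might look, using the common supertype/subtype pattern. The constraint names are mine, and this isn't part of the setup for the rest of the post; ServiceProducts would get the analogous constraints with ProductTypeId = 2.

ALTER TABLE [dbo].[Products]
	ADD CONSTRAINT UQ_Products_Id_ProductTypeId UNIQUE (Id, ProductTypeId);

ALTER TABLE [dbo].[ItemProducts]
	ADD ProductTypeId tinyint NOT NULL
		CONSTRAINT DF_ItemProducts_ProductTypeId DEFAULT (1),
	/* Pin child rows to the correct product type... */
	CONSTRAINT CK_ItemProducts_ProductTypeId CHECK (ProductTypeId = 1),
	/* ...and force (ProductId, ProductTypeId) to agree with the parent row. */
	CONSTRAINT FK_ItemProducts_Products_ProductTypeId
		FOREIGN KEY (ProductId, ProductTypeId)
		REFERENCES [dbo].[Products](Id, ProductTypeId);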

Okay, let's add some test data so we can run some queries:

INSERT INTO [dbo].[ProductTypes](Id, Description)
	VALUES
		(1, 'Item'),
		(2, 'Service');
		
INSERT INTO [dbo].[Products](Id, ProductTypeId)
	VALUES
		(1, 1),
		(2, 2);
		
INSERT INTO [dbo].[ItemProducts](ProductId, ItemColumn)
	VALUES (1, 50);
	
INSERT INTO [dbo].[ServiceProducts](ProductId, ServiceColumn)
	VALUES (2, 40);

Now we have rows representing one ItemProduct, and one ServiceProduct.

 

Querying the Data

First let's start by looking at a typical query that might be run against these tables:

SELECT
	p.Id AS ProductId,
	p.ProductTypeId,
	COALESCE(ip.ItemColumn, sp.ServiceColumn) AS OtherColumn
	FROM [dbo].[Products] p
	LEFT OUTER JOIN [dbo].[ItemProducts] ip WITH(FORCESEEK) ON
		ip.ProductId = p.Id
	LEFT OUTER JOIN [dbo].[ServiceProducts] sp WITH(FORCESEEK) ON
		sp.ProductId = p.Id;

(Note: the hints aren't something you'd normally use here, but they're needed for demonstration purposes; I got a nested loops/table scan plan by default. See the final section of this post for some extra discussion.)

Since each product row will only exist in one of the child tables, we have to use LEFT joins to get any results. The query plan looks like this:

We can see that for each row in the Products table, SQL Server must join to both child tables in case there are rows there. There legitimately could be, as the only thing preventing that is our business logic. SQL Server doesn't understand that, so it has no choice but to ensure correctness and do the extra work.

Here's where the magic comes in. We know that for a given ProductTypeId, rows will only exist in one of the child tables. If SQL Server knew that, then it would only have to join to one child table for each row in Products.

Let's try this query:

SELECT
	p.Id AS ProductId,
	p.ProductTypeId,
	COALESCE(ip.ItemColumn, sp.ServiceColumn) AS OtherColumn
	FROM [dbo].[Products] p
	LEFT OUTER JOIN [dbo].[ItemProducts] ip WITH(FORCESEEK) ON
		(ip.ProductId = p.Id) AND
		(p.ProductTypeId = 1) /*****/
	LEFT OUTER JOIN [dbo].[ServiceProducts] sp WITH(FORCESEEK) ON
		(sp.ProductId = p.Id) AND
		(p.ProductTypeId = 2); /*****/

Now we're telling SQL Server something about our business logic. Let's see if this improves the execution plan:

 

That's better. SQL Server has added two Filter operators -- one for each child table -- that reject rows that don't satisfy the Startup Expression Predicate (in other words, the extra business logic we told SQL Server about). This results in only a single seek against the proper child table for each row in the Products table, which can be a big performance boost: with m child tables and n parent rows, this approach always executes exactly n seeks (making the number of seeks independent of the number of child tables), instead of the m*n seeks of the first approach. With 5 child tables and 1,000,000 parent rows, for example, that's 1,000,000 seeks instead of 5,000,000. This does come at the cost of storing enough denormalized information (ProductTypeId in this case) to drive the process, but usually that's not a huge hit (most likely 1 byte per row in the parent table).

As a bonus, here's a different approach to writing the same query. This form may be more appropriate for some things, depending on what you're trying to do (one caveat: CROSS APPLY eliminates parent rows that have no matching child row, unlike the LEFT joins above, so use OUTER APPLY instead if such rows must be preserved):

SELECT
	p.Id AS ProductId,
	p.ProductTypeId,
	a.OtherColumn
	FROM [dbo].[Products] p
	CROSS APPLY
	(
		SELECT
			ItemColumn AS OtherColumn
			FROM [dbo].[ItemProducts] ip
			WHERE
				(ip.ProductId = p.Id) AND
				(p.ProductTypeId = 1) /*****/
				
		UNION ALL
		
		SELECT
			ServiceColumn
			FROM [dbo].[ServiceProducts] sp
			WHERE
				(sp.ProductId = p.Id) AND
				(p.ProductTypeId = 2) /*****/
	) a;

And here is the resulting query plan that contains the Startup Expression Predicate Filter operators:

 

Conclusion

Sometimes giving SQL Server more information than you might think is necessary can help to improve the query plans that are generated. Certainly in cases like this parent/child example, we were able to exploit a denormalized ProductTypeId column to drive the index seeks to the child tables, and make the query scale much better. The result in this case was that the total number of seeks against the child tables became independent of the number of child tables, while still retaining the original query logic. Look for opportunities like this in your queries to give SQL Server extra hints about your table schema -- you can be rewarded with more scalable queries.

 

More?

As I was playing around with these examples (the second query in particular), I found it interesting that if the plan used a scan operator as the lower input of the nested loops join (as I got by default, without the FORCESEEK hints), there were no startup expression predicates to be found (nor Filter operators). Instead, the predicate ends up on the nested loops operator itself, and each child table is scanned for every upper-input row. This is somewhat puzzling, as I can't think of a reason why the lower input couldn't be protected by a startup expression in that scenario as well. (Note: I only tested on a 2008 R2 RTM instance.)

Jan 22, 2013

Timeouts vs. blocking vs. deadlocks

These three issues are all related, yet very different. When we deal with reports of problems, it's essential to clearly identify what is actually happening. Sometimes we need to seek clarification by asking about the symptoms, instead of accepting what the user says without delving any deeper. The root cause of a problem may not even be what the user is experiencing directly, or what they claim they're experiencing.

In this post, we'll cover the similarities and differences of timeouts, blocking, and deadlocks, how they manifest themselves to an end user, and how they may cause, or lead to, each other.

 

Blocking

Blocking occurs when a process attempts to acquire a lock on a resource, but has to wait in the locking queue before the lock is granted.

From the outside, it appears that the process is doing nothing, when in fact it's waiting for the other process(es) to release the lock(s) ahead of it in the queue.

If a process is blocked for too long, this can lead to timeouts, and depending on the order in which processes acquire their locks, blocking can lead to deadlocks.

The thing to remember is that blocking occurs all the time to synchronize access to resources. It only becomes a problem when the amount of time a process spends waiting becomes excessive, either because it's causing timeouts, or because it's causing a user to wait for more time than they expect is normal. Blocking may also be experienced as a "frozen" application. Users may not complain about blocking until their work is impeded enough to frustrate them into taking action to get the problem fixed.
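To make the mechanics concrete, here's a minimal two-session sketch (the table and values are hypothetical):

/* Session 1: take an exclusive lock and hold it by leaving the
   transaction open. */
BEGIN TRANSACTION;
UPDATE [dbo].[Accounts] SET Balance = Balance - 50 WHERE Id = 1;

/* Session 2: under the default READ COMMITTED isolation level, this
   SELECT waits in the locking queue -- from the outside, the session
   appears to be doing nothing. */
SELECT Balance FROM [dbo].[Accounts] WHERE Id = 1;

/* Session 1: releasing the lock allows session 2 to complete. */
COMMIT TRANSACTION;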


Timeouts

Timeouts occur when a client application makes a request for a resource, and while waiting for a response to be returned, decides things have taken too long, and stops waiting. This usually results in an error being raised by the client API.

Timeouts occur for a variety of reasons, including blocking, the request needing to do a ton of work on the server, or maybe the network is saturated or simply really slow. There are millions of reasons, all ultimately because the client was waiting, and gave up after a certain period of time (the timeout threshold).

An important thing to note about timeouts is that they do not cause other problems (except if the error that's raised is handled incorrectly). In other words, timeouts are only a symptom of another problem; to solve the timeout issue, solve the issue that caused the timeout.

Since timeouts always depend on something else happening (or not happening), they can be difficult to reproduce in a testing environment. Without sufficient tracing/logging mechanisms in place, it can be difficult to track down the source of the problem. Users will usually be able to tell you that something is timing out, and probably what they were doing at the time, but finding the cause often requires some digging.

Deadlocks

Deadlocks occur when two or more processes hold locks on resources (i.e., block each other), and also try to take locks on resources held by the other process(es).

This creates a situation such that neither process can continue unless one of them is terminated by another external process. In SQL Server, this external process is called the Deadlock Monitor, and upon detecting a deadlock, it summarily rolls back one or more of the queries to resolve the situation.
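Here's a minimal two-session sketch of that situation (again, the tables and values are hypothetical):

/* Session 1 */
BEGIN TRANSACTION;
UPDATE [dbo].[Accounts] SET Balance = 0 WHERE Id = 1;

/* Session 2 */
BEGIN TRANSACTION;
UPDATE [dbo].[Orders] SET Total = 0 WHERE Id = 1;

/* Session 1: blocks, waiting for session 2's lock on Orders. */
UPDATE [dbo].[Orders] SET Total = 1 WHERE Id = 1;

/* Session 2: blocks on Accounts -- each session now waits on the other.
   The Deadlock Monitor picks a victim and rolls it back with error 1205. */
UPDATE [dbo].[Accounts] SET Balance = 1 WHERE Id = 1;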

I demonstrated how deadlocks are created in a demo video in a previous post. As part of setting up the demo, I showed how blocking works in relation to deadlocks.

By definition, deadlocks are caused in part by blocking, and in part by the order in which the locks were acquired. Usually it's very clear to users when deadlocks occur because of the error messages. In some cases, deadlocks can lead to timeouts if the external monitoring process takes too long to pick up on the fact there is a deadlock in the system (the deadlocked processes are blocked and waiting).

Fixing deadlocks is often an exercise in information gathering, because they normally aren't reproducible except under specific circumstances. The key is setting up a tracing/logging solution to record the pertinent information when deadlocks happen, so you can analyze and fix the problems later. I explained how to do that in a previous post as well. A more proactive solution would be to determine the "hot" areas of the application, and ensure that locks are acquired in the same order in every piece of code that accesses that area.

While well-written applications may attempt to retry the operation in progress when a deadlock occurs, the question of why the deadlock occurred in the first place must still be asked and answered; tracing/logging or other solutions still apply.

 

As you can see, there are relationships between blocking, timeouts, and deadlocks. The next time you deal with an error report that involves one of these three phenomena, I encourage you to dig deeper, and put in place processes that either prevent the problem from occurring in the first place, or simply record enough information to be able to fully solve the problem later.

Jan 3, 2013

Globalizing applications, in any locale

I was watching a car review on YouTube the other day, and the host made the point that the voice recognition system only understood names as they're pronounced according to the locale of the infotainment software (U.S. English in this case). Moreover, he said the user should think about how this process works in order to use the system more effectively.

Now, that's a great idea, and it kinda-sorta works, but it's not very user-friendly. Users shouldn't be expected to understand how a system functions internally for it to be usable. Many people know how to drive, yet they may not understand exactly how a car works.

These are the kinds of issues we encounter when developing globalized applications. How do we bridge the gap between what the user understands, and what the computerized system understands?

I think it comes down to assumptions. In other words, the computerized system assumes something, while the user either isn't aware of that assumption, or doesn't provide enough input (or is never prompted for input) for the system to do anything except make an assumption.

In the case of the voice recognition system, the system assumes the user's contact names are pronounced based on the locale of the system. This sounds like a reasonable assumption, and for the most part, it works. When thinking about it in the context of an increasingly globalized society, though, it's becoming a less and less valid assumption. If the system is only single-locale-aware, there's no other choice for it to make, just given a list of names.

While giving the user the ability to choose a different default locale for the system probably helps quite a bit, it doesn't solve the problem of the inevitable exceptions to the default.

If we think about the data available, is there any way we could help the system better understand the exceptions?

  • We could add a property to each name that specifies the locale in which to interpret it. This would solve the problem of complete misinterpretation as shown in the car review. It doesn't, however, solve the problem of idiosyncrasies within a locale (e.g., "Smith" and "Smythe" could be pronounced the same, and in more than just one way), and it would make the software more complex (although it could potentially save on costs, because the same bits could be deployed to all locales, just with different default settings).
  • Another approach would be to allow the user to input a phonetic version of the name, so that the system can understand the pronunciation based only on the single default locale. This would solve the problem of exceptions, and the issue of same-pronunciation-different-spelling mentioned in the previous point. While the phonetic data is assumed to be of the same locale as the names themselves, this is probably an acceptable drawback for most applications. (A schema sketch combining both ideas follows this list.)
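As a database-design sketch only (the table and column names here are hypothetical), the two options might be captured like this, with NULL meaning "fall back to the system default":

CREATE TABLE [dbo].[Contacts]
(
	Id int NOT NULL PRIMARY KEY,
	DisplayName nvarchar(100) NOT NULL,

	/* Option 1: the locale in which to interpret/pronounce the name. */
	NameLocale varchar(20) NULL,

	/* Option 2: a phonetic respelling in the system's default locale. */
	PhoneticName nvarchar(100) NULL
);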

With globalization, other factors come into play, too:

  • The characters accepted by the system. For example, if the system's default locale is Chinese, is it going to allow the entry of names using English characters, and how would the system understand a phonetic spelling of that name?
  • What about differences between languages, where some syllable sounds exist in one language but not the other?
  • How would efficient searching of the same data be accomplished for multiple locales?
  • How much extra effort (i.e., cost and time) is required to add a reasonably globalized solution for the specific application?

As you can probably tell, these are not necessarily problems that are easily and effortlessly solved, and I'm not going to claim for a second that either of the two approaches I mentioned will solve all application issues related to globalization. Since every application is different, though, they may offer a good solution, or at least a reasonable compromise -- either way, an improvement in usability, which was the original goal.

As database designers, we must be aware of how globalization affects which pieces of data we store, how they are arranged/sorted for efficient searching, and how users will be interacting with the data. It's also important to think about designing database structures such that they can easily be enhanced to accommodate a globalized application, even if the application will not support that kind of scenario in Version 1. It's coming; be prepared for it.

Dec 20, 2012

Collation of character fields in temporary objects

It's a common practice to use scalar variables, table variables, and temporary tables when developing a database application. If these transient objects contain (or are) character fields, which collation do they have if a collation isn't explicitly specified?

 

Test Setup

The idea of the test is to demonstrate how collation conflicts are possible when using temporary objects.

First, we'll create a new database with a collation that differs from the instance collation. I've chosen one randomly to make it easy to pick out for the purposes of demonstration. The collation of the instance I'm testing on is SQL_Latin1_General_CP1_CI_AS, which is probably the most common.

CREATE DATABASE CollationTest COLLATE Thai_CS_AI;
GO

USE [CollationTest];
GO

Next, we'll create a base table with a couple of character fields in it, and a row of data to test with:

CREATE TABLE [dbo].[Table1]
(
	/* Inherit the default collation */
	Col1 varchar(20),

	/* Same as the instance collation */
	Col2 varchar(20) COLLATE SQL_Latin1_General_CP1_CI_AS
);
GO

INSERT INTO [dbo].[Table1](Col1, Col2) VALUES (N'a', N'b');
GO

Let's verify Col1's collation:

SELECT
	collation_name
	FROM sys.columns
	WHERE
		(object_id = OBJECT_ID(N'[dbo].[Table1]')) AND
		(name = N'Col1');

Running that query returns Thai_CS_AI, which was inherited from the database collation.

 

Scalar Variables

It turns out that there's no way to get a collation conflict using a scalar variable unless we explicitly use COLLATE on both sides of the comparison expression, like so:

DECLARE @v1 varchar(20) = 'B';

SELECT * FROM [dbo].[Table1]
	WHERE Col2 COLLATE Thai_CI_AI = @v1 COLLATE Thai_CS_AI;

The reason is collation precedence: scalar variables (and string literals, for that matter) are always considered to be coercible-default. In other words, the comparison is done using the collation of the field on the other side of the expression:

DECLARE @v1 varchar(20) = 'B';

SELECT * FROM [dbo].[Table1] WHERE Col2 = @v1;

The above query returns a row because the comparison used the case-insensitive SQL_Latin1_General_CP1_CI_AS collation.

It's interesting to note that the collation of a scalar variable can't be explicitly set in its definition; it will always inherit the database collation.
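A quick sketch of that (the exact error message varies by version, so I won't quote it):

/* This fails -- COLLATE isn't allowed on a scalar variable declaration: */
DECLARE @v1 varchar(20) COLLATE Thai_CS_AI;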

 

Table Variables

Since we know that table variables are actually backed by tempdb, we can now run some tests where the collation of the object isn't always clobbered by the field on the other side of the comparison.

Let's try comparing two fields that have the same collation label (implicit):

DECLARE @t table (Col1 varchar(20));

SELECT * FROM [dbo].[Table1] a INNER JOIN @t b ON b.Col1 = a.Col1;

This succeeds. We're comparing a Thai_CS_AI field in the base table with a table variable whose field Col1 has inherited the same collation from the database.

Now let's try two different collations by comparing against Col2 in the base table, and see what happens:

DECLARE @t table (Col1 varchar(20));

SELECT * FROM [dbo].[Table1] a INNER JOIN @t b ON b.Col1 = a.Col2;

Now we get the error, "Cannot resolve the collation conflict between "SQL_Latin1_General_CP1_CI_AS" and "Thai_CS_AI" in the equal to operation." In this case, the two collations are different and at the same precedence level. SQL Server can't simply give one collation precedence over the other, and it throws up its hands in disgust.

 

Temporary Tables

As mentioned in the link in the previous section, there are many similarities and differences between table variables and temporary tables. Let's find out where character field collation fits in. We'll run the exact same tests as in the table variable section, using a temporary table instead.

CREATE TABLE #t (Col1 varchar(20));

SELECT * FROM [dbo].[Table1] a INNER JOIN #t b ON b.Col1 = a.Col1;
GO

DROP TABLE #t;

Now we get a collation conflict error. What? It turns out that temporary tables are different from scalar and table variables in this respect -- the field collations now inherit from tempdb instead of from the current database. (Edit 2013-02-10: Thanks to Aaron Bertrand for pointing out that temporary tables created in the context of a contained database inherit the collation from the current database, not from tempdb.)

Running the second test now succeeds because both fields have an implicit collation of SQL_Latin1_General_CP1_CI_AS (since that is my instance, and hence tempdb, collation):

CREATE TABLE #t (Col1 varchar(20));

SELECT * FROM [dbo].[Table1] a INNER JOIN #t b ON b.Col1 = a.Col2;
GO

DROP TABLE #t;
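If you want to see where the collation came from directly, the same kind of metadata query from the setup section works against tempdb while #t exists (a sketch; on my instance this returns SQL_Latin1_General_CP1_CI_AS):

CREATE TABLE #t (Col1 varchar(20));

SELECT
	collation_name
	FROM tempdb.sys.columns
	WHERE
		(object_id = OBJECT_ID(N'tempdb..#t')) AND
		(name = N'Col1');

DROP TABLE #t;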

 

DATABASE_DEFAULT

Because of this crazy behaviour with temporary tables, a virtual collation DATABASE_DEFAULT was created to say that the field should use the collation of the current database, instead of inheriting from the database where the field was created.

Here's the first query from the previous section again, with the special collation added:

CREATE TABLE #t (Col1 varchar(20) COLLATE DATABASE_DEFAULT);

SELECT * FROM [dbo].[Table1] a INNER JOIN #t b ON b.Col1 = a.Col1;
GO

DROP TABLE #t;

And this succeeds like we would expect.

 

Conclusion

When working with table variables and temporary tables, it's really important to know the similarities and differences between the two. In the case of character field collations, simply switching between the two types of tables (which is often done when doing performance tuning) can potentially have bigger side-effects than were intended!

Even though character fields in table variables always inherit the collation of the current database, I would strongly consider explicitly specifying the DATABASE_DEFAULT collation on those fields anyway (assuming this is appropriate), so that the data/business logic will still be the same if someone decides to go and switch things up when performance tuning.

Dec 18, 2012

T-SQL Code Review Checklist

In software development, code review is a simple and effective way to reduce the number of bugs that creep in. Merely having a second pair of eyes look at a piece of code can often reveal logic that was implemented incorrectly or weakly, or that misses use cases that were never considered. (Test-driven development is a separate discussion entirely; T-SQL unit testing is not the easiest thing in the world to do, and compromises sometimes have to be made, usually due to time constraints or politics.)

There are many reasons why code review is a good practice, and I'm not going to get into them here. You can use Google as well as I can, or you can start by reading some of these questions on Programmers Stack Exchange. The purpose of this post is to consolidate many of the different things I look at when doing code review, or even when I'm looking over my own code. While this won't be a complete list, and the items will vary somewhat depending on your company's environment, it should be a good start.

In no particular order:

  • Syntax, including version-specific language use. This seems really obvious, but it can be easy to slip up. For example, if you normally do your development on a 2008+ server and the application still has to support 2005, it's easy to write something like DECLARE @i int = 0 and only have it get caught during testing (see the first sketch after this list). (Testing is done against all versions of the database engine that are supported by the application, right? ...right?) In terms of outright syntax, pay particular attention to T-SQL that's contained in string variables, meaning dynamic SQL, or SQL statements embedded in client applications, as these aren't validated until they're run through the containing code. Do all INSERT statements have a column list, and are all required columns specified?
  • Data/business logic protection. This could apply to either designing tables (which I do recommend code-reviewing, too), or when writing code to manipulate data (in the tables or otherwise). Are there enough constraints on the tables to prevent invalid data from entering the system? Are there any logical gaps in a piece of code? Does the piece of code accomplish the prescribed task? If in doubt, ask the developer what was intended. Sometimes logical gaps can be okay if a condition is rare, and the code is appropriately guarded. These same ideas apply to things like procedure parameters, particularly if any of the parameter values come from application user input.
  • Dynamic SQL. Follow through all logic paths to make sure the statements are being concatenated correctly, both for syntax and for potential injection attack vectors. Make sure the statements are parameterized appropriately (see the second sketch after this list).
  • Code Cleanliness. This is a bit tough to define. I personally don't expect strict adherence to a style guide, for example, but I do expect code to be legible, and make appropriate use of indentation, line breaks and comments. Everyone has their own natural style -- and even that can change over time -- so I think giving people more or less free rein over style is fine. Again, as long as the code is legible to others, because that is the most important thing.
  • Version- or edition-specific feature use. This shouldn't come up, as it should have been decided during the design phase, but at the end of the day, a developer can check anything in to the version control system.
  • Use of unexpected methods to accomplish a task. Does the code use a Rube Goldberg-like approach to accomplish the task? Are cursors being used when all that's necessary is a slightly more difficult-to-write set-based operation? In these types of cases, I will always go back to the developer and find out why this kind of approach was taken, and what was intended to be accomplished. Most of the time, this kind of situation provides a great learning opportunity.
  • Query plan reuse. Does the code reuse (or not reuse) query plans as expected? Pay particular attention to dynamic SQL: for example, there can be subtle string concatenation bugs where two generated statements (via different code paths) only differ by whitespace, but this is enough to produce a separate query plan. Sometimes dynamic SQL is used to force the creation of an alternative plan for the expected data size -- does this work as expected?
  • Performance. Does the code meet the performance requirements, if any, and/or does it execute in a "reasonable" amount of time with a "reasonable" amount of resource usage? Does the code use table variables or temporary tables, and how does this interact with the query plan(s)? Does the code produce efficient query plans that will scale? It's important to test code on a real data set to find out these things. Some of this is technically a "testing" activity; however, I would say it's unlikely a testing department will have the technical skill required to determine if the code is doing the right thing. Reading and understanding query plans is a huge topic in itself; in short, look for inefficient operators that are unexpected, such as table scans (particularly as the bottom operator in a nested loops join), huge row counts (thick arrows), missing indexes, etc. Speaking of indexes, determine if the code would benefit from index changes, even if query performance is currently acceptable. Even if you don't make those changes now, record your findings, as it could be valuable information to have later if performance does become a problem. This could potentially save an incredible amount of time and effort doing a workload analysis.
  • Deprecated features. This is fairly broad, as deprecation can happen for a wide range of things, from major features (e.g., database mirroring) to minor T-SQL syntax (e.g., using HOLDLOCK without surrounding parentheses). That said, for every new release of SQL Server, it's a good idea to thoroughly read the official list of deprecated features, and make sure no new code is using any of it unless absolutely necessary (and in that case, there needs to be a plan to modify the system to remove the deprecated feature usage). I have a script on the Scripts & Code resource page that dumps the internal performance counters that track usage of deprecated features, which I think may be more broad than the official reference list. This method is great if you want to find out what your application is really doing in production, so you can proactively fix it long before it becomes a problem.
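Two tiny sketches to illustrate the points above. First, the version-specific syntax example from the first bullet:

/* Fine on 2008+, but a syntax error on 2005: */
DECLARE @i int = 0;

/* 2005-compatible equivalent: */
DECLARE @j int;
SET @j = 0;

Second, a minimal example of parameterized dynamic SQL (the table is illustrative, borrowed from an earlier post): the value travels as a typed parameter instead of being concatenated into the statement text, which closes off injection and helps plan reuse:

DECLARE @sql nvarchar(200);
SET @sql = N'SELECT * FROM [dbo].[Products] WHERE ProductTypeId = @typeId;';

EXECUTE sp_executesql @sql, N'@typeId tinyint', @typeId = 1;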

That's a pretty good start! If you think I've left out something important, let me know in the comments and I'll add it to the post, as I want this to be a good reference guide.