Feb
18
2014

Scale out database applications using SQL Server Service Broker (video)

Database applications can be scaled up or out by using SQL Server Service Broker, which provides asynchronous messaging and queuing infrastructure. The great thing is that the framework of this functionality is already built into your existing databases! In this video, I explain the basics of how a Service Broker application works and the database objects behind creating your own Service Broker applications.

 

 

Video resources:

SqlDependency class
Service Broker samples on CodePlex
ssbdiagnose utility

Jan
7
2014

Clustered vs. Nonclustered Index Internal Structures (video)

Clustered and nonclustered indexes share many of the same internal structures, but they're fundamentally different in nature. In this video, I compare the similarities and differences of clustered and nonclustered indexes, using a real-world example to show how these structures work to improve the performance of SQL queries.

 

 

Video resources:

Blog post on primary key vs. the clustered index
CREATE INDEX statement reference
ALTER INDEX statement reference
Index navigation internals by example

Sample index data is from the AdventureWorksLT2008R2 sample database

Jan
3
2013

Globalizing applications, in any locale

I was watching a car review on YouTube the other day, and the host made the point that the voice recognition system only understood names as they're pronounced, according to the locale of the infotainment software (U.S. English in this case). Moreover, he said the user should think about how this process works in order for it to be more usable.

Now, that's a great idea, and it kinda-sorta works, but it's not very user-friendly. Users shouldn't be expected to understand how a system functions internally for it to be usable. Many people know how to drive, yet they may not understand exactly how a car works.

These are the kinds of issues we encounter when developing globalized applications. How do we bridge the gap between what the user understands, and what the computerized system understands?

I think it comes down to assumptions. In other words, the computerized system assumes something, while the user either isn't aware of that assumption, or doesn't provide enough input (or is never prompted for input) for the system to do anything except make an assumption.

In the case of the voice recognition system, the system assumes the user's contact names are pronounced based on the locale of the system. This sounds like a reasonable assumption, and for the most part, it works. When thinking about it in the context of an increasingly globalized society, though, it's becoming a less and less valid assumption. If the system is only single-locale-aware, there's no other choice for it to make, just given a list of names.

While giving the user the ability to choose a different default locale for the system probably helps quite a bit, it doesn't solve the problem of the inevitable exceptions to the default.

If we think about the data available, is there any way we could help the system better understand the exceptions?

  • We could add a property to each name that specifies the locale in which to interpret it. This would solve the problem of complete misinterpretation as shown in the car review. It doesn't, however, solve the problem of idiosyncrasies within a locale (i.e. "Smith" and "Smythe" could be pronounced the same, and in more than just one way), and it would make the software more complex (although it could potentially save on costs, because the same bits could be deployed to all locales, just with different default settings).
  • Another approach would be to allow the user to input a phonetic version of the name, in order for the system to understand the pronunciation based only on the single default locale. This would solve the problem of exceptions, and the issue of same-pronunciations-different-spelling as mentioned in the previous point. While the phonetic data is assumed to be of the same locale as the locale of the names themselves, this is probably an acceptable drawback for most applications.

With globalization, other factors come into play, too:

  • The characters accepted by the system. For example, if the system's default locale is Chinese, is it going to allow the entry of names using English characters, and how would the system understand a phonetic spelling of that name?
  • What if there are differences between languages where some soundings of syllables may not exist in both?
  • How would efficient searching of the same data be accomplished for multiple locales?
  • How much extra effort (i.e., cost and time) is required to add a reasonably globalized solution for the specific application?

As you can probably tell, these are not necessarily problems that are easily and effortlessly solved, and I'm not going to claim for a second that either of the two approaches I mentioned will solve all application issues related to globalization. Since every application is different, though, they may offer a good solution, or at least a reasonable compromise -- either way, an improvement in usability, which was the original goal.

As database designers, we must be aware of how globalization affects which pieces of data we store, how they are arranged/sorted for efficient searching, and how users will be interacting with the data. It's also important to think about designing database structures such that they can easily be enhanced to accommodate a globalized application, even if the application will not support that kind of scenario in Version 1. It's coming; be prepared for it.

Nov
15
2012

How much overhead do CHECK constraints have?

When designing new database structures, I feel it's really important to make sure that the data goes into the new tables as cleanly as possible. Doing this early on helps prevent bugs from creeping into the system as its developed. And preventing bugs as early as possible in the development cycle is almost certainly the most cost-efficient strategy -- cleaning up bad data later on is extremely expensive, time consuming, and frustrating.

One of the tools that can be used to reject bad data values going into tables are CHECK constraints. This type of constraint is meant to be used to validate a single row based on a predicate that's a function of any combination of the table columns in that row. (CHECK constraints can be used for things beyond this definition... but doesn't mean they should be.)

In this post, I'm not going to address any more of the advantages of constraining data values. What I do want to investigate is performance.

The extra logic to do this validation necessarily has additional processing cost associated with it. The question is: how much extra? Does it really make a difference? What kind of general guidelines can be established for performance purposes?

 

The Setup

I decided to run two series of tests: one with a very narrow table (3 columns), and one with a wide table (21 columns). We'll INSERT a bunch of test data into each of the tables. A baseline would be established for each test, where the tables have no constraints. Then, different constraints would be added, and hopefully we'll be able to see a difference in the load times.

Here is an abridged version of the table creation script (note: all scripts are available to download in full at the bottom of this post):

-- Narrow, no constraints
CREATE TABLE [dbo].[NarrowWithoutConstraints]
(
	Id int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
	Col1 int NOT NULL,
	Col2 varchar(10) NOT NULL
);

-- Wide, no constraints
CREATE TABLE [dbo].[WideWithoutConstraints]
(
	Id int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
	Col10 int NOT NULL,
	...
	Col19 int NOT NULL,
	Col20 varchar(10) NOT NULL,
	...
	Col29 varchar(10) NOT NULL,
);

-- Narrow, simple constraints
CREATE TABLE [dbo].[NarrowWithConstraints1]
(
	Id int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
	Col1 int NOT NULL CHECK (Col1 >= 0),
	Col2 varchar(10) NOT NULL CHECK (LEN(Col2) > 0)
);

-- Wide, simple constraints
CREATE TABLE [dbo].[WideWithConstraints1]
(
	Id int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
	Col10 int NOT NULL CHECK (Col10 >= 0),
	...
	Col19 int NOT NULL CHECK (Col19 >= 0),
	Col20 varchar(10) NOT NULL CHECK (LEN(Col20) > 0),
	...
	Col29 varchar(10) NOT NULL CHECK (LEN(Col29) > 0)
);

-- Narrow, complex constraints
CREATE TABLE [dbo].[NarrowWithConstraints2]
(
	Id int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
	Col1 int NOT NULL
		CHECK ((Col1 >= 0) AND (Col1 > -9999) AND (Col1 < 5000) AND (Col1 < 9999)),
	Col2 varchar(10) NOT NULL
		CHECK ((LEN(Col2) > 0) AND (DATALENGTH(Col2) = DATALENGTH(LTRIM(RTRIM(Col2)))))
);

-- Wide, complex constraints
CREATE TABLE [dbo].[WideWithConstraints2]
(
	Id int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
	Col10 int NOT NULL
		CHECK ((Col10 >= 0) AND (Col10 > -9999) AND (Col10 < 5000) AND (Col10 < 9999)),
	...
	Col19 int NOT NULL
		CHECK ((Col19 >= 0) AND (Col19 > -9999) AND (Col19 < 5000) AND (Col19 < 9999)),
	Col20 varchar(10) NOT NULL
		CHECK ((LEN(Col20) > 0) AND (DATALENGTH(Col20) = DATALENGTH(LTRIM(RTRIM(Col20))))),
	...
	Col29 varchar(10) NOT NULL
		CHECK ((LEN(Col29) > 0) AND (DATALENGTH(Col29) = DATALENGTH(LTRIM(RTRIM(Col29))))),
);

-- Narrow, many complex constraints
CREATE TABLE [dbo].[NarrowWithConstraints3]
(
	Id int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
	Col1 int NOT NULL
		CHECK ((Col1 >= 0) AND (Col1 > -9999) AND (Col1 < 5000) AND (Col1 < 9999)),
	Col2 varchar(10) NOT NULL
		CHECK ((LEN(Col2) > 0) AND (DATALENGTH(Col2) = DATALENGTH(LTRIM(RTRIM(Col2)))))
);

ALTER TABLE [dbo].[NarrowWithConstraints3]
	ADD CHECK ((Col1 > -1) AND (Col1 > -9998) AND (Col1 < 5001) AND (Col1 < 9998));

...

ALTER TABLE [dbo].[NarrowWithConstraints3]
	ADD CHECK ((Col1 > -9) AND (Col1 > -9990) AND (Col1 < 5009) AND (Col1 < 9990));

ALTER TABLE [dbo].[NarrowWithConstraints3]
	ADD CHECK ((LEN(Col2) > -1) AND ((DATALENGTH(Col2) + 1) = (DATALENGTH(LTRIM(RTRIM(Col2))) + 1)));

...

ALTER TABLE [dbo].[NarrowWithConstraints3]
	ADD CHECK ((LEN(Col2) > -9) AND ((DATALENGTH(Col2) + 9) = (DATALENGTH(LTRIM(RTRIM(Col2))) + 9)));

 

The Test

Testing these scenarios is pretty simple: shove a bunch of data into a table, and time how long it takes. One of the test harnesses is shown below:

DECLARE @startTime datetime2;
DECLARE @endTime datetime2;
DECLARE @i int = 1250;

SET NOCOUNT ON;

TRUNCATE TABLE [dbo].[<table>];
DBCC CHECKIDENT('dbo.<table>', RESEED, 1);

SET @startTime = SYSDATETIME();

WHILE @i >= 0
BEGIN

	INSERT INTO [dbo].[<table>](Col1, Col2)
		SELECT
			number,
			CAST(number AS varchar(10))
			FROM master..spt_values v
			WHERE v.type = 'P'

	SET @i -= 1;

END

SET @endTime = SYSDATETIME();

SELECT DATEDIFF(MILLISECOND, @startTime, @endTime);

So that's pretty standard. I ran the narrow table test with @i = 1250, and the wide table test with @i = 750, which resulted in 2,560,000 and 1,536,000 rows inserted, respectively. For each table, I ran the test harness 11 times; the first result was discarded, and the rest got recorded and averaged together. Note that the table is TRUNCATEd and the identity seed reset before each run. Also, the database was in SIMPLE recovery.

 

The Results

My hypothesis was that the constraints (at least the simple ones) would add a little bit of overhead, but I wasn't sure if it would be measurable, and it was unclear how much (if anything) SQL Server would do to simplify the expressions I used (turns out it didn't simplify them at all).

  Average (ms) ms/1000 Rows % Increase
Narrow (no constraints) 12,401 0.484 -
Narrow (simple constraints) 13,034 0.509 5.1%
Narrow (complex constraints) 14,505 0.567 17.0%
Narrow (many complex constraints) 25,101 0.981 102.4%

There are a couple of very interesting things here. First of all, the difference between no constraints and a small number of simple constraints is measurable. Second -- this is what really surprised me -- a complex constraint is significantly more expensive than the simple constraint. This makes sense because functions are always going to be more expensive than something like a basic comparison. It seems that perhaps this difference is magnified in this case because the amount of work to do the actual INSERT is small relative to the total amount of work needed to be done. That said, it appears that many complex constraints could at the very least hobble an INSERT-only workload.

 

  Average (ms) ms/1000 Rows % Increase
Wide (no constraints) 12,477 0.812 -
Wide (simple constraints) 14,647 0.954 17.4%
Wide (complex constraints) 20,238 1.318 62.2%

In this test, we would expect that the differences are larger because of the number of constraints used. As a sanity check, we can see that the ratio of the increases (62.2% / 17.4% = 3.6) vs (17.0% / 5.1% = 3.3) is approximately the same.

If we compare the narrow table with many complex constraints with the wide table with complex constraints, they both have the same number and type of constraints, and the ms/1000 rows numbers have the same sort of increases (0.981 - 0.484 = 0.497) vs (1.318 - 0.812 = 0.506), respectively. This says that evaluating constraints has a fixed cost, independent of the table schema.

 

Conclusion

I certainly learned a lot from doing this experiment. There are a few takeaways from the results:

  • Disabling CHECK constraints will reduce the load time, perhaps significantly.
  • A CHECK constraint's overhead is determined by it's complexity of evaluation (comparisons good; function calls not so much).
  • Evaluating CHECK constraints has a fixed cost, and thus a smaller relative cost when the other parts of the process are more expensive.

While these aren't Earth-shattering conclusions, we did get to validate the best-practice of disabling constraints for loading, and see just how much overhead CHECK constraints can add.

What I want to stress here is that this test was for a relatively narrow situation involving only INSERTs. If your workload is 100% INSERTs, then you definitely need to pay attention to these kinds of things, but if your system isn't INSERTing 200k rows/sec -- or even anywhere near that -- it's likely that the benefits of the constraints will outweigh the performance penalty, as the overhead will be a small portion of the overall workload.

 

Full scripts and test results:

01 - Create Tables.sql

02 - Narrow Table Test Harness.sql

03 - Wide Table Test Harness.sql

Test Results.xlsx

Oct
16
2012

Choosing a database recovery model

The recovery model of a database is a critical setting that determines how the transaction log both records and retains information about transactions that occur in the database. It's important to remember that the setting always applies to the entire database.

As we'll explore in this post, there are 3 recovery models in SQL Server to choose from: FULL, BULK_LOGGED, and SIMPLE; each model has advantages, disadvantages, and tradeoffs. Only you can decide which model is appropriate for your particular situation and goals; we'll go through some of those tradeoffs and a few typical scenarios.

First of all, let's take a quick step back and ask why we care about what gets recorded in the transaction log.

 

What is recorded in the transaction log?

The transaction log contains an ordered sequence of physical operations that occur within a database. Each physical operation is written to the transaction log before the change itself is applied to the data page(s). This is called write-ahead logging, and it occurs regardless of the recovery model chosen.

Usually, the physical operations written to the transaction log also contain enough information to perform the inverse operation of the change (i.e., to undo a delete, do an insert). This combined with the write-ahead logging mechanism ensures transactions succeed or fail as a whole (atomicity) by giving SQL Server the ability to both undo and redo transactions. Hopefully those two bolded words set off hints about how logging is intimately tied to backup, restore, and several other features in SQL Server -- we'll get to those in a minute.

 

Minimal Logging

There are a handful of operations we can do in a database that can be minimally logged -- this means that only enough information required to undo the transaction is logged. In the case of a BULK INSERT, for example, all that's logged are page allocations -- we can undo those by deallocating the pages, but we can't redo them because the actual contents of the pages aren't recorded in the log.

We gain speed by writing to the transaction log less, but lose the ability to perform point-in-time recovery during the operation.

You might be wondering how a database restored using only transaction log backups can be complete after a point in time where an operation such as a BULK INSERT took place. How does SQL Server know what to put on those data pages, since only the allocations were logged?

The answer is in what is recorded by the first transaction log backup taken after the minimally-logged operation. SQL Server keeps track of extents that are modified by minimally-logged operations in a special metadata page (the ML map, or minimally-logged map, which contains 1 bit per extent), and copies those extents into the transaction log backup, along with the required portion of the transaction log itself. Yes, transaction log backups can contain data!

The net result of minimal logging isn't that we get the operation completely for free; we just defer a good chunk of the work to the backup subsystem for an immediate gain in speed.

 

Armed with this knowledge, let's take a look at the features of each of the three recovery models.

 

FULL

  • Full logging, all the time.
  • Supports point-in-time recovery (you must be taking transaction log backups).
  • Only transaction log backups internally clear the transaction log, so a good backup strategy is a must.
  • Required for certain features, such as Database Mirroring and Availability Replicas.
  • Typically used for OLTP databases, and where data protection is a high priority.

 

BULK_LOGGED

  • Full logging, except when the required conditions are met for minimally-logged operations.
  • Supports point-in-time recovery (again, you have to take transaction log backups) for all transaction log backups that do not contain minimally-logged operations. This means you should take a transaction log backup immediately before, and immediately after the minimally-logged operation(s), to maximize your ability to do a point-in-time restore.
  • Only transaction log backups internally clear the transaction log (same as FULL).
  • Used for performing bulk operations on a database that would otherwise prefer to be in FULL recovery. A database can be switched freely between FULL and BULK_LOGGED without breaking the transaction log backup chain.

 

SIMPLE

  • Full logging, except when the required conditions are met for minimally-logged operations.
  • Does not support point-in-time recovery, and the transaction log cannot be backed up.
  • A database CHECKPOINT internally clears the transaction log as part of the process (since you can't take transaction log backups), which means you don't necessarily have to manage the transaction log as much as the other recovery models.
  • Usually used for databases that frequently handle large quantities of data (i.e., data warehouses), or databases where point-in-time recovery is not important.
  • Switching a database from FULL or BULK_LOGGED to SIMPLE breaks the transaction log backup chain. The transaction log backup chain can be restarted by switching from SIMPLE back to FULL or BULK_LOGGED, and taking a full or differential backup.

 

A good place to start choosing a recovery model would be by answering the following questions. Remember that these are only guidelines, and your specific situation will determine the best course of action.

  • Does the database require point-in-time recovery? (Yes = FULL or BULK_LOGGED, No = SIMPLE)
  • Are special features going be used on this database? (Yes = FULL or BULK_LOGGED -- usually FULL -- depending on the feature)
  • Will large quantities of data be imported into, or manipulated within this database? (Yes = BULK_LOGGED or SIMPLE)