Oct
2
2012

The Primary Key vs. The Clustered Index

When it comes to clustered vs. non-clustered indexes in SQL Server, there's still a lot of confusion about those types of indexes and constraints, particularly the primary key. Frankly, that's entirely understandable, because it is a bit confusing.

Let me try to simplify.

 

First, there are two ways indexes can be created:

  1. Directly (i.e., CREATE ... INDEX).
  2. Indirectly: certain types of constraints (UNIQUE and PRIMARY KEY) create indexes to implement the constraint.

The behind-the-scenes constraint indexes appear in both Management Studio and sys.indexes, which muddies the waters a bit, so to speak. This is also why there is spillover in terminology and statement syntax between indexes and constraints.

The key to understanding how this works is to separate the ideas of constraints and indexes.

In theory, the constraints I mentioned above could be implemented behind the scenes using another type of structure completely different from an index. It just so happens that an index is an available, reusable, efficient structure that gets the job done.
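You can see this relationship in the metadata. Here's a quick sketch (the table name is just a placeholder) that lists a table's indexes and flags the ones that exist solely to back a PRIMARY KEY or UNIQUE constraint:

SELECT
	i.name AS IndexName,
	i.type_desc,
	i.is_primary_key,
	i.is_unique_constraint
	FROM sys.indexes AS i
	WHERE i.object_id = OBJECT_ID(N'dbo.MyTable');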

 

Second, there is a default type of index created by each type of constraint, and the default can be overridden:

  • A PRIMARY KEY constraint creates a clustered index by default.
  • A UNIQUE constraint creates a non-clustered index by default.

 Here are all the combinations, and the type of index that gets created:

/* Clustered index */
ALTER TABLE MyTable
	ADD CONSTRAINT PK_MyTable
		PRIMARY KEY(WidgetId);

/* Clustered index */	
ALTER TABLE MyTable
	ADD CONSTRAINT PK_MyTable
		PRIMARY KEY CLUSTERED(WidgetId);

/* Non-clustered index */	
ALTER TABLE MyTable
	ADD CONSTRAINT PK_MyTable
		PRIMARY KEY NONCLUSTERED(WidgetId);


/* Non-clustered index */	
ALTER TABLE MyTable
	ADD CONSTRAINT UC_MyTable_WidgetNumber
		UNIQUE(WidgetNumber);

/* Non-clustered index */	
ALTER TABLE MyTable
	ADD CONSTRAINT UC_MyTable_WidgetNumber
		UNIQUE NONCLUSTERED(WidgetNumber);

/* Clustered index */	
ALTER TABLE MyTable
	ADD CONSTRAINT UC_MyTable_WidgetNumber
		UNIQUE CLUSTERED(WidgetNumber);

 

Finally, there is debate about whether a simple index or a constraint should be created, because in many cases they can function identically with respect to user queries. There are several factors that come into play when making this decision:

  • Constraints have extra restrictions on them (e.g., columns in a PRIMARY KEY constraint cannot allow NULLs).
  • Creating a constraint instead of an index may serve to better document data/business rules in the database.
  • Constraints take a little more effort to drop, which reduces the likelihood of a mistake.
  • It may not be possible to represent a data/business rule using a constraint, while it is possible using a normal index. Constraint indexes can't use special features such as filters, for example (see the sketch after this list).
  • Would creating an index with the same columns as a data/business rule better serve user queries? Constraint indexes don't allow included columns. If the table is large, creating and maintaining two separate indexes, one of which may be essentially redundant, can be expensive.
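For example, here's a sketch of a regular unique index that uses both a filter and an included column, neither of which a UNIQUE constraint can express (WidgetName is a hypothetical column):

/* A regular index can use features a constraint index can't */
CREATE UNIQUE NONCLUSTERED INDEX IX_MyTable_WidgetNumber
	ON MyTable(WidgetNumber)
	INCLUDE(WidgetName)
	WHERE WidgetNumber IS NOT NULL;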

My personal philosophy is to first use constraints to maintain data/business rules, and then supplement with regular indexes as necessary. Constraints alone are frequently enough to satisfy user queries; however, there are definitely situations where it's clear that methodology should be broken. It's very situational, and really there's no one better qualified to evaluate the situation than you.

 

Summary:

  • Indexes and constraints are separate objects.
  • Be aware of the default types of indexes created by the different types of constraints.
  • Create constraints and/or indexes based on your individual needs.
Sep
27
2012

Capture deadlocks 24/7 with Extended Events

Deadlocks can be notoriously difficult to reproduce in our development environments.

Most of the time:

  • There are multiple actual users involved with the problem.
  • The production workload, in terms of both volume and composition, can't realistically be reproduced in development either.
  • The tools we used to have to use to capture deadlocks (server-side trace, SQL Server Profiler) are prohibitively expensive to run all the time, or even for a short time, or they required messing around with trace flags.

Here's where Extended Events comes to the rescue, and really enables us to be proactive about dealing with deadlocks.

Extended Events allows us to capture deadlock graphs continuously, as they happen, 24 hours/day, 7 days/week, with very little performance overhead. Not only that, but the system we're going to set up is able to survive an instance restart! This is huge in terms of ease of management, and being able to have this information available when it's needed can be way more than half the battle.

Here is the meat of the script that sets up our event session for a SQL Server 2012 instance:

CREATE EVENT SESSION [DeadlockMonitor] ON SERVER 
	ADD EVENT sqlserver.xml_deadlock_report 
	ADD TARGET package0.event_file
	(
		SET
			FILENAME = N'\\CentralFileShare\MyTargetFile.xel',
			MAX_ROLLOVER_FILES = 0
	)
	WITH
	(
		EVENT_RETENTION_MODE = ALLOW_MULTIPLE_EVENT_LOSS,
		MAX_DISPATCH_LATENCY = 15 SECONDS,
		STARTUP_STATE = ON
	);

ALTER EVENT SESSION [DeadlockMonitor] ON SERVER
	STATE = START;

Note that the database engine service account needs to be granted write access to the target location.

For ease of multi-server management, I recommend setting up a centralized file share to store all the output files. The full script was written with this in mind, and uses the server name as the file name automatically.
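The full script isn't reproduced here, but a sketch of the idea looks something like this (same placeholder share as above; the REPLACE handles named instances, whose names contain a backslash):

DECLARE @fileName nvarchar(260) =
	N'\\CentralFileShare\' + REPLACE(@@SERVERNAME, N'\', N'$') + N'.xel';

DECLARE @sql nvarchar(max) = N'
CREATE EVENT SESSION [DeadlockMonitor] ON SERVER
	ADD EVENT sqlserver.xml_deadlock_report
	ADD TARGET package0.event_file
	(
		SET
			FILENAME = N''' + @fileName + N''',
			MAX_ROLLOVER_FILES = 0
	)
	WITH
	(
		EVENT_RETENTION_MODE = ALLOW_MULTIPLE_EVENT_LOSS,
		MAX_DISPATCH_LATENCY = 15 SECONDS,
		STARTUP_STATE = ON
	);';

EXEC(@sql);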

Security note: Deadlock graphs may contain sensitive information, such as data values in DML statements. The target location should be secured appropriately -- consider using subfolders in the target path to separate by server if a single folder isn't okay for your environment.

 

So, now that we've captured some deadlocks (yay?), how do we figure out what happened? Here's a starting point. It reads the output file and returns the event data (in this case, a deadlock graph) as XML. This script can be used to read any Extended Events file target output, not just deadlocks.

SELECT
	object_name,
	CAST(event_data AS xml) AS EventXml
	FROM
		sys.fn_xe_file_target_read_file
		(
			'.xel',
			'.xem', /* Required for 2008, optional for 2012 */
			NULL, NULL
		)

XPath can be used to further break things down, or the event data can be copied as text and saved as a .xdl file, which we can then open up in Management Studio to see the pretty deadlock graph. If you have a centralized management box, this script could be the basis of an Extended Events aggregation and reporting system. (Hopefully, you aren't dealing with that many deadlocks, but this type of system could be used for other purposes.)
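As a hedged starting point for the XPath route (the path is a placeholder, and the name of the data element may vary between versions), something like this pulls just the deadlock graphs out of the event data:

SELECT
	x.EventXml.value('(/event/@timestamp)[1]', 'datetime2') AS EventTimeUtc,
	x.EventXml.query('(/event/data[@name="xml_report"]/value/deadlock)[1]') AS DeadlockGraph
	FROM
	(
		SELECT CAST(event_data AS xml) AS EventXml
			FROM sys.fn_xe_file_target_read_file(N'\\CentralFileShare\*.xel', NULL, NULL, NULL)
	) AS x;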

In a future post, we'll look at how to analyze deadlock graphs.

Sep
25
2012

Why would I want a read-only copy of my database?

When an OLTP database goes into production, it's automatically required to meet certain demands: work correctly and reliably, store data safely and securely, and meet the performance needs of the users.

Depending on what the database is used for, there often comes a time when the business asks us to go back to the data that has accumulated over time, and answer questions to drive the business forward. The business may even want -- or need -- those answers as soon as new data is entered into the system.

These new demands place an additional workload on the database that may or may not have been anticipated when the system was first designed. Sometimes this workload can consume a significant amount of resources.

No system can be designed from the get-go to be perfectly future-proof. Sometimes changing the design of the system is more expensive than creating a new, totally separate system to answer the business questions. Sometimes the production database already handles a workload near its maximum capacity, and couldn't maintain the original requirements after adding the new workload, even if the system were perfectly designed.

How could performance suffer with the additional workload?

  • Memory pressure: reading lots of historical data could potentially bump current data out of memory, causing disk thrashing to keep the required data in memory. This is even more perilous if the existing database is larger than the buffer pool. Using a different SQL instance or physical machine could alleviate this issue.
  • CPU usage: the workload may need to perform very intensive calculations on a lot of data all at once, thus slowing down other user queries. Business analysis queries may use parallelism, which will increase CPU usage, add contention for worker threads, and increase the probability of blocking and/or deadlocks. Using a different physical machine would alleviate this issue.
  • Locking, blocking, and deadlocks: if the production database runs at the default isolation level of READ COMMITTED* (or higher) and the business questions require us to use the same level (or higher), there is the potential to create blocks and/or deadlocks, particularly if the existing system is experiencing blocking or deadlocks already. This problem can be solved by using a separate copy of the data; in particular, a read-only database does not even take locks (because there are no writes allowed) and therefore there is no blocking, and no possibility of deadlocks.

The requirements for our project are as follows:

  • Answer the business questions
  • Minimize the impact on the production database
  • The data source should be transactionally consistent (most likely)
  • The data source should be kept up-to-date automatically (most likely)
  • The data source should be reconciled to a point-in-time according to the business needs (current as of today, or as of the last hour, etc.)
  • We don't need to make changes to data, so a read-only version would be okay

So how do we meet the requirements? A read-only copy of the production database may be the solution. In SQL Server, there are many ways of doing this, depending on our exact needs.

 

Transaction Log Shipping

Take a transaction log backup; copy the backup to another location; restore the backup. That's how log shipping works. Log shipping can be configured to restore the backups in either NORECOVERY mode or STANDBY mode. The latter is what we're interested in, because it allows us to read the data.
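In case you've never seen it, a STANDBY restore is just a regular log restore with an undo file specified (all the names here are hypothetical):

/* Existing users must be disconnected before this will run */
RESTORE LOG MyDatabase
	FROM DISK = N'\\BackupShare\MyDatabase_log.trn'
	WITH STANDBY = N'D:\Standby\MyDatabase_undo.dat';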

Pros:

  • Very easy to set up, and simple to understand.
  • Usually we're already taking transaction log backups, so there's no additional load on production.
  • The log backups can be restored on another server.
  • Although all editions except Express have built-in tools to configure log shipping, it can still be accomplished in Express by rolling our own mechanism. After all, log shipping is just glorified transaction log backup and restore.

Cons:

  • May require us to change our backup strategy, depending on how frequently the read-only copy needs to be updated.
  • Users have to be kicked out of the standby database to restore subsequent backups.
  • Requires enough storage to hold an entire second copy of the production database.

Notes:

  • The production database has to be using either the FULL or BULK_LOGGED recovery model.
  • CPU and disk resources are required to restore the backups.
  • Network resources may be required to copy the backups to a different location.
  • This gives us an entire copy of the database (this could be a pro or con).

 

Database Snapshots

This gives us a point-in-time, transactionally consistent copy of the entire database. We would keep the copy we read from up-to-date by dropping the snapshot, and taking a new snapshot at a later time.
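The mechanics are simple; here's a sketch using made-up names and paths (note that every file of the source database needs a corresponding entry in the ON clause):

/* Take a snapshot we can read from */
CREATE DATABASE MyDatabase_Snapshot
	ON (NAME = MyDatabase_Data, FILENAME = N'D:\Snapshots\MyDatabase_Data.ss')
	AS SNAPSHOT OF MyDatabase;

/* Later, roll it over by dropping it and taking a fresh one */
DROP DATABASE MyDatabase_Snapshot;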

Pros:

  • Only requires enough storage to hold the changes made to the production database within the rollover interval. This could allow us to take and keep multiple snapshots at different times, rolling over as a window, without taking up N times the storage of the original database.
  • Very little overhead to take a snapshot (therefore, it's also fast).
  • All database recovery models are supported.

Cons:

  • Introduces extra disk write workload depending on how many changes are made to the production database.
  • Can only be created within the same instance of SQL Server as the production database.
  • Requires Enterprise Edition or higher.

Notes:

  • Requires that the snapshot files be placed on an NTFS volume.
  • Each file of the source database has a corresponding snapshot file. Try using my sp_TakeSnapshot stored procedure as a base to automate the process of taking a snapshot.
  • This gives us an entire copy of the database.

 

Database Mirroring

As of SQL Server 2012, database mirroring is deprecated, but I'm going to mention it here anyway, because the proposed alternative still... has its issues. Database mirroring sends transaction log records over the wire to another instance of SQL Server, where they get replayed on the copy of the database. While the secondary database is not directly readable, database snapshots, which are readable, can be taken against it. So, compared with plain database snapshots:

Pros:

  • Mirroring is usually used for high availability, so this could leverage existing infrastructure that would otherwise be underutilized.
  • The copy of the database can (in fact, must) be in a different instance (same machine, or another machine).
  • There's nothing to mess around with in the file system; instead, communication is done through endpoints.

Cons:

  • The production database must be using the FULL recovery model.
  • Requires enough storage for an entire second copy of the database (plus the snapshots).
  • While database mirroring is available in synchronous mode in Standard Edition and higher, we need snapshots for the purpose of this discussion, which requires Enterprise Edition. Enterprise also allows for asynchronous mirroring mode.

Notes:

  • The secondary has to do work to restore the log records, including operations such as index maintenance.
  • Network throughput may be a concern; it's usually better amortized over time than, e.g., log shipping.
  • Likely will need to poke holes in the servers' firewalls to allow communication through.
  • This gives us an entire copy of the database.

 

SQL Server Replication

Replication is a system built to sit on top of our database that reads committed transactions, converts them to a format any database system can understand, copies the changes to a subscriber, and applies those changes to another database. Unlike the other methods I've covered so far, replication doesn't actually give us a physically read-only copy of the database. It does, however, give us a copy of the data which we can read. There are several different types of replication (snapshot, merge, and transactional), and for the purposes of this discussion I'm going to lump them all together into a single category.

Pros:

  • Can create a copy of a subset of our data, including vertical and/or horizontal partitioning of individual tables. Replication is the only SQL Server technology that lets us do this. This can be a huge pro for security of data and storage space, because we can publish only the objects we need to read later.
  • Express Edition can subscribe to all types of publications.
  • Standard Edition can publish all types of publications. **
  • Changes are applied to the subscriber without the need to kick people out. (Warning: this means we need to be mindful of the transaction isolation level we use when running queries.)
  • All database recovery models are supported.

Cons:

  • Requires configuration, storage, and management of an additional database (the distribution database).
  • Can be difficult to understand, configure, secure, and administer. (Don't let this short list of cons fool you.)

Depending on how replication is configured and what we need to accomplish, database schema changes may not propagate to the subscriber. Also, some operations such as index maintenance don't get applied at subscribers, which may or may not be desirable.

 

Availability Groups

This is SQL Server's newest offering, and it's a hybrid of database mirroring and SQL Server failover clustering all rolled into one feature. (This is the reason why database mirroring on its own is deprecated.)

Pros:

  • Allows scale-out of reads to multiple secondaries without the use of database snapshots.
  • Like database mirroring, the read scale-out can also be part of a high availability strategy.

Cons:

  • The production database must be using the FULL recovery model.
  • Requires Enterprise Edition.
  • Must run on a Windows Server failover cluster (WSFC).

Out of all the technologies I've listed, this one is probably the least likely to be a good solution to solve the problem of having a read-only copy of the database (and that's it), unless existing infrastructure is in place for other reasons.

 

So there we have it: 5 different technologies built into SQL Server to give us a read-only copy of our data. In addition to those, there are various storage-level technologies, including replication and snapshots, that can be used for the same purpose. Even if those technologies aren't suitable for your project now, it's worth asking your storage administrator to find out which options are readily available in the event you do need to use them in the future.

Another option is to enable one of the snapshot isolation levels, where only writers block writers. This doesn't provide a copy of the database, and it has performance implications, but it can be a very easy way out in some circumstances.
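Enabling those options is a one-liner each; here's a sketch (the database name is a placeholder, and switching READ_COMMITTED_SNAPSHOT on needs a moment with no other active connections in the database):

ALTER DATABASE MyDatabase SET ALLOW_SNAPSHOT_ISOLATION ON;
ALTER DATABASE MyDatabase SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;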

 

* READ COMMITTED is the default isolation level on the user-installable versions of SQL Server, while READ COMMITTED SNAPSHOT is the default for Azure databases (aka Windows Azure SQL Database).

** For completeness, Enterprise Edition is required for a peer-to-peer topology, but this is not a suitable technology to only solve the "read-only copy of our data" problem.

Sep
18
2012

Will the Real Query Text Please Stand Up?

(While I can't take credit for finding this issue originally -- I heard about it several months ago -- for the life of me I can't remember or find where it was mentioned, so if you know of a reference, please leave a comment.)

When I first mentioned that developers should try to move away from using single-line comments (using --) and towards block comments (using /* */) in T-SQL, it was understandable that I got a bunch of "ummm... what?" looks.

All comments are created equal, right?

Well, obviously not, or this post wouldn't need to exist.

Aside from the obvious differences, there's a particular issue with single-line comments that's indirect and somewhat subtle, yet it can be significant. If you've worked with SQL Server Profiler for any length of time, there is a decent chance you've come across this issue already. Whether you realized it or not is a separate question entirely.

Onwards.

The problem is that Profiler word-wraps captured SQL statements according to its window size. While that might seem okay at first glance, the fact is that artificially introducing extra newlines when the statement syntax depends on the arrangement of newlines... can lead to interesting things. Not only that, but the syntax highlighting is applied after the word-wrapping takes place, which exacerbates the problem.

Observe:

SELECT
	fulltext_catalog_id--, name, path, is_default, is_accent_sensitivity_on, data_space_id, file_id, principal_id, is_importing
	FROM sys.fulltext_catalogs

This query selects a single column from sys.fulltext_catalogs, while the other columns are commented out.

If I shrink the Profiler window (both to demonstrate the problem, and so screenshots nicely fit onto this page) and run the query, the statement appears like this:

That... doesn't look right.

If you had only seen the query in Profiler and didn't know what the original query was, would you know which query was actually executed?

The really dangerous thing is that if you inspect the erroneous-looking query closely, you'll find that it is a different, yet syntactically valid, query.

SELECT
	fulltext_catalog_id--, name, path, is_default, is_accent_sensitivity_on,
data_space_id, file_id, principal_id, is_importing
	FROM sys.fulltext_catalogs

(If you don't see it, this query aliases the fulltext_catalog_id column as data_space_id, and returns 4 columns in the result set.)

Here's what you see in Profiler when you run it:

Looks familiar, doesn't it?

 

Tips:

  • Use block (/* */) comments wherever possible in production code, particularly if the comment occurs within part of a query. IMO, it's easier and safer to just get in the habit of using block comments all the time rather than picking and choosing where to use each type, or having to teach this kind of esoteric information. I rarely take issue with personal coding style as long as code is legible, but this is an exceptional circumstance that can directly affect things beyond readability. (The demo query is rewritten this way after these tips.)
  • Keep lines short, and break up long lines (this is a good practice for readability anyway). This can be accomplished in 99.999% of cases by using a combination of structured parentheses, indentation, and aliases.
  • Any time you see a single-line comment (--) within the query text in Profiler, don't assume what you're being shown is what was actually executed. Copy the TextData cell (Ctrl+Shift+C) into a Management Studio query window to see the real query text.
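For reference, here's the demo query from earlier rewritten with a block comment; however Profiler wraps it, the commented-out columns can't silently rejoin the live statement:

SELECT
	fulltext_catalog_id /*, name, path, is_default, is_accent_sensitivity_on, data_space_id, file_id, principal_id, is_importing */
	FROM sys.fulltext_catalogs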

(Note: the 2008 R2 version of Profiler is shown above; the 2012 version also exhibits this behaviour.)

Sep
12
2012

Fun with Multi-Column Foreign Keys and Nullable Columns

This is more of a reminder/warning/spread-the-word post to be really, really careful when dealing with multi-column foreign key relationships when one or more of the foreign columns allows NULL values.

Things... may not work as you might expect.

Consider the following script:

CREATE TABLE PrimaryTable
(
	Col1 int NOT NULL,
	Col2 int NOT NULL,
	
	CONSTRAINT PK_PrimaryTable
		PRIMARY KEY(Col1, Col2)
);

CREATE TABLE ForeignTable
(
	Id int IDENTITY PRIMARY KEY,
	Col1 int NOT NULL,
	Col2 int NULL,
	
	CONSTRAINT FK_ForeignTable_PrimaryTable
		FOREIGN KEY(Col1, Col2) REFERENCES PrimaryTable(Col1, Col2)
);

INSERT INTO PrimaryTable(Col1, Col2)
	VALUES (0, 0), (1, 1);


-- Statement 1
INSERT INTO ForeignTable(Col1, Col2)
	VALUES (0, 0);

-- Statement 2
INSERT INTO ForeignTable(Col1, Col2)
	VALUES (0, 1);

-- Statement 3
INSERT INTO ForeignTable(Col1, Col2)
	VALUES (0, NULL);

 

Statement 1 is standard, and succeeds as we'd expect. Similarly as expected, Statement 2 fails with a constraint violation as there's no row in PrimaryTable that has that combination of values.

What about Statement 3? If there was no foreign key, (0, NULL) would be an allowed combination of values, but what happens when there is a foreign key?

The answer is that the insert still succeeds.

I had expected it would fail just like Statement 2, because there isn't a (0, NULL) combination in PrimaryTable. In fact, it's not even an allowed combination to put in that table at all!

What happens is that NULL values are completely ignored from checking, and the constraint is satisfied as long as the non-NULL values match up to at least 1 row in the primary table. Edit: this is incorrect. If there are any NULLs in the foreign table columns, the constraint is not checked at all, which is even more perilous.

Multi-column foreign keys aren't exactly the best practice in the world as far as schema design goes, but you do encounter them in the wild occasionally. If a database is suspect from an integrity point-of-view (which usually means it has many NULLable columns), be aware of this behaviour when doing data cleanup or querying. The corollary is that if this kind of schema is found in a database, poor integrity may not be very far away either.
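As a starting point for that kind of cleanup, an anti-join (using the example tables from above) surfaces the rows the constraint waved through:

SELECT f.Id, f.Col1, f.Col2
	FROM ForeignTable AS f
	LEFT OUTER JOIN PrimaryTable AS p ON
		p.Col1 = f.Col1 AND
		p.Col2 = f.Col2
	WHERE p.Col1 IS NULL;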

Add this to your own Blitz script:

SELECT
	fk.name AS ConstraintName,
	COUNT(*) AS ColumnCount
	FROM sys.foreign_keys fk
	INNER JOIN sys.foreign_key_columns fkc ON fkc.constraint_object_id = fk.object_id
	GROUP BY fk.name
	HAVING COUNT(*) >= 2;