Nov 6, 2012

XML Schema Collections: the XML accelerator pedal

While some may disagree, I think XML is a great way to move small-to-medium-sized data sets into SQL Server.

Because of XML's generality, it can be used by pretty much any client application and technology you can think of, which can be a big advantage in getting the job done quickly and flexibly. I personally like it most for passing sets of parameters (i.e., multiple ids) into stored procedures, and for sending in full sets of data values when it's much easier to consolidate them on the SQL side than with complicated application logic.
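For example, here's a minimal sketch of the parameter-passing pattern (the Orders table and procedure name are hypothetical; this snippet is my illustration, not code from the original post):

CREATE PROCEDURE [dbo].[GetOrdersByIds](@orderIds xml)
AS
BEGIN

	-- Shred the set of ids out of the XML parameter and use it as a filter.
	SELECT o.*
		FROM [dbo].[Orders] o
		WHERE o.OrderId IN
		(
			SELECT n.x.value(N'.', 'int')
				FROM @orderIds.nodes(N'/Ids/Id') n(x)
		);

END

-- Called with something like:
-- EXEC [dbo].[GetOrdersByIds] @orderIds = N'<Ids><Id>3</Id><Id>7</Id><Id>42</Id></Ids>';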

The downside is that the XML has to be shredded (i.e. parsed) on the SQL Server side, which uses CPU. Out of all the tiers in our application stack, SQL Server is the most expensive to scale up, so it's important to be mindful when putting operations with large processing overhead into production. Ideally, we'd like to use XML with as little overhead as possible.

When we tell SQL Server to shred an XML document into a rowset, it can't make assumptions about what the incoming XML is going to look like. In fact, by default, SQL Server assumes there could be multiple root elements in the document (i.e., it's an XML fragment)!

By using XML Schema Collections, we tell SQL Server what to expect, and the optimizer is able to tailor the query plan to only include the necessary physical operations. A great side-effect of this is that the incoming XML is automatically validated against the schema, essentially making the input "strongly typed" like we would with regular variables.

Let's walk through an example to demonstrate just how much of a difference this can make. (Note: XML Schema Collections are available in 2005+, but this example code is written for 2008+.)

First, I'll create a table into which I'll dump the shredded data:

CREATE TABLE [dbo].[Table1]
(
	Id int IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
	Value1 int NOT NULL,
	Value2 nvarchar(50) NOT NULL
);


Next, I'll create the XML Schema Collection for comparison testing. You can see all the XML Schema Collections in a database through Management Studio in the Programmability | Types | XML Schema Collections folder.
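As an aside (this query is my addition, not from the original post), the same list is available from the sys.xml_schema_collections catalog view:

SELECT
	SCHEMA_NAME(xsc.schema_id) AS [Schema],
	xsc.name,
	xsc.create_date,
	xsc.modify_date
	FROM sys.xml_schema_collections xsc;

Here is the collection itself: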

CREATE XML SCHEMA COLLECTION [dbo].[Table1Rows] AS
N'<?xml version="1.0"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
	<xs:element name="Table1Rows">
		<xs:complexType>
			<xs:sequence>
				<xs:element maxOccurs="unbounded" name="Table1Row">
					<xs:complexType>
						<xs:sequence>
							<xs:element name="Value1" type="xs:integer" minOccurs="1" maxOccurs="1" />
							<xs:element name="Value2" type="xs:string" minOccurs="1" maxOccurs="1" />
						</xs:sequence>
					</xs:complexType>
				</xs:element>
			</xs:sequence>
		</xs:complexType>
	</xs:element>
</xs:schema>';
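
With the collection in place, the validation side-effect mentioned earlier is easy to demonstrate. This little snippet is my own addition (it's not part of the timing test below): assigning non-conforming XML to a strongly-typed variable fails outright.

-- Fails with an XML validation error, because the document is missing the
-- required Value1 element defined by the Table1Rows schema.
DECLARE @bad xml([dbo].[Table1Rows]) =
	N'<Table1Rows><Table1Row><Value2>abc</Value2></Table1Row></Table1Rows>';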


Finally, here is the meat of the test script, which builds up an XML document and then shreds and INSERTs it into the table in a loop. To run the comparison, uncomment the /*(Table1Rows)*/ part of the @sourceXml variable declaration.

DECLARE @sourceString nvarchar(MAX) =
	CONVERT
	(
		nvarchar(MAX),
		(
			SELECT TOP 100
				CAST(v.number AS nvarchar(4)) AS Value1,
				REPLICATE('a', CAST((v.number / 50) AS int)) AS Value2
				FROM master..spt_values v
				WHERE v.type = 'P'
				FOR XML PATH('Table1Row'), ROOT('Table1Rows')
		)
	);

DECLARE @sourceXml xml/*(Table1Rows)*/ = @sourceString;

DECLARE @i int = 100;
DECLARE @startTime datetime2;
DECLARE @endTime datetime2;

SET NOCOUNT ON;

TRUNCATE TABLE [dbo].[Table1];

SET @startTime = SYSDATETIME();

WHILE @i >= 0
BEGIN

	INSERT INTO [dbo].[Table1](Value1, Value2)
		SELECT
			n.x.value(N'Value1[1]', 'int'),
			n.x.value(N'Value2[1]', 'nvarchar(50)')
			FROM @sourceXml.nodes(N'Table1Rows[1]/Table1Row') n(x);
			
	SET @i -= 1;
	
END

SET @endTime = SYSDATETIME();

SELECT DATEDIFF(MILLISECOND, @startTime, @endTime);


While the results on your machine will no doubt vary from mine, my test on a 2008 R2 instance took about 3,100 milliseconds without the XML Schema Collection, and about 280 milliseconds with it. This is a huge improvement! And we can see the difference in the query plans.

Before:

Query Plan Before


After:

Query Plan After


When I started using XML Schema Collections, I was only using them to strongly-type my XML, but it turns out that shredding XML can see a significant performance improvement as a side-effect. I've read that strongly-typing XML slows down writes (modifications) to XML variables, but I have yet to encounter a situation where modifying an XML variable was necessary, or indeed, a good idea.

If you do start using XML Schema Collections in your own applications, I have a slight caution: you may want to avoid strongly typing parameters that are exposed in the database API (i.e., stored procedure parameters), even though it's technically correct to construct the parameters that way. The reason is that XML Schema Collections, once created, cannot be altered in place (there is an ALTER XML SCHEMA COLLECTION statement, but it doesn't do what you might expect). To change a definition, the collection must be dropped and recreated, which means all objects with strong dependencies on it must be dropped and recreated as well. I think this is a royal pain -- feel free to vote up the Connect feature request to add the expected functionality.

A workaround is to weakly type the API parameters, but immediately cast the parameter to a strongly-typed local variable in the procedure body, like so:

CREATE PROCEDURE [dbo].[Proc1](@p1 xml)
AS
BEGIN

	DECLARE @realP1 xml(MySchemaCollection) = @p1;

	/* ... the rest of the procedure uses @realP1 ... */

END


From an API point of view, that's not the best solution as the strong type isn't exposed to the outside, but IMO, it's a good enough tradeoff from a maintenance point of view, particularly if the Schema Collection is reused in many places.

Sep 18, 2012

Will the Real Query Text Please Stand Up?

(While I can't take credit for finding this issue originally -- I heard about it several months ago -- for the life of me I can't remember or find where it was mentioned, so if you know of a reference, please leave a comment.)

When I first mentioned that developers should try to move away from using single-line comments (using --) and towards block comments (using /* */) in T-SQL, it was understandable that I got a bunch of "ummm... what?" looks.

All comments are created equal, right?

Well, obviously not, or this post wouldn't need to exist.

Aside from the obvious differences, there's a particular issue with single-line comments that's indirect and somewhat subtle, yet it can be significant. If you've worked with SQL Server Profiler for any length of time, there is a decent chance you've come across this issue already. Whether you realized it or not is a separate question entirely.

Onwards.

The problem is that Profiler word-wraps captured SQL statements according to its window size. While that might seem okay at first glance, the fact is that artificially introducing extra newlines when the statement syntax depends on the arrangement of newlines... can lead to interesting things. Not only that, but the syntax highlighting is applied after the word-wrapping takes place, which exacerbates the problem.

Observe:

SELECT
	fulltext_catalog_id--, name, path, is_default, is_accent_sensitivity_on, data_space_id, file_id, principal_id, is_importing
	FROM sys.fulltext_catalogs

This query selects a single column from sys.fulltext_catalogs, while the other columns are commented out.

If I shrink the Profiler window (both to demonstrate the problem, and so screenshots nicely fit onto this page) and run the query, the statement appears like this:

That... doesn't look right.

If you had only seen the query in Profiler and didn't know what the original query was, do you know which query was actually executed?

The really dangerous thing is that if you inspect the erroneous-looking query closely, you'll find that it is a different, yet syntactically valid, query.

SELECT
	fulltext_catalog_id--, name, path, is_default, is_accent_sensitivity_on,
data_space_id, file_id, principal_id, is_importing
	FROM sys.fulltext_catalogs

(If you don't see it, this query aliases the fulltext_catalog_id column as data_space_id, and returns 4 columns in the result set.)

Here's what you see in Profiler when you run it:

Looks familiar, doesn't it?


Tips:

  • Use block (/* */) comments wherever possible in production code, particularly if the comment occurs within part of a query (see the rewritten example after this list). IMO, it's easier and safer to just get in the habit of using block comments all the time rather than picking and choosing where to use each type, or having to teach this kind of esoteric information. I rarely take issue with personal coding style as long as code is legible, but this is an exceptional circumstance that can directly affect things beyond readability.
  • Keep lines short, and break up long lines (this is a good practice for readability anyway). This can be accomplished in 99.999% of cases by using a combination of structured parentheses, indentation, and aliases.
  • Any time you see a single-line comment (--) within the query text in Profiler, don't assume what you're being shown is what was actually executed. Copy the TextData cell (Ctrl+Shift+C) into a Management Studio query window to see the real query text.
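
To illustrate the first two tips, here's the example query rewritten with a block comment (my rewrite, not from the original post). No matter where Profiler decides to wrap the line, the commented-out columns stay commented out:

SELECT
	fulltext_catalog_id /*, name, path, is_default, is_accent_sensitivity_on, data_space_id, file_id, principal_id, is_importing */
	FROM sys.fulltext_catalogs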

(Note: the 2008 R2 version of Profiler is shown above; the 2012 version also exhibits this behaviour.)

Sep 12, 2012

Fun with Multi-Column Foreign Keys and Nullable Columns

This is more of a reminder/warning/spread-the-word post to be really, really careful when dealing with multi-column foreign key relationships when one or more of the foreign columns allows NULL values.

Things... may not work as you might expect.

Consider the following script:

CREATE TABLE PrimaryTable
(
	Col1 int NOT NULL,
	Col2 int NOT NULL,
	
	CONSTRAINT PK_PrimaryTable
		PRIMARY KEY(Col1, Col2)
);

CREATE TABLE ForeignTable
(
	Id int IDENTITY PRIMARY KEY,
	Col1 int NOT NULL,
	Col2 int NULL,
	
	CONSTRAINT FK_ForeignTable_PrimaryTable
		FOREIGN KEY(Col1, Col2) REFERENCES PrimaryTable(Col1, Col2)
);

INSERT INTO PrimaryTable(Col1, Col2)
	VALUES (0, 0), (1, 1);


-- Statement 1
INSERT INTO ForeignTable(Col1, Col2)
	VALUES (0, 0);

-- Statement 2
INSERT INTO ForeignTable(Col1, Col2)
	VALUES (0, 1);

-- Statement 3
INSERT INTO ForeignTable(Col1, Col2)
	VALUES (0, NULL);


Statement 1 is standard, and succeeds as we'd expect. Similarly as expected, Statement 2 fails with a constraint violation as there's no row in PrimaryTable that has that combination of values.

What about Statement 3? If there was no foreign key, (0, NULL) would be an allowed combination of values, but what happens when there is a foreign key?

The answer is that the insert still succeeds.

I had expected it would fail just like Statement 2, because there isn't a (0, NULL) combination in PrimaryTable. In fact, it's not even an allowed combination to put in that table at all!

What happens is that NULL values are completely ignored during checking, and the constraint is satisfied as long as the non-NULL values match at least one row in the primary table. Edit: this is incorrect. If any of the foreign key columns in the inserted row are NULL, the constraint is not checked at all, which is even more perilous.
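
To drive the corrected point home, here's an extra statement you could append to the script above (my addition): the value 999 doesn't exist anywhere in PrimaryTable, yet the insert still succeeds, because the NULL in Col2 means the constraint is never checked.

-- Statement 4: succeeds even though no row in PrimaryTable has Col1 = 999,
-- because Col2 is NULL and the constraint is therefore not checked at all.
INSERT INTO ForeignTable(Col1, Col2)
	VALUES (999, NULL);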

Multi-column foreign keys aren't exactly the best practice in the world as far as schema design goes, but you do encounter them in the wild occasionally. If a database is suspect from an integrity point-of-view (which usually means it has many NULLable columns), be aware of this behaviour when doing data cleanup or querying. The corollary is that if this kind of schema is found in a database, poor integrity may not be very far away either.

Add this to your own Blitz script:

SELECT
	fk.name AS ConstraintName,
	COUNT(*) AS ColumnCount
	FROM sys.foreign_keys fk
	INNER JOIN sys.foreign_key_columns fkc ON fkc.constraint_object_id = fk.object_id
	GROUP BY fk.name
	HAVING COUNT(*) >= 2;
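
If you want to narrow that down to the specific risk described in this post, a possible refinement (my addition, so treat it as a sketch) is to flag only the multi-column foreign keys where at least one referencing column is nullable:

SELECT
	OBJECT_SCHEMA_NAME(fk.parent_object_id) AS SchemaName,
	fk.name AS ConstraintName,
	COUNT(*) AS ColumnCount,
	SUM(CASE WHEN c.is_nullable = 1 THEN 1 ELSE 0 END) AS NullableColumnCount
	FROM sys.foreign_keys fk
	INNER JOIN sys.foreign_key_columns fkc ON fkc.constraint_object_id = fk.object_id
	INNER JOIN sys.columns c ON c.object_id = fkc.parent_object_id AND c.column_id = fkc.parent_column_id
	GROUP BY fk.parent_object_id, fk.name
	HAVING COUNT(*) >= 2
		AND SUM(CASE WHEN c.is_nullable = 1 THEN 1 ELSE 0 END) >= 1;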