Nov 15, 2012

How much overhead do CHECK constraints have?

When designing new database structures, I feel it's really important to make sure that the data goes into the new tables as cleanly as possible. Doing this early on helps prevent bugs from creeping into the system as it's developed. And preventing bugs as early as possible in the development cycle is almost certainly the most cost-efficient strategy -- cleaning up bad data later on is extremely expensive, time consuming, and frustrating.

One of the tools that can be used to reject bad data values going into tables is CHECK constraints. This type of constraint is meant to be used to validate a single row based on a predicate that's a function of any combination of the table columns in that row. (CHECK constraints can be used for things beyond this definition... but that doesn't mean they should be.)
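As a quick sketch of what I mean (the table and column names here are hypothetical), the predicate is checked for every row going in, and an offending row is rejected outright:

CREATE TABLE [dbo].[Example]
(
	Id int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
	Quantity int NOT NULL CHECK (Quantity > 0)
);

INSERT INTO [dbo].[Example](Quantity) VALUES(5);	-- succeeds
INSERT INTO [dbo].[Example](Quantity) VALUES(-1);	-- rejected with a constraint violation error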

In this post, I'm not going to address any more of the advantages of constraining data values. What I do want to investigate is performance.

The extra logic to do this validation necessarily has additional processing cost associated with it. The question is: how much extra? Does it really make a difference? What kind of general guidelines can be established for performance purposes?

 

The Setup

I decided to run two series of tests: one with a very narrow table (3 columns), and one with a wide table (21 columns). We'll INSERT a bunch of test data into each of the tables. A baseline is established for each test, where the tables have no constraints; then different constraints are added, and hopefully we'll be able to see a difference in the load times.

Here is an abridged version of the table creation script (note: all scripts are available to download in full at the bottom of this post):

-- Narrow, no constraints
CREATE TABLE [dbo].[NarrowWithoutConstraints]
(
	Id int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
	Col1 int NOT NULL,
	Col2 varchar(10) NOT NULL
);

-- Wide, no constraints
CREATE TABLE [dbo].[WideWithoutConstraints]
(
	Id int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
	Col10 int NOT NULL,
	...
	Col19 int NOT NULL,
	Col20 varchar(10) NOT NULL,
	...
	Col29 varchar(10) NOT NULL
);

-- Narrow, simple constraints
CREATE TABLE [dbo].[NarrowWithConstraints1]
(
	Id int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
	Col1 int NOT NULL CHECK (Col1 >= 0),
	Col2 varchar(10) NOT NULL CHECK (LEN(Col2) > 0)
);

-- Wide, simple constraints
CREATE TABLE [dbo].[WideWithConstraints1]
(
	Id int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
	Col10 int NOT NULL CHECK (Col10 >= 0),
	...
	Col19 int NOT NULL CHECK (Col19 >= 0),
	Col20 varchar(10) NOT NULL CHECK (LEN(Col20) > 0),
	...
	Col29 varchar(10) NOT NULL CHECK (LEN(Col29) > 0)
);

-- Narrow, complex constraints
CREATE TABLE [dbo].[NarrowWithConstraints2]
(
	Id int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
	Col1 int NOT NULL
		CHECK ((Col1 >= 0) AND (Col1 > -9999) AND (Col1 < 5000) AND (Col1 < 9999)),
	Col2 varchar(10) NOT NULL
		CHECK ((LEN(Col2) > 0) AND (DATALENGTH(Col2) = DATALENGTH(LTRIM(RTRIM(Col2)))))
);

-- Wide, complex constraints
CREATE TABLE [dbo].[WideWithConstraints2]
(
	Id int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
	Col10 int NOT NULL
		CHECK ((Col10 >= 0) AND (Col10 > -9999) AND (Col10 < 5000) AND (Col10 < 9999)),
	...
	Col19 int NOT NULL
		CHECK ((Col19 >= 0) AND (Col19 > -9999) AND (Col19 < 5000) AND (Col19 < 9999)),
	Col20 varchar(10) NOT NULL
		CHECK ((LEN(Col20) > 0) AND (DATALENGTH(Col20) = DATALENGTH(LTRIM(RTRIM(Col20))))),
	...
	Col29 varchar(10) NOT NULL
		CHECK ((LEN(Col29) > 0) AND (DATALENGTH(Col29) = DATALENGTH(LTRIM(RTRIM(Col29)))))
);

-- Narrow, many complex constraints
CREATE TABLE [dbo].[NarrowWithConstraints3]
(
	Id int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
	Col1 int NOT NULL
		CHECK ((Col1 >= 0) AND (Col1 > -9999) AND (Col1 < 5000) AND (Col1 < 9999)),
	Col2 varchar(10) NOT NULL
		CHECK ((LEN(Col2) > 0) AND (DATALENGTH(Col2) = DATALENGTH(LTRIM(RTRIM(Col2)))))
);

ALTER TABLE [dbo].[NarrowWithConstraints3]
	ADD CHECK ((Col1 > -1) AND (Col1 > -9998) AND (Col1 < 5001) AND (Col1 < 9998));

...

ALTER TABLE [dbo].[NarrowWithConstraints3]
	ADD CHECK ((Col1 > -9) AND (Col1 > -9990) AND (Col1 < 5009) AND (Col1 < 9990));

ALTER TABLE [dbo].[NarrowWithConstraints3]
	ADD CHECK ((LEN(Col2) > -1) AND ((DATALENGTH(Col2) + 1) = (DATALENGTH(LTRIM(RTRIM(Col2))) + 1)));

...

ALTER TABLE [dbo].[NarrowWithConstraints3]
	ADD CHECK ((LEN(Col2) > -9) AND ((DATALENGTH(Col2) + 9) = (DATALENGTH(LTRIM(RTRIM(Col2))) + 9)));

 

The Test

Testing these scenarios is pretty simple: shove a bunch of data into a table, and time how long it takes. One of the test harnesses is shown below:

DECLARE @startTime datetime2;
DECLARE @endTime datetime2;
DECLARE @i int = 1250;

SET NOCOUNT ON;

TRUNCATE TABLE [dbo].[<table>];
DBCC CHECKIDENT('dbo.<table>', RESEED, 1);

SET @startTime = SYSDATETIME();

WHILE @i > 0
BEGIN

	INSERT INTO [dbo].[<table>](Col1, Col2)
		SELECT
			number,
			CAST(number AS varchar(10))
			FROM master..spt_values v
			WHERE v.type = 'P'

	SET @i -= 1;

END

SET @endTime = SYSDATETIME();

SELECT DATEDIFF(MILLISECOND, @startTime, @endTime);

So that's pretty standard. I ran the narrow table test with @i = 1250, and the wide table test with @i = 750, which resulted in 2,560,000 and 1,536,000 rows inserted, respectively. For each table, I ran the test harness 11 times; the first result was discarded, and the remaining ten runs were recorded and averaged. Note that the table is TRUNCATEd and the identity seed is reset before each run. Also, the database was in SIMPLE recovery.
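As a sanity check on those row counts, each pass through the loop inserts the number of rows in master..spt_values with type = 'P' (2,048 on my instance), so 1,250 and 750 iterations work out to 2,560,000 and 1,536,000 rows:

SELECT COUNT(*) AS RowsPerIteration
	FROM master..spt_values v
	WHERE v.type = 'P';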

 

The Results

My hypothesis was that the constraints (at least the simple ones) would add a little bit of overhead, but I wasn't sure if it would be measurable, and it was unclear how much (if any) simplification SQL Server would do to the expressions I used (turns out it didn't simplify them at all).

                                  | Average (ms) | ms/1000 Rows | % Increase
Narrow (no constraints)           |       12,401 |        0.484 |          -
Narrow (simple constraints)       |       13,034 |        0.509 |       5.1%
Narrow (complex constraints)      |       14,505 |        0.567 |      17.0%
Narrow (many complex constraints) |       25,101 |        0.981 |     102.4%

There are a couple of very interesting things here. First of all, the difference between no constraints and a small number of simple constraints is measurable. Second -- and this is what really surprised me -- a complex constraint is significantly more expensive than a simple one. In hindsight this makes sense: function calls are always going to be more expensive than a basic comparison. The difference is likely magnified in this case because the amount of work to do the actual INSERT is small relative to the total amount of work that needs to be done. That said, it appears that many complex constraints could at the very least hobble an INSERT-only workload.

 

                            | Average (ms) | ms/1000 Rows | % Increase
Wide (no constraints)       |       12,477 |        0.812 |          -
Wide (simple constraints)   |       14,647 |        0.954 |      17.4%
Wide (complex constraints)  |       20,238 |        1.318 |      62.2%

In this test, we would expect the differences to be larger because of the number of constraints used. As a sanity check, the ratios of the increases are approximately the same: 62.2% / 17.4% = 3.6 for the wide table vs 17.0% / 5.1% = 3.3 for the narrow table.

If we compare the narrow table with many complex constraints against the wide table with complex constraints, they both have the same number and type of constraints, and the ms/1000 rows figures increase by about the same amount (0.981 - 0.484 = 0.497 vs 1.318 - 0.812 = 0.506, respectively). This suggests that evaluating a given set of constraints has a fixed per-row cost, independent of the table schema.

 

Conclusion

I certainly learned a lot from doing this experiment. There are a few takeaways from the results:

  • Disabling CHECK constraints will reduce the load time, perhaps significantly (a minimal sketch follows this list).
  • A CHECK constraint's overhead is determined by its complexity of evaluation (comparisons good; function calls not so much).
  • Evaluating CHECK constraints has a fixed cost, and thus a smaller relative cost when the other parts of the process are more expensive.
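For the first point, here's a minimal sketch of the usual load pattern, using one of the test tables from above (re-enabling WITH CHECK makes SQL Server re-validate the existing rows, so the constraints stay trusted):

-- Disable all CHECK (and foreign key) constraints on the table for the load
ALTER TABLE [dbo].[NarrowWithConstraints1] NOCHECK CONSTRAINT ALL;

-- ...do the load here...

-- Re-enable and re-validate; WITH CHECK keeps the constraints trusted
ALTER TABLE [dbo].[NarrowWithConstraints1] WITH CHECK CHECK CONSTRAINT ALL;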

While these aren't Earth-shattering conclusions, we did get to validate the best practice of disabling constraints for loading, and see just how much overhead CHECK constraints can add.

What I want to stress here is that this test was for a relatively narrow situation involving only INSERTs. If your workload is 100% INSERTs, then you definitely need to pay attention to these kinds of things, but if your system isn't INSERTing 200k rows/sec -- or even anywhere near that -- it's likely that the benefits of the constraints will outweigh the performance penalty, as the overhead will be a small portion of the overall workload.

 

Full scripts and test results:

01 - Create Tables.sql

02 - Narrow Table Test Harness.sql

03 - Wide Table Test Harness.sql

Test Results.xlsx

Nov 8, 2012

What changed in my database last week?

PANIC! A customer clicked through four layers of warning messages and accidentally deleted a bunch of data from the database, but didn't bother to tell you about it until today.

Great. The database is in FULL or BULK_LOGGED recovery, and we have a full set of transaction log backups that contain all the transactions in the database for the time when things "happened." Okay... now what? Log backups seem very opaque, as we can't just open them up in Notepad and expect things to be human-readable.

Enter the undocumented table-valued function: fn_dump_dblog.

This function reads a transaction log backup file and returns a human-readable -- well, geek-readable -- description of the physical log records from the backup.

(The sister function fn_dblog does the same sort of thing, except it operates only on the active transaction log of a database, not a transaction log backup.)

Paul Randal wrote a detailed blog entry on the basics of how to use both of these functions, so I won't cover that here. Where this post differs is in consuming the output of the function so it's much more usable.

The first step is to read the entire backup and dump the output into a temporary table. This will make querying the same set of transactions (or more transactions if you load additional backups into the table) much faster, as the log reading part of things is rather slow.

--INSERT INTO #transactions SELECT *	-- use this form instead to load additional backups into the same table
SELECT * INTO #transactions
	FROM
		fn_dump_dblog
		(
			NULL,	-- Start
			NULL,	-- End
			'DISK',	-- Device Type
			1,		-- File number in backup
			'',		-- Backup file
			NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
			NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
			NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
			NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
			NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
			NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
			NULL, NULL, NULL
		);

ALTER TABLE #transactions
	ADD
		StartDate datetime NULL,
		EndDate datetime NULL;

UPDATE #transactions
	SET
		StartDate = CAST([Begin Time] AS datetime),
		EndDate = CAST([End Time] AS datetime)

ALTER TABLE #transactions DROP COLUMN [Begin Time];
ALTER TABLE #transactions DROP COLUMN [End Time];

 

Now that we have the transactions available for querying more readily, let's show what we need to see in an easy-to-consume format. This works best if you've restored a copy of the database in STANDBY mode to a point in time before the time of interest. If the script is run in the context of that database, the code will show you the names of the tables affected, the login SID of who made the change, and also proactively return a DBCC PAGE command for when you want to look at the raw data values. This makes it really easy to inch through the transaction log to figure out what changed using out-of-the-box tools. (Yes, there are 3rd-party tools that do this, too.)

DECLARE @startDate datetime = NULL;
DECLARE @endDate datetime = NULL;
DECLARE @minLSN varchar(22) = NULL; /* '00000000:00000000:0000' */
DECLARE @maxLSN varchar(22) = NULL; /* '00000000:00000000:0000' */

SELECT
	a.TransactionId,
	a.Seq,
	a.LSN,
	a.SID,
	a.StartDate AS TransactionStartDate,
	a.EndDate AS TransactionEndDate,
	a.Operation,
	a.TableName,
	a.FileNumber,
	a.PageNumber,
	a.SlotId,
	(
		CASE WHEN a.FileNumber IS NOT NULL THEN
			'DBCC PAGE (''' + DB_NAME() + N''', ' + CAST(a.FileNumber AS varchar(MAX)) + ', ' + CAST(a.PageNumber AS varchar(MAX)) + ', 3) WITH TABLERESULTS'
		END
	) AS DBCCPageCommand
	FROM
	(
		SELECT
			UPPER(t.[Transaction ID]) AS TransactionId,
			ROW_NUMBER() OVER(PARTITION BY t.[Transaction ID] ORDER BY t.[Current LSN]) AS Seq,
			UPPER(t.[Current LSN]) AS LSN,
			bt.StartDate,
			ct.EndDate,
			t.Operation,
			CAST(CONVERT(varbinary, UPPER(LEFT(t.[Page ID], 4)), 2) AS int) AS FileNumber,
			CAST(CONVERT(varbinary, UPPER(RIGHT(t.[Page ID], 8)), 2) AS int) AS PageNumber,
			t.[Slot ID] AS SlotId,
			o.name AS TableName,
			bt.[Transaction SID] AS SID
			FROM #transactions t
			LEFT OUTER JOIN #transactions bt ON ((bt.[Transaction ID] = t.[Transaction ID]) AND (bt.Operation = 'LOP_BEGIN_XACT'))
			LEFT OUTER JOIN #transactions ct ON ((ct.[Transaction ID] = t.[Transaction ID]) AND (ct.Operation = 'LOP_COMMIT_XACT'))
			LEFT OUTER JOIN
			(
				sys.partitions p
				INNER JOIN sys.objects o ON o.object_id = p.object_id
			) ON p.partition_id = t.PartitionId
			WHERE
				(t.Context IN ('LCX_CLUSTERED', 'LCX_HEAP')) AND
				(t.[Transaction ID] != N'0000:00000000') AND
				((@startDate IS NULL) OR (t.StartDate IS NULL) OR (t.StartDate >= @startDate)) AND
				((@endDate IS NULL) OR (t.EndDate IS NULL) OR (t.EndDate <= @endDate)) AND
				((@minLSN IS NULL) OR (t.[Current LSN] >= @minLSN)) AND
				((@maxLSN IS NULL) OR (t.[Current LSN] <= @maxLSN))
	) a
	ORDER BY
		a.StartDate,
		a.EndDate,
		a.TransactionId,
		a.LSN;

 

If you feel like playing around, there are many more fields that come back from the function; I've chosen to output the set of columns that I find most useful when I need to use this script.

Once you've identified when the change occurred, you can run a data comparison tool between the STANDBY database and the current database (or a STANDBY copy from immediately after the change).
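If you don't already have the STANDBY copy, the restore sequence looks roughly like this (a sketch only -- the database name, file paths, and STOPAT time below are hypothetical):

RESTORE DATABASE [MyDb_Standby]
	FROM DISK = N'X:\Backups\MyDb_Full.bak'
	WITH
		MOVE N'MyDb' TO N'X:\Data\MyDb_Standby.mdf',
		MOVE N'MyDb_log' TO N'X:\Data\MyDb_Standby_log.ldf',
		STANDBY = N'X:\Data\MyDb_Standby_undo.dat';

RESTORE LOG [MyDb_Standby]
	FROM DISK = N'X:\Backups\MyDb_Log.trn'
	WITH
		STOPAT = N'2012-11-05T14:30:00',
		STANDBY = N'X:\Data\MyDb_Standby_undo.dat';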

A copy of the full script can be downloaded here.

Nov 6, 2012

XML Schema Collections: the XML accelerator pedal

While some of us may disagree, I think XML is a great method of moving small-to-medium sized data sets into SQL Server.

Because of XML's generality, it can be used by pretty much any client application and technology you can think of, and this can be a big advantage to getting the job done quickly and flexibly. I personally like it most for passing in sets of parameters (i.e., multiple ids) into stored procedures, and sending in full sets of data values where it's much easier to consolidate them on the SQL side rather than using complicated application logic.
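For example, here's a minimal sketch of the multiple-ids case (the procedure and element names are hypothetical; in a real procedure the shredded rowset would typically be joined to a base table):

CREATE PROCEDURE [dbo].[GetOrders](@orderIds xml)
AS
BEGIN

	SELECT n.x.value(N'.', 'int') AS OrderId
		FROM @orderIds.nodes(N'/Ids/Id') n(x);

END
GO

EXEC [dbo].[GetOrders] @orderIds = N'<Ids><Id>1</Id><Id>2</Id><Id>5</Id></Ids>';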

The downside is that the XML has to be shredded (i.e. parsed) on the SQL Server side, which uses CPU. Out of all the tiers in our application stack, SQL Server is the most expensive to scale up, so it's important to be mindful when putting operations with large processing overhead into production. Ideally, we'd like to use XML with as little overhead as possible.

When we tell SQL Server to shred an XML document into a rowset, it can't make assumptions about what the incoming XML is going to look like. In fact, by default, SQL Server assumes there could be multiple root elements in the document (i.e., it's an XML fragment)!

By using XML Schema Collections, we tell SQL Server what to expect, and the optimizer is able to tailor the query plan to only include the necessary physical operations. A great side-effect of this is that the incoming XML is automatically validated against the schema, essentially making the input "strongly typed" like we would with regular variables.

Let's walk through an example to demonstrate just how much of a difference this can make. (Note: XML Schema Collections are available in 2005+, but this example code is written for 2008+.)

First, I'll create a table into which I'll dump the shredded data:

CREATE TABLE [dbo].[Table1]
(
	Id int IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
	Value1 int NOT NULL,
	Value2 nvarchar(50) NOT NULL
);

 

Next, I'll create the XML Schema Collection for comparison testing. You can see all the XML Schema Collections in a database through Management Studio in the Programmability | Types | XML Schema Collections folder.

CREATE XML SCHEMA COLLECTION [dbo].[Table1Rows] AS
N'<?xml version="1.0"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
	<xs:element name="Table1Rows">
		<xs:complexType>
			<xs:sequence>
				<xs:element maxOccurs="unbounded" name="Table1Row">
					<xs:complexType>
						<xs:sequence>
							<xs:element name="Value1" type="xs:integer" minOccurs="1" maxOccurs="1" />
							<xs:element name="Value2" type="xs:string" minOccurs="1" maxOccurs="1" />
						</xs:sequence>
					</xs:complexType>
				</xs:element>
			</xs:sequence>
		</xs:complexType>
	</xs:element>
</xs:schema>';
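Because the XML is validated against the collection, a document that doesn't conform is rejected outright. For example, this (deliberately bad) assignment fails with a validation error, since Value1 isn't an integer and Value2 is missing:

DECLARE @bad xml(dbo.Table1Rows) =
	N'<Table1Rows><Table1Row><Value1>abc</Value1></Table1Row></Table1Rows>';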

 

Finally, here is the meat of the test script, which builds up an XML document, and then shreds/INSERTs it into the table. To run the comparison test, uncomment the (Table1Rows) part of the @sourceXml variable declaration.

DECLARE @sourceString nvarchar(MAX) =
	CONVERT
	(
		nvarchar(MAX),
		(
			SELECT TOP 100
				CAST(v.number AS nvarchar(4)) AS Value1,
				REPLICATE('a', CAST((v.number / 50) AS int)) AS Value2
				FROM master..spt_values v
				WHERE v.type = 'P'
				FOR XML PATH('Table1Row'), ROOT('Table1Rows')
		)
	);

DECLARE @sourceXml xml/*(Table1Rows)*/ = @sourceString;

DECLARE @i int = 100;
DECLARE @startTime datetime2;
DECLARE @endTime datetime2;

SET NOCOUNT ON;

TRUNCATE TABLE [dbo].[Table1];

SET @startTime = SYSDATETIME();

WHILE @i >= 0
BEGIN

	INSERT INTO [dbo].[Table1](Value1, Value2)
		SELECT
			n.x.value(N'Value1[1]', 'int'),
			n.x.value(N'Value2[1]', 'nvarchar(50)')
			FROM @sourceXml.nodes(N'Table1Rows[1]/Table1Row') n(x);
			
	SET @i -= 1;
	
END

SET @endTime = SYSDATETIME();

SELECT DATEDIFF(MILLISECOND, @startTime, @endTime);

 

While the results on your machine will no doubt vary versus mine, my test on a 2008 R2 instance took about 3,100 milliseconds without using the XML Schema Collection, and about 280 ms with. This is a huge improvement! And we can see the difference in the query plans.

Before:

Query Plan Before

 

After:

Query Plan After

 

When I started using XML Schema Collections, I was only using them to strongly-type my XML, but it turns out that shredding the XML can see a significant performance improvement as a side-effect. I've read that strongly-typing XML slows down writes (modifications) to XML variables, but I have yet to encounter a situation where that was necessary, or indeed, a good idea.

If you do start using XML Schema Collections in your own applications, I have a slight caution: you may want to avoid strongly typing parameters that are exposed in the database API (i.e., stored procedure parameters), even though it's technically correct to construct the parameters that way. The reason is that XML Schema Collections, once created, cannot be altered in-place (there is an ALTER XML SCHEMA COLLECTION, but it doesn't do what you might expect). To change the definition, a collection must be dropped and recreated, which means that all objects with strong dependencies must be dropped and recreated as well. I think this is a royal pain -- feel free to vote up the Connect feature request to add the expected functionality.
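To make that concrete, a definition change looks roughly like this (a sketch with hypothetical names -- assume dbo.StronglyTypedProc has a parameter typed with the collection):

-- Anything strongly typed with the collection has to be dropped first
DROP PROCEDURE [dbo].[StronglyTypedProc];
DROP XML SCHEMA COLLECTION [dbo].[MySchemaCollection];

-- Recreate the collection with the new definition (a trivial schema shown here)
CREATE XML SCHEMA COLLECTION [dbo].[MySchemaCollection] AS
N'<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
	<xs:element name="Root" type="xs:string" />
</xs:schema>';

-- ...and then recreate the dependent objects.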

A workaround is to weakly type the API parameters, but immediately cast the parameter to a strongly-typed local variable in the procedure body, like so:

CREATE PROCEDURE [dbo].[Proc1](@p1 xml)
AS
BEGIN

	DECLARE @realP1 xml(MySchemaCollection) = @p1;
	
	...

 

From an API point of view, that's not the best solution as the strong type isn't exposed to the outside, but IMO, it's a good enough tradeoff from a maintenance point of view, particularly if the Schema Collection is reused in many places.

Oct 31, 2012

Managing the size of the transaction log

When we manage databases in either the FULL or BULK_LOGGED recovery models, we need to pay particular attention to the size of the transaction log files. If our processes aren't optimal, we can see log files grow either out of control, or beyond what we think is a reasonable size.

 

Virtual Log Files

As I mentioned in a previous post, the transaction log contains an ordered sequence of the physical operations that occur on a database.

What I didn't mention is that the physical transaction log file also contains logical sections, called virtual log files (VLFs). It's the VLFs which actually contain the physical operations I mentioned. The purpose of VLFs is to allow SQL Server to more efficiently manage the log records; specifically, to know which portions of the transaction log are used, and which aren't.

Knowing the portions of the log that are used is important when we go to take a transaction log backup, which creates a copy of all the transaction log records since the last transaction log backup. After a transaction log backup, as long as all the log records within a VLF are not part of an active transaction (or are required for some other feature, such as replication), the VLF can be marked as unused/inactive. This allows SQL Server to reuse that portion of the log file. This process goes by many names, including clearing the log and log truncation. It does not affect the physical size of the log file.
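To see this in action, here's a quick sketch (the database name is hypothetical, and DBCC LOGINFO is undocumented): take a log backup, then look at the Status column of DBCC LOGINFO -- 2 means the VLF is still active, 0 means it can be reused.

BACKUP LOG [MyDb] TO DISK = N'X:\Backups\MyDb_Log.trn';

-- One row per VLF in the log file
DBCC LOGINFO ('MyDb');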

Problems only start happening if all the VLFs end up being used, and none are available¹. This means the log file (the physical file) has to grow to accommodate more transactions. A physical log file growth automatically creates new unused VLFs, and so the transactions can continue.

 

What is causing the log to grow after I shrank it?

Any write operation in the database can potentially cause the log file to grow. The real question is: why are all the VLFs used up when that's not what is expected?

Here are some of the problems you might encounter:

  • Not taking transaction log backups frequently enough, or not taking them at all. The only way to mark VLFs as inactive is to take a transaction log backup (again, only for FULL and BULK_LOGGED databases). If transaction log backups are only taken once/day, besides exposing yourself to a full day of data loss, you also need a transaction log file large enough to hold all the transactions that occur for a full day. That could be quite a bit of space! The solution is obvious, and it's a win for recoverability, too: take transaction log backups more frequently.
  • Overly aggressive and/or too frequent index maintenance. I'm certainly a proponent of index maintenance, but it's very easy to get carried away. Maintenance plans that rebuild shiny new indexes for you every night are both ridiculously easy to set up, and also ridiculously bad for the size of the transaction log. Typically, rebuilding all the indexes of a database takes more log space than the data files actually take up. If you're dealing with databases of even small-to-moderate sizes (say, 1 GB and larger), this can add up to a tremendous amount of wasted storage space, because all that transaction log has to be backed up at some point, and the log file will likely end up larger than the data files. What I strongly recommend doing is putting in place a much more selective index maintenance process, such as Ola Hallengren's index maintenance scripts, which I use in production.
  • The database is in FULL or BULK_LOGGED when it actually should be in SIMPLE. Before you go flipping the settings around (which can potentially cause data loss), go read about choosing the right recovery model for the database. Typically, I see this happen when a production backup is restored to a development environment, and a backup gets taken for some reason. And then the transaction log ends up growing, and growing, and growing as developers test and roll back their work, right up until things screech to a halt when the drive that contains the log file has filled up.
  • A SQL Server feature, such as replication or mirroring, is enabled. Until the transactions have been copied to the secondary server, the records cannot be cleared from the transaction log. If the secondary is slow applying changes, or if you don't even have a secondary any more and things weren't cleaned up correctly, this could lead to infinite transaction log growth. To solve this really depends on where the specific problem is -- solve that first, and then see if there is still an issue with the transaction log.

 

If none of those cover your scenario, you can run the following query, which will tell you why the transaction log is still in use.

SELECT
    name,
    log_reuse_wait_desc,
    recovery_model_desc
    FROM sys.databases

 

¹ Extra empty log space is pre-allocated such that there is always enough space for active transactions to be able to roll back. Regardless, that space is "used" in the sense that it's been spoken for already.

Oct 25, 2012

Consuming a secure web service in SSIS

If you need to use SSIS to consume methods of a web service that require a client certificate, the first thing you need to know is this: the Web Service Task will not get you there. (Regardless of its... other issues.)

The Properties GUI is misleading in that you can specify a certificate to test the connection and evaluate the methods, but that's as far as it goes -- the certificate information isn't passed along to the underlying HTTP Connection Manager at runtime, and you end up with "403 forbidden" errors for no apparent reason.

The HTTP Connection Manager does have a very tantalizing Certificate property... which can't be set using an Expression. (Or at least I haven't figured out how.)

We would have to resort to using a Script Task (or Script Component) to set the Certificate property, but going that route, it's actually easier to take a different approach entirely within the task.

First, though, let's take a step back, because we still need a way to get the certificate so it can be used with the secure service.

Below is a Script Task function that will return a reference to a certificate based on a certificate store location and a certificate serial number (run certmgr.msc to view the local certificate store).

Note that this is not a complete solution! You'll probably want to keep this code in its own script task which sets a package variable, so the certificate is available for all the web service calls you need to make. Also, it would be a good idea to externalize the input parameters so your package is configurable. I'm showing it this way here for simplicity.

 

using System.Security.Cryptography.X509Certificates;

private X509Certificate2 _GetClientCertificate(
	StoreLocation storeLocation, string serialNumber)
{
	X509Store store = new X509Store(storeLocation);
            
	store.Open(OpenFlags.ReadOnly);

	try
	{
		foreach (X509Certificate2 cert in store.Certificates)
		{
			if (cert.SerialNumber.CompareTo(serialNumber) == 0)
				return cert;
		}

		// No match
		throw new ArgumentException();
	}
	finally
	{
		store.Close();
	}
}

 

The next step is to configure a Script Task to actually call the web service. First, create the new task or component and go into the Visual Studio code editor. Right-click on the project file, and use Add Web Reference to generate proxy classes for your web service.

Now, here is where I've had a bit of frustration. Sometimes exiting out of Visual Studio at this point does not correctly save the project file, and you end up with the web reference files in the file system, but not actually in the project. There's no way to "add" them back to the project the way they were. Sadly, the easiest way I've found to clean it up... is to start again with a new Script Task. So what I've tried to do is use the Save All function to basically hope and pray that it sticks, then exit out, and go back in to make sure the folder still appears in the project. If it's still there, we're good to proceed.

At this point, try to build the project by using the Build | Build st_<guid> menu item. If you get an error "Task failed because "sgen.exe" was not found, or the correct Microsoft Windows SDK is not installed. ..." open the project properties, go into the Build tab, and change the Generate serialization assembly option to Off. The project should build successfully now.

So after all this leadup, here is the code to actually consume the web service in the Script Task (or Script Component):

using Services = st_<guid>.csproj.MyService;

Services.MyService svc = new Services.MyService();
svc.Url = "https://www.mydomain.com/MyService.svc";
svc.ClientCertificates.Add(
	_GetClientCertificate(StoreLocation.CurrentUser, "123456"));

svc.MyServiceMethod();

 

If you need to make many calls to the same web service, it's possible to add a reference to an external assembly in the Script Task project file, instead of generating the proxy classes directly inside the project. While the steps needed to do this are beyond the scope of this post, a common assembly is a great way to centralize the logic and service references in a larger project.