Efficiently Indexing Long Character Strings

Sometimes we face the requirement to search a table for a specific string value in a column, where the values are very long. The default answer is to create an index, but we must remember that index keys cannot exceed a 900-byte size limit, and LOB columns (which include the MAX type variants) cannot be in the index key at all. In this post, we’ll explore a method to efficiently index long character strings where the search operation is moderately-to-highly selective.

The Methodology

Since we’re faced with a size restriction on the index key (or the string column cannot be included in the key at all), we need to apply some kind of transformation to the string data to make its representation — and hence the index key size required — smaller.

Of course, we can’t simply compress an infinite amount of data into a comparatively very small field, so the index we create won’t be able to support all the same operations as if we had indexed the original field. But most of the time, that isn’t a problem — generally, we need to search for an exact match, or a string starting (or maybe even ending) with a given few characters.

We’re going to do this by performing a computation on the original values, storing the result in a new column in the table, and creating an index on that column (there are two ways to do this, which we’ll get into shortly). In terms of the computation itself, there are a few different options that can be used, depending on what needs to be accomplished. I’ll introduce two basic approaches here, and then we’ll run through some code that demonstrates one of them.

  1. Use the LEFT function to grab the first several characters and discard the rest. This method is appropriate for both exact matching and wildcard searching with a few given characters (i.e., LIKE ‘abc%’), and because the original text stays intact, it supports case/accent-insensitive matching without doing anything special. The drawback of this method, though, is that it’s really important to know the data. Why? To improve the efficiency of queries, the number of characters used should be enough that the computed values are more or less unique (we’ll see why in the example). It may in fact be the case that the original field always starts with a consistent set of characters (maybe the values are RTF documents or something), in which case most of the computed values would end up being duplicates, this approach won’t work, and you’ll want to use the other approach instead. (A quick sketch of this approach follows this list.)
  2. Use a traditional hash function (such as HASHBYTES) to compute a compact binary representation of the original field. This method supports exact matching (case-insensitivity with a small tweak, and accent-insensitivity with a bit of code), but not partial matching, because the original text is lost and there’s no correlation between the hash value and the original text. It’s also critical to watch the data types involved here, because hash functions operate on bytes, not characters. This means hashing an nvarchar(MAX) value N’abc’ will not give the same result as hashing a varchar value ‘abc’, because the former is always 2 bytes per character. Those are the downsides. The upside is that traditional hashing works even when values differ in only a few characters, or when the text starts with the same sequence of characters.
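To make the first (LEFT-based) approach concrete, here’s a minimal sketch using a computed column; the column name, index name, and prefix length are illustrative, and the table mirrors the dbo.Table1 used later in this post. The prefix length should be long enough that the computed values are nearly unique in your data.

-- Option 1 sketch: index the first 50 characters of the string.
-- CONVERT keeps the computed column a plain varchar(50) so it can be an index key.
ALTER TABLE [dbo].[Table1]
	ADD StringPrefix AS CONVERT(varchar(50), LEFT(StringCol, 50));
GO

CREATE NONCLUSTERED INDEX IX_Table1_StringPrefix
	ON [dbo].[Table1](StringPrefix);
GO

-- Prefix search: seek on the short prefix column, then re-check the full value.
-- (If the search string were longer than 50 characters, it would need to be
-- trimmed to 50 for the StringPrefix predicate.)
SELECT Id, StringCol
	FROM [dbo].[Table1]
	WHERE
		(StringPrefix LIKE 'The quick brown fox%') AND
		(StringCol LIKE 'The quick brown fox%');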

Both of these methods are hash functions, as they convert a variable length input to a fixed length output. I suppose the first approach isn’t technically a hash function because the input could be shorter than the number of characters desired in the output, but I think you see my point that both approaches “sample” the original data in a way that allows us to represent the original value more or less uniquely without needing to carry around the original data itself. Maybe it’s more correct to speak of them as lossy compression algorithms. In any event, I will refer to the results of these functions as hash values for the remainder of the post.

Because this method omits some (or probably most) of the original data, some extra complexity must be introduced in our code to compensate for the possibility of duplicate hash values (called hash collisions). This is more likely to happen with the first approach, since only a portion of each original value is kept. It’s still possible, however rare, for the second approach to produce duplicates too, even if the input text is unique, so we must accommodate that.

Speaking specifically about using a traditional hash function, such as HASHBYTES, there are a couple of major things to consider. First, there are several different algorithms to choose from. Because we aren’t going to be using the hash value for any type of cryptographic functionality (i.e., we don’t care how difficult it would be to “unhash” the values), it’s largely irrelevant which algorithm we choose, except to pick one that produces a hash value of an acceptable size. A larger hash value reduces the possibility of collisions, as there are more possible combinations, but as I already mentioned, we have to handle collisions anyway because no hash function is perfect; therefore, it’s okay to use a smaller hash value to save storage space, and to pick a hash algorithm that is inexpensive to compute.1
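For reference, here’s a quick way to compare the output sizes of a few of the built-in algorithms (the SHA2_* algorithms require SQL Server 2012 or later):

SELECT
	DATALENGTH(HASHBYTES('MD5', 'abc'))      AS MD5_Bytes,      -- 16
	DATALENGTH(HASHBYTES('SHA1', 'abc'))     AS SHA1_Bytes,     -- 20
	DATALENGTH(HASHBYTES('SHA2_256', 'abc')) AS SHA2_256_Bytes; -- 32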

Second, we need to make sure the hash values represent a consistent portion of the input string values. HASHBYTES is great because it’s built in, but its input parameter is limited to 8000 bytes. This means that if the string values to be hashed can be longer than 8000 bytes (8000 varchar characters, or 4000 nvarchar characters), a slightly different approach is needed. I would recommend creating a CLR scalar function, as the .NET hashing classes have no input size limitation.
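If deploying a CLR assembly isn’t an option, one possible pure T-SQL workaround is to hash the value in 8000-byte chunks and then hash the concatenation of the chunk hashes. The function below is only a sketch (the name is made up, and I haven’t benchmarked it); a CLR function will generally be simpler and faster.

CREATE FUNCTION [dbo].[HashLongVarchar](@input varchar(MAX))
RETURNS binary(16)
AS
BEGIN
	DECLARE @chunkHashes varbinary(MAX) = 0x;
	DECLARE @offset bigint = 1;
	DECLARE @length bigint = DATALENGTH(@input);

	-- Hash each 8000-byte chunk and append the 16-byte result.
	WHILE @offset <= @length
	BEGIN
		SET @chunkHashes = @chunkHashes +
			HASHBYTES('MD5', SUBSTRING(@input, @offset, 8000));
		SET @offset = @offset + 8000;
	END;

	-- At 16 bytes per chunk, @chunkHashes stays under the 8000-byte HASHBYTES
	-- input limit for inputs up to about 4 million bytes.
	RETURN CONVERT(binary(16), HASHBYTES('MD5', @chunkHashes));
END;
GO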

The Test

As I mentioned previously, there are two ways to store the hash values so they can be indexed:

  1. Computed column. This method has two advantages: first, the SQL Server engine automatically keeps the hash value up to date if the source column changes. Second, if we use a non-PERSISTED column, we save the storage space for the hash values in the base table. The main disadvantage is that we may have to poke and prod our SQL statements to get the plan shape we want; this is a limitation of computed columns in general, not of what we’re doing here specifically. (A sketch of this option appears after this list.) Note: if your goal is to create hash values in two tables and JOIN them together on that column, it may not be possible to do it efficiently using computed columns because of bizarre optimizer behaviour. Any time computed columns are involved, always inspect the execution plan to make sure SQL Server is doing what’s expected — sometimes common sense doesn’t make it into the execution plan that gets run.
  2. Base table column, updated by triggers (if required). This method works best for getting the execution plan shapes we want, at the expense of a bit of complexity to keep the values updated, and the additional storage space required in the base table.
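For reference, the computed column version of option 1 might look something like the following (a sketch only; the column and index names are illustrative, it assumes the dbo.Table1 table created in the next section, and you’d use either this or the base table column, not both):

ALTER TABLE [dbo].[Table1]
	ADD StringHashComputed AS CONVERT(binary(16), HASHBYTES('MD5', StringCol));
GO

-- HASHBYTES is deterministic and precise, so the non-PERSISTED computed column
-- can be used as an index key even though the source column is varchar(MAX).
CREATE NONCLUSTERED INDEX IX_Table1_StringHashComputed
	ON [dbo].[Table1](StringHashComputed);
GO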

In the example below, we’ll use the base table column method, as it’s the most flexible, and I won’t have to go into detail about hinting in the SQL statements.

Let’s start by creating a table to play with:

CREATE TABLE [dbo].[Table1]
(
	Id int NOT NULL IDENTITY
		PRIMARY KEY CLUSTERED,
		
	StringCol varchar(MAX) NOT NULL
);
GO

And then we’ll populate it with some test data (note: the total size of the table in this demo will be about 70 MB):

SET NOCOUNT ON;
GO

INSERT INTO [dbo].[Table1](StringCol)
	SELECT
		REPLICATE('The quick brown fox jumps over the lazy dog. ', 5) +
			CONVERT(varchar(MAX), NEWID())
		FROM master..spt_values v
		WHERE v.type = 'P';
GO 64
	
INSERT INTO [dbo].[Table1](StringCol)
	VALUES ('The quick brown fox jumps over the lazy dog.');
GO 3

So we’ve got about 130,000 randomized strings, and then 3 rows which we can predict. Our goal is to efficiently find those 3 rows in the test queries.

Since the string field was defined as varchar(MAX), we can’t create an index on that column at all, and the only index we have in the base table is the clustered index on the Id column. If we try a naive search at this point, we get a table scan, which is the worst case scenario:

DECLARE @search varchar(MAX) =
	'The quick brown fox jumps over the lazy dog.';

SELECT Id, StringCol
	FROM [dbo].[Table1]
	WHERE StringCol = @search;

[Screenshot: execution plan showing a Clustered Index Scan feeding a Filter operator]

We get our 3 rows back, but this is far from ideal. I’m not sure why the optimizer chooses a scan and filter instead of pushing down the predicate into the scan operator itself, but that’s beside the point because this is an awful plan, with an estimated cost of over 3.5.

Let’s move on and create our hash value column and populate it. Note that if you want to update the hash values using triggers and keep the column NOT NULL, you’ll need to add a default constraint so INSERTs don’t fail outright before the trigger code gets a chance to run.

ALTER TABLE [dbo].[Table1]
	ADD StringHash binary(16) NULL;
GO
	
UPDATE [dbo].[Table1]
	SET StringHash = CONVERT(binary(16), HASHBYTES('MD5', StringCol));
GO

ALTER TABLE [dbo].[Table1]
	ALTER COLUMN StringHash binary(16) NOT NULL;
GO
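If you do choose to maintain the hash with triggers, a minimal sketch might look like the following (the trigger name is illustrative; remember that with the NOT NULL column, INSERTs also need a default constraint on StringHash, as noted above):

CREATE TRIGGER [dbo].[TR_Table1_StringHash]
	ON [dbo].[Table1]
	AFTER INSERT, UPDATE
AS
BEGIN
	SET NOCOUNT ON;

	IF UPDATE(StringCol)
	BEGIN
		-- Recompute the hash only for the rows that were touched.
		UPDATE t
			SET StringHash = CONVERT(binary(16), HASHBYTES('MD5', t.StringCol))
			FROM [dbo].[Table1] t
			INNER JOIN inserted i ON i.Id = t.Id;
	END;
END;
GO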

We have our hash values in the table, but still no index to improve our query. Let’s create the index now:

CREATE NONCLUSTERED INDEX IX_Table1_StringHash
	ON [dbo].[Table1](StringHash);

While this looks unremarkable, the index has been created non-unique on purpose — first and foremost to accommodate duplicate source (and hence hash) values, but also to allow for the possibility of hash collisions. Even if your source values are guaranteed unique, this index should be non-unique. Other than that, this should be enough to let us construct a query to make the search more efficient:

DECLARE @search varchar(MAX) =
	'The quick brown fox jumps over the lazy dog.';

SELECT Id, StringCol
	FROM [dbo].[Table1]
	WHERE
		(StringHash = CONVERT(binary(16), HASHBYTES('MD5', @search))) AND
		(StringCol = @search);

You’ll notice I’ve still included the StringCol = @search predicate in the WHERE clause — this is to ensure correct results of the query due to hash collisions. If all we did was compare the hash values, we could end up with extra rows in the results. Here’s the execution plan of the query above:

[Screenshot: execution plan showing an Index Seek on IX_Table1_StringHash with a Key Lookup against the clustered index]

We got an index seek, which was the main thing we were looking for. The Key Lookup is expected here, because we have to compare the original values as well, and those can only come from the base table. You can now see why I said this method only works for moderately-to-highly selective queries: the key lookup is required, and if too many rows are selected, that random I/O can kill performance (or the optimizer may revert to a table scan). In any event, we now have an optimal query, and even with the key lookups happening, the estimated cost was 0.0066, an improvement of over 500x on this smallish table.

Takeaways

I’m sure at this point you can think of several other uses for this methodology; it can be applied to large binary fields, too. You could even implement it for cases where the column could be indexed directly (say, varchar(500), with most of the values taking up nearly all the allowed space), but maybe you don’t want to — hashing the data might save a significant amount of storage space, of course at the expense of code complexity and a bit of query execution time.

While hashing data is nothing new, this technique exploits hashing to greatly improve the efficiency of data comparisons. I’ll leave it as an exercise for the reader to try the JOINing scenario I mentioned above — if you do, try it using computed columns first for maximum learning — you’ll find that this technique also improves the queries, as long as the join cardinality remains low (like in the 1-table scenario, to control the number of key lookups).
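As a starting point for that exercise, here’s a sketch of the base table column version of the join, assuming a second table dbo.Table2 built the same way (StringCol, a populated StringHash column, and the same nonclustered index):

SELECT t1.Id AS Id1, t2.Id AS Id2
	FROM [dbo].[Table1] t1
	INNER JOIN [dbo].[Table2] t2
		ON t2.StringHash = t1.StringHash
		AND t2.StringCol = t1.StringCol; -- residual comparison guards against hash collisions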

 

1 While I haven’t tested this in SQL Server, either using HASHBYTES or a CLR function, some processors implement certain hash algorithms in hardware. This may help reduce CPU usage, possibly at the expense of storage, since the hardware-accelerated algorithms may produce larger hash values than the one you would otherwise choose.

Will the Real Query Text Please Stand Up?

(While I can’t take credit for finding this issue originally — I heard about it several months ago — for the life of me I can’t remember or find where it was mentioned, so if you know of a reference, please leave a comment.)

When I first mentioned that developers should try to move away from using single-line comments (using --) and towards block comments (using /* */) in T-SQL, it was understandable that I got a bunch of “ummm… what?” looks.

All comments are created equal, right?

Well, obviously not, or this post wouldn’t need to exist.

Aside from the obvious differences, there’s a particular issue with single-line comments that’s indirect and somewhat subtle, yet it can be significant. If you’ve worked with SQL Server Profiler for any length of time, there is a decent chance you’ve come across this issue already. Whether you realized it or not is a separate question entirely.

Onwards.

The problem is that Profiler word-wraps captured SQL statements according to its window size. While that might seem okay at first glance, the fact is that artificially introducing extra newlines when the statement syntax depends on the arrangement of newlines… can lead to interesting things. Not only that, but the syntax highlighting is applied after the word-wrapping takes place, which exacerbates the problem.

Observe:

SELECT
	fulltext_catalog_id--, name, path, is_default, is_accent_sensitivity_on, data_space_id, file_id, principal_id, is_importing
	FROM sys.fulltext_catalogs

This query selects a single column from sys.fulltext_catalogs, while the other columns are commented out.

If I shrink the Profiler window (both to demonstrate the problem, and so screenshots nicely fit onto this page) and run the query, the statement appears like this:

That… doesn’t look right.

If you had only seen the query in Profiler and didn’t know what the original query was, do you know which query was actually executed?

The really dangerous thing is that if you inspect the erroneous-looking query closely, you’ll find that it is a different, yet syntactically valid, query.

SELECT
	fulltext_catalog_id--, name, path, is_default, is_accent_sensitivity_on,
data_space_id, file_id, principal_id, is_importing
	FROM sys.fulltext_catalogs

(If you don’t see it, this query aliases the fulltext_catalog_id column as data_space_id, and returns 4 columns in the result set.)

Here’s what you see in Profiler when you run it:

Looks familiar, doesn’t it?

Tips:

  • Use block (/* */) comments wherever possible in production code, particularly if the comment occurs within part of a query (see the example after these tips). IMO, it’s easier and safer to just get in the habit of using block comments all the time rather than picking and choosing where to use each type, or having to teach this kind of esoteric information. I rarely take issue with personal coding style as long as code is legible, but this is an exceptional circumstance that can directly affect things beyond readability.
  • Keep lines short, and break up long lines (this is a good practice for readability anyway). This can be accomplished in 99.999% of cases by using a combination of structured parentheses, indentation, and aliases.
  • Any time you see a single-line comment (--) within the query text in Profiler, don’t assume what you’re being shown is what was actually executed. Copy the TextData cell (Ctrl+Shift+C) into a Management Studio query window to see the real query text.
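For example, the block comment version of the earlier query is safe no matter how Profiler wraps it, because the comment is explicitly delimited:

SELECT
	fulltext_catalog_id/*, name, path, is_default, is_accent_sensitivity_on, data_space_id, file_id, principal_id, is_importing*/
	FROM sys.fulltext_catalogs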

(Note: the 2008 R2 version of Profiler is shown above; the 2012 version also exhibits this behaviour.)

Fun with Multi-Column Foreign Keys and Nullable Columns

This is more of a reminder/warning/spread-the-word post to be really, really careful when dealing with multi-column foreign key relationships when one or more of the foreign columns allows NULL values.

Things… may not work as you might expect.

Consider the following script:

CREATE TABLE PrimaryTable
(
	Col1 int NOT NULL,
	Col2 int NOT NULL,
	
	CONSTRAINT PK_PrimaryTable
		PRIMARY KEY(Col1, Col2)
);

CREATE TABLE ForeignTable
(
	Id int IDENTITY PRIMARY KEY,
	Col1 int NOT NULL,
	Col2 int NULL,
	
	CONSTRAINT FK_ForeignTable_PrimaryTable
		FOREIGN KEY(Col1, Col2) REFERENCES PrimaryTable(Col1, Col2)
);

INSERT INTO PrimaryTable(Col1, Col2)
	VALUES (0, 0), (1, 1);


-- Statement 1
INSERT INTO ForeignTable(Col1, Col2)
	VALUES (0, 0);

-- Statement 2
INSERT INTO ForeignTable(Col1, Col2)
	VALUES (0, 1);

-- Statement 3
INSERT INTO ForeignTable(Col1, Col2)
	VALUES (0, NULL);

Statement 1 is standard, and succeeds as we’d expect. Similarly as expected, Statement 2 fails with a constraint violation as there’s no row in PrimaryTable that has that combination of values.

What about Statement 3? If there was no foreign key, (0, NULL) would be an allowed combination of values, but what happens when there is a foreign key?

The answer is that the insert still succeeds.

I had expected it would fail just like Statement 2, because there isn’t a (0, NULL) combination in PrimaryTable. In fact, it’s not even an allowed combination to put in that table at all!

What happens is that NULL values are completely ignored during checking, and the constraint is satisfied as long as the non-NULL values match up to at least 1 row in the primary table. Edit: this is incorrect. If there are any NULLs in the foreign table columns, the constraint is not checked at all, which is even more perilous.
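To see just how thoroughly the check is skipped, consider one more statement (a quick illustration; Col1 = 5 doesn’t exist anywhere in PrimaryTable):

-- Statement 4: succeeds, even though no row in PrimaryTable has Col1 = 5,
-- because the NULL in Col2 means the constraint is never evaluated.
INSERT INTO ForeignTable(Col1, Col2)
	VALUES (5, NULL);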

Multi-column foreign keys aren’t exactly the best practice in the world as far as schema design goes, but you do encounter them in the wild occasionally. If a database is suspect from an integrity point-of-view (which usually means it has many NULLable columns), be aware of this behaviour when doing data cleanup or querying. The corollary is that if this kind of schema is found in a database, poor integrity may not be very far away either.

Add this to your own Blitz script:

SELECT
	fk.name AS ConstraintName,
	COUNT(*) AS ColumnCount
	FROM sys.foreign_keys fk
	INNER JOIN sys.foreign_key_columns fkc ON fkc.constraint_object_id = fk.object_id
	GROUP BY fk.name
	HAVING COUNT(*) >= 2;