So I started documenting my observations on GUIDs and how they are used in databases in a previous entry and it’s time to get back to this topic. The original entry which discusses what GUIDs are and the model they typically follow can be found here: http://nerdnotes.net/blog/PermaLink,guid,0ab98866-4d22-4874-bd1d-10ab6fa483b6.aspx
This entry is going to focus on why it’s bad to use a GUID as the primary key on a SQL Table. As I noted in my previous entry, one of the characteristics of GUIDs that make them unique is that they are generated randomly. On the other hand a primary key in SQL Server uses a clustered index. A clustered index describes an index where each of the entries in the index is kept in sorted order. Specifically the idea is that since entries are inserted into the index in sorted order, it is possible to very quickly find an entry without needing to scan the entire index.
Key to this discussion is an understanding of the implications of a clustered index. As with any index in a database, it takes extra time to add an entry into an index. The advantage of an index is that they speed up lookup on a table. Sometimes, indexes are unique (such as the one for the primary key) and in some cases they are clustered. Each of these requirements can slow the amount of time needed to add the latest entry to that index. The result is that when we work with a table and we start making decisions on what should and shouldn’t be indexed we are looking to balance the cost of doing an insertion or update vs. the speed of doing a lookup.
Of the indices you place on your table, the one with the most direct impact on the speed of your database's query is the clustered index which is by default on the primary key. It is possible to create a primary key that doesn’t use a clustered index, but this is a rare event and not even considered by most DBAs and developers. What's important however is the unique characteristic of a clustered index. Clustered indices are sorted, that is each entry is physically ordered based on it’s sort order. This is where the difference between an identity/integer based primary key and a GUID based key becomes vital.
The way that GUIDs are generated randomly means that unlike an identity column where the next value always sorts after the preceding value, GUIDs are random. Thus when you need to add this new entry to your clustered index, instead of SQL Server appending the latest entry to the bottom of the index which is a quick and relatively painless operation, it is constantly attempting to insert your new entry into the middle of the index. Doesn’t sound like a big deal, well it is because it triggers block splits and a lot of additional information. Not only that but a 4 or 8 byte integer value takes up a lot less space then does a 64 byte GUID.
Yes, space, not something we normally worry about anymore, and from the standpoint of your database as a whole it’s not an issue. It's an issue because of the fact that you are putting this value in an index. The whole point of an index is that multiple entries can be quickly accessed because each entry in the index is small, but with GUID values aren't small.
So this accounts for two easy to verify technical reasons why GUIDs are a poor choice for use as the primary key on your table. Other reasons include the fact that since SQL Server and .NET v1.1 have slightly different physical representations for GUIDs you are constantly converting between the formats, and the fact that identity values are much more human readable in a test environment as opposed to GUIDs.
But aside from the arguments against GUIDs what are the arguments for GUIDs. Well the first is that since behind the scenes SQL Server applies a row identifier in the form of a GUID you are actually saving space by re-using that row GUID for your primary key. Of course the fallacy here is that just because SQL Server has assigned a value behind the scenes to the row doesn’t mean you are using that value in your index. As a result your indexing is still more expensive.
More importantly is a common security issue in web based applications. HTML/CGI developers and their future counterparts in ASP and ASP.NET need a way to identify the user between HTTP requests. The challenge here is that each HTTP request needs to carry an identifier for the current record in the database, so there is a huge temptation to use the primary key for that record. The problem is that if you are using identity columns and are passing those integer values, it is very easy for a hacker to manipulate that integer value by adding or subtracting 1 and getting someone else’s record. This is VERY bad.
So one way to resolve this was to use GUID values as the primary key. This way a hacker couldn’t just shift a digit and get another record, of course they could start guessing at the correct ID for another user or record, but the odds of guessing correctly are incredibly small. Unfortunately this solution still has a flaw, which is that once a hacker has that identity value even though it’s a GUID security is still compromised. Fortunately there is a simple and much better solution for .NET developers. The Session ID allows you to associate each request with data maintained on the server.
Session values are temporary and as a result resolve the security issues seen with passing identity columns across the web. The only remaining reason (that I know of off the top of my head) for using GUIDs has to do with resolving differences across multiple independent databases. This is another large topic, and so I’m going to put off that discussion until a future post, but suffice it to say for the current discussion that this remaining reason has nothing to do with transaction databases. Unlike a database which is being used to hold records from multiple different unique sources your transaction database doesn’t need to be concerned with integrating entries from another source… but as I said that’s a topic for another post.