40

After asking this question comparing sequential and non-sequential GUIDs, I tried to compare the INSERT performance on 1) a table with a GUID primary key initialized sequentially with newsequentialid(), and 2) a table with an INT primary key initialized sequentially with identity(1,1). I would expect the latter to be faster because of the smaller width of integers, and it also seems simpler to generate a sequential integer than a sequential GUID. But to my surprise, INSERTs on the table with the integer key were significantly slower than on the table with the sequential GUID key.

These are the average times (in ms) for the test runs:

NEWSEQUENTIALID()  1977
IDENTITY()         2223

Can anyone explain this?

The following experiment was used:

SET NOCOUNT ON

CREATE TABLE TestGuid2 (Id UNIQUEIDENTIFIER NOT NULL DEFAULT NEWSEQUENTIALID() PRIMARY KEY,
SomeDate DATETIME, batchNumber BIGINT, FILLER CHAR(100))

CREATE TABLE TestInt (Id Int NOT NULL identity(1,1) PRIMARY KEY,
SomeDate DATETIME, batchNumber BIGINT, FILLER CHAR(100))

DECLARE @BatchCounter INT = 1
DECLARE @Numrows INT = 100000


WHILE (@BatchCounter <= 20)
BEGIN 
BEGIN TRAN

DECLARE @LocalCounter INT = 0

    WHILE (@LocalCounter <= @NumRows)
    BEGIN
    INSERT TestGuid2 (SomeDate,batchNumber) VALUES (GETDATE(),@BatchCounter)
    SET @LocalCounter +=1
    END

SET @LocalCounter = 0

    WHILE (@LocalCounter <= @NumRows)
    BEGIN
    INSERT TestInt (SomeDate,batchNumber) VALUES (GETDATE(),@BatchCounter)
    SET @LocalCounter +=1
    END

SET @BatchCounter +=1
COMMIT 
END

DBCC showcontig ('TestGuid2')  WITH tableresults
DBCC showcontig ('TestInt')  WITH tableresults

SELECT batchNumber,DATEDIFF(ms,MIN(SomeDate),MAX(SomeDate)) AS [NEWSEQUENTIALID()]
FROM TestGuid2
GROUP BY batchNumber

SELECT batchNumber,DATEDIFF(ms,MIN(SomeDate),MAX(SomeDate)) AS [IDENTITY()]
FROM TestInt
GROUP BY batchNumber

DROP TABLE TestGuid2
DROP TABLE TestInt

UPDATE: Modifying the script to perform the insertions via a TEMP table, as in the examples by Phil Sandler, Mitch Wheat and Martin below, I also find that IDENTITY is faster, as it should be. But that is not the conventional way of inserting rows, and I still do not understand why the experiment went the other way at first: even if I omit GETDATE() from my original example, IDENTITY() is still far slower. So it seems that the only way to make IDENTITY() outperform NEWSEQUENTIALID() is to prepare the rows to insert in a temporary table and then perform the many insertions as a single batch insert from that temp table. All in all, I don't think we have found an explanation for the phenomenon, and IDENTITY() still seems to be slower for most practical usages. Can anyone explain this?

someName
  • A thought only: Could it be that generating a new GUID can be done without involving the table at all, whereas getting the next available identity value introduces some kind of lock temporarily to ensure two threads/connections won't get the same value? I'm just guessing really. Interesting question! – Lasse V. Karlsen May 13 '11 at 22:06
  • Who says they do?? There's a lot of evidence they don't - see Kimberly Tripp's "Disk space is cheap - that's NOT the point!" blog post - she does quite an extensive review, and GUIDs always lose out clearly to INT IDENTITY. – marc_s May 13 '11 at 22:08
  • Well, the experiment above shows the contrary, and the results are repeatable. – someName May 13 '11 at 22:28
  • @marc_s What is "they", "don't" (don't [do] what?) What specifically are you commenting about? – Lasse V. Karlsen May 13 '11 at 22:33
  • Using IDENTITY doesn't require a table lock. Conceptually I could see you might expect it to be taking MAX(id) + 1, but in reality the next value is stored. It should actually be faster than finding the next GUID. – May 13 '11 at 23:37
  • I notice that with a set-based insert of 100,000 rows the difference more or less disappears. Wonder if there is additional logging needed for each identity-based insert. Seems to be implied from here: http://blogs.msdn.com/b/sqlprogrammability/archive/2006/04/04/567724.aspx – Martin Smith May 14 '11 at 00:05
  • And my tests with SELECT COUNT(*) FROM sys.fn_dblog(NULL,NULL) indicate that inserting 100,000 rows into the table with the guid column generates about 113K log records, whereas for the id column the number is more like 212K, i.e. nearly double the amount of logging is required for the identity column insert. These figures are the same regardless of whether the rows are inserted one by one or in bulk, though. – Martin Smith May 14 '11 at 00:59
  • Also, presumably the filler column for the TestGuid2 table should be CHAR(88) to make the rows equal size. – Mitch Wheat May 14 '11 at 01:16

6 Answers

19

I modified @Phil Sandler's code to remove the effect of calling GETDATE() (there may be hardware effects/interrupts involved?), and made the rows the same length.

[There have been several articles since SQL Server 2000 relating to timing issues and high-resolution timers, so I wanted to minimise that effect.]

In the simple recovery model, with the data and log files both sized well over what is required, here are the timings in seconds (updated with new results based on the exact code below):

       Identity(s)  Guid(s)
       ---------    -----
       2.876        4.060    
       2.570        4.116    
       2.513        3.786   
       2.517        4.173    
       2.410        3.610    
       2.566        3.726
       2.376        3.740
       2.333        3.833
       2.416        3.700
       2.413        3.603
       2.910        4.126
       2.403        3.973
       2.423        3.653
    -----------------------
Avg    2.650        3.857
StdDev 0.227        0.204

The code used:

SET NOCOUNT ON

CREATE TABLE TestGuid2 (Id UNIQUEIDENTIFIER NOT NULL DEFAULT NEWSEQUENTIALID() PRIMARY KEY,
SomeDate DATETIME, batchNumber BIGINT, FILLER CHAR(88))

CREATE TABLE TestInt (Id Int NOT NULL identity(1,1) PRIMARY KEY,
SomeDate DATETIME, batchNumber BIGINT, FILLER CHAR(100))

DECLARE @Numrows INT = 1000000

CREATE TABLE #temp (Id int NOT NULL Identity(1,1) PRIMARY KEY, rowNum int, adate datetime)

DECLARE @LocalCounter INT = 0

--put rows into temp table
WHILE (@LocalCounter < @NumRows)
BEGIN
    INSERT INTO #temp(rowNum, adate) VALUES (@LocalCounter, GETDATE())
    SET @LocalCounter += 1
END

--Do inserts using GUIDs
DECLARE @GUIDTimeStart DateTime = GETDATE()
INSERT INTO TestGuid2 (SomeDate, batchNumber) 
SELECT adate, rowNum FROM #temp
DECLARE @GUIDTimeEnd  DateTime = GETDATE()

--Do inserts using IDENTITY
DECLARE @IdTimeStart DateTime = GETDATE()
INSERT INTO TestInt (SomeDate, batchNumber) 
SELECT adate, rowNum FROM #temp
DECLARE @IdTimeEnd DateTime = GETDATE()

SELECT DATEDIFF(ms, @IdTimeStart, @IdTimeEnd) AS IdTime, DATEDIFF(ms, @GUIDTimeStart, @GUIDTimeEnd) AS GuidTime

DROP TABLE TestGuid2
DROP TABLE TestInt
DROP TABLE #temp
GO

After reading @Martin's investigation, I re-ran with the suggested TOP(@num) in both cases, i.e.

...
--Do inserts using GUIDs
DECLARE @num INT = 2147483647; 
DECLARE @GUIDTimeStart DATETIME = GETDATE(); 
INSERT INTO TestGuid2 (SomeDate, batchNumber) 
SELECT TOP(@num) adate, rowNum FROM #temp; 
DECLARE @GUIDTimeEnd DATETIME = GETDATE();

--Do inserts using IDENTITY
DECLARE @IdTimeStart DateTime = GETDATE()
INSERT INTO TestInt (SomeDate, batchNumber) 
SELECT TOP(@num) adate, rowNum FROM #temp;
DECLARE @IdTimeEnd DateTime = GETDATE()
...

and here are the timing results:

       Identity(s)  Guid(s)
       ---------    -----
       2.436        2.656
       2.940        2.716
       2.506        2.633
       2.380        2.643
       2.476        2.656
       2.846        2.670
       2.940        2.913
       2.453        2.653
       2.446        2.616
       2.986        2.683
       2.406        2.640
       2.460        2.650
       2.416        2.720

    -----------------------
Avg    2.426        2.688
StdDev 0.010        0.032

I wasn't able to get the actual execution plan, as the query never returned! It seems a bug is likely. (Running Microsoft SQL Server 2008 R2 (RTM) - 10.50.1600.1 (X64))

Mitch Wheat
  • Neatly illustrates the critical element of good benchmarking: Make sure you're only measuring one thing at a time. – Aaronaught May 14 '11 at 02:29
  • What plan do you get here? Does it have a SORT operator for the GUIDs? – Martin Smith May 14 '11 at 03:02
  • @Martin: Hi, I didn't check the plans (doing a few things at once :) ). I'll have a look a bit later... – Mitch Wheat May 14 '11 at 03:05
  • @Mitch - Any feedback on this? I rather suspect the main thing you are measuring here is the time taken to sort the guids for large inserts which while interesting doesn't answer the OP's original question which was about giving an explanation as to why sequential guids performed better than identity columns on single row inserts in the OP's testing. – Martin Smith May 14 '11 at 13:24
  • INT is 4 bytes wide while GUID is 16. So in order to make the rows of equal length, the FILLER column should be 88 bytes for TestGuid2 and 100 bytes for TestInt. Your example switch these widths... How do your results look if you do it the other way round? – someName May 15 '11 at 00:00
  • @someName: oops! I edited the code in 2 windows while tinkering. I'll re-run results to make sure the code I meant to post was the code run!.... – Mitch Wheat May 15 '11 at 01:23
  • @Mitch - If you get a chance to add in a third test case I'll guarantee you see the difference in performance disappear. DECLARE @num INT = 2147483647; DECLARE @GUIDTimeStart1 DATETIME = GETDATE(); INSERT INTO TestGuid (SomeDate, batchNumber) SELECT TOP(@num) adate, rowNum FROM #temp; DECLARE @GUIDTimeEnd1 DATETIME = GETDATE(); (this will remove the unnecessary sort from the plan that has no benefit whatsoever) – Martin Smith May 15 '11 at 01:49
  • @Martin: OK will give it a go.... – Mitch Wheat May 15 '11 at 02:00
  • @Mitch - Although the more I think about it the less I understand why anyone would ever want to use NEWSEQUENTIALID anyway. It will make the index deeper, use 20% more data pages in the OP's case and is only guaranteed to be ever increasing until the machine is rebooted so has a lot of disadvantages over an identity. It just seems in this case that the Query Plan adds in a further unnecessary one! – Martin Smith May 15 '11 at 02:07
  • @Martin: BTW, kudos to you for pursuing this. It might be worth raising as a Connect item. Interestingly, if I try to display the actual execution plan, it never returns!!! (I'm using Microsoft SQL Server 2008 R2 (RTM) - 10.50.1600.1 (X64) ) I've re-run some timings with TOP(@num) for both. I'll add to my post – Mitch Wheat May 15 '11 at 02:40
  • +1 Interesting results, Both table definitions will get 62 rows per data page so I presume that the consistent better performance is due to the narrower key meaning fewer pages (and page splits) at the upper levels of the index (and fewer levels as well as the number of rows grows) – Martin Smith May 15 '11 at 11:19
  • @Mitch: Was the code you pasted exactly what you executed? As far as I can see, the inserts aren't encapsulated in a single transaction block, and thus there will be a COMMIT after each INSERT. Another (but minor) detail is, that the batchNumber is just being set to the row number, resulting in N batches for an N-row insertion. In my own example I used the batch number to hold the number of each N-row insertion group. But that's not important here. :-) – someName May 15 '11 at 19:16
  • Now that I think about it: it seems odd that the only way we can tweak IDENTITY() to be faster is in the special case where all rows are prepared in a temporary table. But that doesn't reflect how row insertions are normally done. Can we conclude that, under normal circumstances with conventional use of INSERT, IDENTITY() is outperformed by NEWSEQUENTIALID()? – someName May 16 '11 at 07:10
19

On a fresh database in the simple recovery model, with the data file sized at 1 GB and the log file at 3 GB (laptop machine, both files on the same drive), and the recovery interval set to 100 minutes (to avoid a checkpoint skewing the results), I see similar results to yours with the single-row inserts.
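
(For anyone reproducing this: the recovery interval is a server-wide advanced option, so setting it might look something like the sketch below; adjust to your own environment.)

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'recovery interval (min)', 100;   -- minutes between automatic checkpoints
RECONFIGURE;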

I tested three cases: for each case I did 20 batches of inserting 100,000 rows individually into the following tables. The full scripts can be found in this answer's revision history.

CREATE TABLE TestGuid
  (
     Id          UNIQUEIDENTIFIER NOT NULL DEFAULT NEWSEQUENTIALID() PRIMARY KEY,
     SomeDate    DATETIME, batchNumber BIGINT, FILLER CHAR(100)
  )

CREATE TABLE TestId
  (
     Id          Int NOT NULL identity(1, 1) PRIMARY KEY,
     SomeDate    DATETIME, batchNumber BIGINT, FILLER CHAR(100)
  )

CREATE TABLE TestInt
  (
     Id          Int NOT NULL PRIMARY KEY,
     SomeDate    DATETIME, batchNumber BIGINT, FILLER  CHAR(100)
  )  

For the third table the test inserted rows with an incrementing Id value, but this was self-calculated by incrementing the value of a variable in a loop.
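
A minimal sketch of how that third case's insert loop might look for a single batch (an illustration only, not the exact code from the revision history; the batch number is just a stand-in value here):

DECLARE @BatchCounter BIGINT = 1   -- illustrative batch number
DECLARE @Id INT = 0

WHILE (@Id < 100000)
BEGIN
    SET @Id += 1
    -- The key value is computed in T-SQL; no IDENTITY machinery is involved.
    INSERT INTO TestInt (Id, SomeDate, batchNumber)
    VALUES (@Id, GETDATE(), @BatchCounter)
END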

Averaging the time taken across the 20 batches gave the following results.

NEWSEQUENTIALID() IDENTITY()  INT
----------------- ----------- -----------
1999              2633        1878

Conclusion

So it definitely appears to be the overhead of the identity creation process that is responsible for the results. For the self-calculated incrementing integer, the results are much more in line with what would be expected when considering only the IO cost.

When I put the insert code described above into stored procedures and review sys.dm_exec_procedure_stats, it gives the following results:

proc_name      execution_count      total_worker_time    last_worker_time     min_worker_time      max_worker_time      total_elapsed_time   last_elapsed_time    min_elapsed_time     max_elapsed_time     total_physical_reads last_physical_reads  min_physical_reads   max_physical_reads   total_logical_writes last_logical_writes  min_logical_writes   max_logical_writes   total_logical_reads  last_logical_reads   min_logical_reads    max_logical_reads
-------------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- --------------------
IdentityInsert 20                   45060360             2231067              2094063              2645079              45119362             2234067              2094063              2660080              0                    0                    0                    0                    32505                1626                 1621                 1626                 6268917              315377               276833               315381
GuidInsert     20                   34829052             1742052              1696051              1833055              34900053             1744052              1698051              1838055              0                    0                    0                    0                    35408                1771                 1768                 1772                 6316837              316766               298386               316774
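
(For reference, the figures above can be pulled from the DMV with a query along these lines; only the columns discussed below are selected here, and the procedure names match the results table.)

SELECT OBJECT_NAME(ps.object_id, ps.database_id) AS proc_name,
       ps.execution_count,
       ps.total_worker_time,      -- CPU time, in microseconds
       ps.total_logical_writes,
       ps.total_logical_reads
FROM   sys.dm_exec_procedure_stats AS ps
WHERE  OBJECT_NAME(ps.object_id, ps.database_id) IN ('IdentityInsert', 'GuidInsert');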

So in those results total_worker_time is about 30% higher. This represents

Total amount of CPU time, in microseconds, that was consumed by executions of this stored procedure since it was compiled.

So it simply appears that the code which generates the IDENTITY value is more CPU intensive than the code which generates the NEWSEQUENTIALID(): the difference between the two total_worker_time figures is 10,231,308 µs, which over the 20 × 100,000 = 2,000,000 inserts averages out at about 5 µs per insert. For this table definition, that fixed CPU cost was sufficiently high to outweigh the additional logical reads and writes incurred due to the greater width of the key. (NB: Itzik Ben Gan did similar testing here and found a 2 µs penalty per insert.)

So why is IDENTITY more CPU intensive than UuidCreateSequential?

I believe this is explained in this article: for every tenth identity value generated, SQL Server has to write the change to the system tables on disk.

What about MultiRow Inserts?

When the 100,000 rows are inserted in a single statement, I found the difference disappeared, with perhaps still a slight benefit to the GUID case but nowhere near as clear-cut a result. The average for 20 batches in my test was:

NEWSEQUENTIALID() IDENTITY()
----------------- -----------
1016              1088

The reason it doesn't show the penalty apparent in Phil's code and Mitch's first set of results is that the code I used to do the multi-row insert happened to use SELECT TOP (@NumRows), which prevented the optimiser from correctly estimating the number of rows that would be inserted.

This seems to be of benefit as there is a certain tipping point at which it will add an additional sort operation for the (supposedly sequential!) GUIDs.

[Execution plan screenshot: GUID Sort]

According to the explanatory text in BOL, this sort operation should not be required:

Creates a GUID that is greater than any GUID previously generated by this function on a specified computer since Windows was started. After restarting Windows, the GUID can start again from a lower range, but is still globally unique.

So it seemed to me a bug or missing optimisation that SQL Server does not recognise that the output of the compute scalar will already be pre-sorted, as it apparently already does for the identity column. (Edit: I reported this, and the unnecessary sort issue is now fixed in Denali.)

Martin Smith
8

Quite simple: with a GUID, it is cheaper to generate the next value in line than it is for IDENTITY (the current value of the GUID does not have to be stored; the IDENTITY value does). This is true even for NEWSEQUENTIALID().

You could make the test fairer and use a SEQUENCE with a large CACHE - which is cheaper than IDENTITY.
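
A minimal sketch of what that could look like (SEQUENCE requires SQL Server 2012/Denali or later; the object names here are only illustrative):

-- A sequence with a large cache: the current value is persisted to disk
-- only once per CACHE-sized block, not on every increment.
CREATE SEQUENCE dbo.TestKeySeq AS INT
    START WITH 1
    INCREMENT BY 1
    CACHE 10000;

CREATE TABLE TestSeq
(
    Id          INT NOT NULL DEFAULT (NEXT VALUE FOR dbo.TestKeySeq) PRIMARY KEY,
    SomeDate    DATETIME, batchNumber BIGINT, FILLER CHAR(100)
);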

But as M.R. says, there are some major advantages to GUIDs. As a matter of fact, they are MUCH more scalable than IDENTITY columns (but only if they are NOT sequential).

See: http://blog.kejser.org/2011/10/05/boosting-insert-speed-by-generating-scalable-keys/

Thomas Kejser
4

I'm fascinated by this type of question. Why did you have to post it on a Friday night? :)

I think even if your test is ONLY intended to measure INSERT performance, you (may) have introduced a number of factors that could be misleading (looping, a long-running transaction, etc.)

I'm not completely convinced my version proves anything, but identity does perform better than the GUIDs in it (3.2 seconds vs 6.8 seconds on a home PC):

SET NOCOUNT ON

CREATE TABLE TestGuid2 (Id UNIQUEIDENTIFIER NOT NULL DEFAULT NEWSEQUENTIALID() PRIMARY KEY,
SomeDate DATETIME, batchNumber BIGINT, FILLER CHAR(100))

CREATE TABLE TestInt (Id Int NOT NULL identity(1,1) PRIMARY KEY,
SomeDate DATETIME, batchNumber BIGINT, FILLER CHAR(100))

DECLARE @Numrows INT = 1000000

CREATE TABLE #temp (Id int NOT NULL Identity(1,1) PRIMARY KEY, rowNum int)

DECLARE @LocalCounter INT = 0

--put rows into temp table
WHILE (@LocalCounter < @NumRows)
BEGIN
    INSERT INTO #temp(rowNum) VALUES (@LocalCounter)
    SET @LocalCounter += 1
END

--Do inserts using GUIDs
DECLARE @GUIDTimeStart DateTime = GETDATE()
INSERT INTO TestGuid2 (SomeDate, batchNumber) 
SELECT GETDATE(), rowNum FROM #temp
DECLARE @GUIDTimeEnd  DateTime = GETDATE()

--Do inserts using IDENTITY
DECLARE @IdTimeStart DateTime = GETDATE()
INSERT INTO TestInt (SomeDate, batchNumber) 
SELECT GETDATE(), rowNum FROM #temp
DECLARE @IdTimeEnd DateTime = GETDATE()

SELECT DATEDIFF(ms, @IdTimeStart, @IdTimeEnd) AS IdTime
SELECT DATEDIFF(ms, @GUIDTimeStart, @GUIDTimeEnd) AS GuidTime

DROP TABLE TestGuid2
DROP TABLE TestInt
DROP TABLE #temp
Phil Sandler
  • The other factor that no one has mentioned is database recovery model, and log file growths... – Mitch Wheat May 14 '11 at 01:41
  • @Mitch on a new database in simple recovery model with data and log file both sized way over what is required I get similar results to the OP. – Martin Smith May 14 '11 at 01:43
  • I just got timings of 2.560 seconds for Identity, and 3.666 seconds for Guid (in simple recovery model with data and log file both sized way over what is required) – Mitch Wheat May 14 '11 at 01:44
  • @Mitch - On the OP's code with it all in the same transaction or on Phil's code? – Martin Smith May 14 '11 at 01:45
  • on this posters code, that's why I'm commenting here. I've also posted the code I used... – Mitch Wheat May 14 '11 at 01:45
  • @Mitch - I get the same type of result with this code. The GUID one gives me a parallel plan where it sorts the Guids. This is skipped for the identity one. Wonder why it can't just generate the GUIDs sequentially as advertised so this step gets skipped! – Martin Smith May 14 '11 at 01:56
  • I can't get an execution plan to show up on my machine for some reason, unless I set the @Numrows to 100 or less. I wonder what causes the sort? @Martin: I'm not clear on what makes a difference in your code and mine. The only thing I can think of is the distribution of data, since the data your statements insert are always the same? –  May 14 '11 at 03:27
  • @Phil - Populate the temp table first then run the rest of the code in a separate step to get the plans. The reason for the difference is because it has an accurate estimate of the number of rows for yours whereas for mine it doesn't. But the additional sort operator added in seems counter productive in practice. – Martin Smith May 14 '11 at 10:16
  • @Phil: It seems strange, that the only way we can tweak IDENTITY() to actually be faster than NEWSEQUENTIALID() is to prepare all in a temporary table and perform the insertions in a batch afterwards. I fail to see why the speed of IDENTITY()-insertions is so low under "normal" (non-batch) circumstances. After all, preparing the data in a temporary table like this is not a common-practice way of performing insertions - so in the general, and most typical case, IDENTITY() is not a good choice. Can anyone explain what happens here? – someName May 16 '11 at 06:40
3

I ran your sample script several times making a few tweaks to batch count and size (and thank you very much for providing it).

First I'll say that you're only measuring one aspect of the keys' performance - INSERT speed. So unless you're specifically concerned only with getting data into the tables as quickly as possible, there's much more to this animal.

My findings were in general similar to yours. However, I would note that the variance in INSERT speed is larger with GUID than with IDENTITY (int) - maybe +/- 10% between runs for GUID, while the batches that used IDENTITY varied by less than 2-3% each time.

Also to note, my test box is clearly less powerful than yours so I had to use smaller row counts.

Yuck
  • When the PK is a GUID is it possible that the engine uses not an index but a hashing algorithm to determine the physical location of the corresponding record? Inserts into a sparse table with hashed primary keys are always faster than inserts into a table with an index on the primary key because of the absence of the index overhead. It's just a question -- don't vote me down if the answer is No. Just supply the link to the authority. –  May 14 '11 at 00:17
1

I'm going to refer back to another conversation on Stack Overflow on this same topic - https://stackoverflow.com/questions/170346/what-are-the-performance-improvement-of-sequential-guid-over-standard-guid

One thing I do know is that with sequential GUIDs the index usage is better, due to very little leaf movement and therefore less HD seeking. I would think that, because of this, the inserts would be faster too, as the engine doesn't have to distribute the keys over a large number of pages.
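
That effect is easy to check: after loading a table keyed on NEWID() and one keyed on NEWSEQUENTIALID() with the same number of rows, something like the sketch below would show the difference in leaf-level fragmentation (the table names are hypothetical):

SELECT OBJECT_NAME(ips.object_id)        AS table_name,
       ips.index_type_desc,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM   sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
WHERE  OBJECT_NAME(ips.object_id) IN ('TestGuidRandom', 'TestGuidSequential');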

My personal experience is that when you are implementing a large, high-traffic DB, it is better to use GUIDs, because it makes it much more scalable for integration with other systems. That goes for replication specifically, and for int/bigint limits... not that you are likely to run out of bigints, but eventually you will, and then have to cycle back.

M.R.