4

A clustered index sorts the row data in a table according to the ordering of the fields in the index. Given that, why would you (or would you not) want to create a clustered index on the child table in a parent/child relationship?

Let's assume the classic order-with-orderlines example, where you never fetch the orderlines without first going through the associated order. I would think that having a clustered index on both the orderid and the orderlineid columns (as clustered indexes must be unique) forces all orderlines for a single order to be physically located adjacent to each other, which would make reads very performant. Is this correct? And if so, are there downsides to this approach?
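As a concrete illustration, here is a minimal T-SQL sketch of the design described above (table and column names are illustrative, not from any real schema):

```sql
-- Hypothetical order/orderline schema with a composite clustered key on the
-- child table. The composite (OrderId, OrderLineId) key is unique, and rows
-- sharing an OrderId end up stored physically adjacent.
CREATE TABLE Orders (
    OrderId   int  NOT NULL,
    OrderDate date NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderId)
);

CREATE TABLE OrderLines (
    OrderId     int NOT NULL,
    OrderLineId int NOT NULL,
    Quantity    int NOT NULL,
    CONSTRAINT PK_OrderLines
        PRIMARY KEY CLUSTERED (OrderId, OrderLineId),
    CONSTRAINT FK_OrderLines_Orders
        FOREIGN KEY (OrderId) REFERENCES Orders (OrderId)
);
```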

JDT
  • 175
  • 6
  • Why do you think clustered index must be unique? – a1ex07 Feb 04 '15 at 21:41
  • 1
    They do not HAVE to be, but if they are not they would be made unique internally by SQL Server (which is the database I'm working with so YMMV for other systems...). There is some overhead for this so I would try to avoid it. See: http://stackoverflow.com/questions/4332982/do-clustered-indexes-have-to-be-unique – JDT Feb 04 '15 at 22:01

1 Answer

4

Your assumption about adjacency is correct.

If we use TPC-H as an example: clustering the LINEITEMS table on ORDERID will place all order lines belonging to the same order physically adjacent on disk. This speeds up queries that fetch all order lines for a given ORDERID. Clustering the child on the foreign key to the parent also allows fast merge joins between the child and parent.
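To make the access patterns concrete, here is a hedged sketch of the two query shapes that benefit. The LINEITEMS/ORDERS names follow the wording above rather than the official TPC-H schema, and columns such as ORDERDATE, LINENUMBER and QUANTITY are assumed:

```sql
-- (1) Fetching all lines for one order becomes a single range seek on the
--     clustered index, reading physically adjacent pages.
SELECT *
FROM   LINEITEMS
WHERE  ORDERID = @OrderId;

-- (2) Joining parent to child: both inputs are already sorted on ORDERID,
--     so the optimizer can choose a merge join without a sort operator.
SELECT o.ORDERID, o.ORDERDATE, l.LINENUMBER, l.QUANTITY
FROM   ORDERS    AS o
JOIN   LINEITEMS AS l ON l.ORDERID = o.ORDERID;
```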

There are a few downsides to the clustering approach:

  • The entire table must be kept sorted on disk. If you are expecting a great many inserts with ORDERID not being sequentially generated, page splits will be more expensive. This is something you can throw hardware at.
  • If ORDERID is generated sequentially, you will create a hotspot at the end of the table. In some database engines (For example SQL Server) this is a problem at high insert speed. In SQL Server, this typically kicks in around 5K-10K inserts/sec.
  • The cluster index keys either have to be unique (ex: ORDERID, LINENUMBER) or padded with some hidden column to make them unique. Since the composite cluster key must be present in all other indexes, this makes the secondary, non-clustered indexes larger.
  • Storing the table clustered will force a B-tree traversal when you want to locate data via a secondary index (unless the secondary index covers the query). If you instead kept the table as a heap, all other indexes would only carry an 8-byte row identifier (RID) overhead, and your B-tree traversals are cut in half.
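
The uniqueness point in particular can be sketched in T-SQL (SQL Server syntax; index and column names are illustrative):

```sql
-- Option A: an explicitly unique composite cluster key.
CREATE UNIQUE CLUSTERED INDEX CIX_OrderLines
    ON OrderLines (ORDERID, LINENUMBER);

-- Option B: a non-unique clustered index. SQL Server silently appends a
-- hidden 4-byte "uniqueifier" to duplicate key values, and this wider key
-- is then carried into every nonclustered index on the table.
CREATE CLUSTERED INDEX CIX_OrderLines
    ON OrderLines (ORDERID);
```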

In the vast majority of cases, you will want to cluster both the parent and the child on the same leading key. But if you expect the child table to be accessed via many different indexes, it may be worth considering the alternatives.

Thomas Kejser
  • 6,208
  • 2
  • 22
  • 46
  • My understanding is that the "clustered index hotspot" problem is not real in modern versions of MS SQL (http://dba.stackexchange.com/a/1586/1896). – Jon of All Trades Feb 06 '15 at 15:31
  • It very much IS a problem in even the latest versions. The problem moved from a locking to a latching issue. See this: http://kejser.org/boosting-insert-speed-by-generating-scalable-keys/ – Thomas Kejser Feb 06 '15 at 16:54
  • Hotspots are generally a good thing rather than a bad thing. Most workloads won't be doing singleton 10K inserts/sec into a table and, if so, should be using bulk insert instead. – Dan Guzman Feb 07 '15 at 15:58
  • @DanGuzman: How are hotspots a good thing? You are right that most workloads don't start out with 10K inserts/sec. The problem occurs when you grown to that speed and find yourself trying to manage a huge table with an IDENTITY column that no longer scales – Thomas Kejser Feb 07 '15 at 21:39
  • @Thomas, I completely agree with your answer except for the hotspot part. A clustered index on an incremental key minimizes I/O and provides linear insert performance regardless of table size, as well as better temporal reference locality. Most workloads aren't going to run into the page latch contention (hotspot) issue. – Dan Guzman Feb 07 '15 at 22:48
  • @DanGuzman: It does NOT provide linear performance regardless of size. In fact, you bump into a scale cliff. Both Klaus Aschenbrenner and I have conclusively proven this. Here: http://kejser.org/boosting-insert-speed-by-generating-scalable-keys/ and here: http://www.sqlpassion.at/archive/2014/04/15/an-ever-increasing-clustered-key-value-doesnt-scale/... And if you want it straight from Microsoft, see this http://www.microsoft.com/en-gb/download/details.aspx?id=26665 (page 13). That clustered indexes with sequential keys DON'T scale is a settled question at this point. – Thomas Kejser Feb 08 '15 at 10:46
  • @ThomasKejser, I'd like to continue this discussion with a chat. One can demonstrate the exact opposite behavior as well (http://www.dbdelta.com/improving-uniqueidentifier-performance/). With all things SQL Server, there are seldom absolutes. The answer is nearly always "it depends". – Dan Guzman Feb 08 '15 at 14:55
  • Happy to talk though it over chat. Just ping me when you find me here Dan. – Thomas Kejser Feb 08 '15 at 16:41
  • @ThomasKejser, http://chat.stackexchange.com/rooms/info/20975/sql-server-incremental-versus-random-keys – Dan Guzman Feb 09 '15 at 12:51
  • @ThomasKejser and @DanGuzman, looks like the debate never happened? I always considered a high-insert table, like a log, to be best indexed on a sequential key. A GUID seemed best for indexes where the row represents essentially statistically random data, like the names of people registering: theoretically that type of data is statistically evenly distributed, unless of course you went to a Bobs-only convention. – enorl76 Jul 26 '18 at 20:58
  • 1
    @enorl76, the upshot of our discussion is it depends :-) Random keys avoid latch contention at very high insert rates. The storage system must be capable of sustaining more IOPS with large tables and random keys compared to same with incremental keys. Incremental keys with large tables perform better than random keys with a slow storage system because less IO is needed. Given SSDs are becoming commonplace and more RAM available, random keys are more palatable nowadays than in the past. – Dan Guzman Jul 27 '18 at 01:56