6

I have found that when a table has both clustered and non-clustered indexes (on different columns), the leaf level non-clustered pages, instead of pointing to the data row, point to the node of the clustered index, from where another search is instituted to find the data row. What is the point of this extra level of indirection? If the clustered index has, say, 8 levels, then the indirection from the NCI leaf page to the CI root would have to traverse through these 8 levels to reach the data. Why not store the normal RID in the NCI leaf page so that we can access the data at once without going through the CI index structure?

marc_s
  • 8,932
  • 6
  • 45
  • 51

2 Answers2

10

The reason for this is that the "fixed" physical location of your row - the RID (or row identifier) might (and will!) change over time - think page splits that occur when a row needs to be inserted into a table on a page that's already full.

Updating those RIDs in all the nonclustered indices that exist on a given table is quickly becoming both a hassle, and a huge performance killer. You might have 5, 10, 20 nonclustered indices on your table, and SQL Server would have to scan all those indices (basically scanning the whole index, all rows in the index, and that 10, 20 times) and update all RIDs.... that's just not practical - very quickly so.

If you store the value of the clustering index as a "row pointer" instead, that value should typically never change - and most definitely it doesn't need to be updated every time a page is split. Yes, it does involve a second index seek operation - the key lookup - but for simple scenarios, retrieving a single row or a few rows, that's still much more efficient than anything else.

jcolebrand
  • 6,354
  • 4
  • 42
  • 67
marc_s
  • 8,932
  • 6
  • 45
  • 51
  • Ok, got that. So basically you are saying that for a NCI, multiple index rows will have to be updated with the pointers to the newly fragmented page if a page split occurs (these may be huge if the number of rows in a page on average is large), whereas for a CI index, if a fragmentation occurs, only the next address of the old page and the newly formed page have to be updated. – NedStarkOfWinterfell Sep 01 '12 at 12:44
  • 2
    If you NCI stored the physical address of the actual data row, then anytime the data row moves (e.g. through a page split), ALL NCI on that table would have to be updated. When a page split occurs, the CI value does NOT CHANGE and no update is needed in the NCI. HUGE improvement! – marc_s Sep 01 '12 at 12:49
  • Two questions, when a page split occurs, why should all the NIC's need to be updated? Only the RID's in the parent pages that point to the now fragmented pages should be updated, no? And for a CI, if a page split occurs, won't the addresses stored in the doubly linked list for that page need to be updated to point to the newly created page? – NedStarkOfWinterfell Sep 01 '12 at 12:57
  • @Cupidvogel: if you have 10 NCI on a table, each one of them will basically point to any of the data pages. So if that page holding the data is split, all NCI will need to be updated. Agreed? And if the page is split, then yes - the index structure of the clustered index will need to be updated - but only few pages - the parent page, the previous and next sibling pages - that's all - no NCI need to be updated. Also: there's only one single CI - there cannot be 10 of those to update! – marc_s Sep 01 '12 at 13:18
  • Oh, you meant that all NIC's have to be updated. Sure. I though that you meant all index rows within the NIC have to be updated. That's why I said only the relevant index rows have to be updated. But these will of course add up cumulatively. Another question, the article you pointed to, says that if the NCIs hold row-ids, that's space wastage, for it's useless data. But when the NCIs are pointing to to the node, they all are containing a pointer to the node-id anyway, so where is the economy of space? – NedStarkOfWinterfell Sep 01 '12 at 13:33
  • @marc_S: Just a minor comment. "there's only one single CI", usually. I've heard of one particular DBMS (TokuDB) that has multiple CIs. – ypercubeᵀᴹ Sep 02 '12 at 08:05
  • @ypercube: thanks for the heads-up - but this is about SQL Server, and that only has one (at most) CI – marc_s Sep 02 '12 at 21:28
  • @marc_s: Sorry, the question is only tagged with [sql]. I guess you have discussed with the OP before. – ypercubeᵀᴹ Sep 02 '12 at 22:17
  • The point about RIDs needs to be qualified, because what's described never happens in SQL Server. If a table is a heap and a base row physically moves, a forwarding record is created in its place to point to the new location. The new location has a pointer back to the original location so a chain can be avoided on subsequent moves. While this avoids having to update all the NCIs, base table access has to traverse the forwarding to get to the data -- there is a big performance hit to this, but not for the reason of having to update all NCIs. Again, this only applies to a table that is a heap. – Jon Seigel Sep 04 '12 at 02:24
7

In simple terms, it involves less processing and movement of NC index entries when data in the clustered index physically moves (row forwarding, page splits, INSERTs etc).

Mostly the clustered index entries only need changed: not the NC index pointers. By using RIDs, you'd need to do a lot more work on the NC indexes.

To minimize this lookup in a query, you'd make the NC indexes "covering". See https://stackoverflow.com/questions/1395275/which-is-better-bookmark-key-lookup-or-index-scan/1395322#1395322 for example

gbn
  • 69,809
  • 8
  • 163
  • 243