Composite indexes: Most selective column first?

Question

I’ve been reading about composite indexes and I’m slightly confused about ordering. This documentation (a little less than halfway down) says

In general, you should put the column expected to be used most often first in the index.

However, shortly after it says

create a composite index putting the most selective column first; that is, the column with the most values.

Oracle also says it here in other words

If all keys are used in WHERE clauses equally often, then ordering these keys from most selective to least selective in the CREATE INDEX statement best improves query performance.

However, I have found a SO answer that says differently. It says

Arrange the columns with the least selective column first and the most selective column last. In the case of a tie lead with the column which is more likely to be used on its own.

The first documentation I referenced says that you should first go by the most often used whereas the SO answer says that should only be for tie breaking. Then they also differ on the ordering.

This documentation also talks about skip scanning and says

Skip scanning is advantageous if there are few distinct values in the leading column of the composite index and many distinct values in the nonleading key of the index.

Another article says

The prefix column should be the most discriminating and the most widely used in queries

where I take "most discriminating" to mean the most distinctive.

All of this research still leads me to the same question: should the most selective column be first or last? Should the first column be the most used one, with selectivity mattering only as a tie-break?

These articles seem to be contradicting each other, but they do offer some examples. From what I have gathered, it seems to be more efficient for the least selective column to be the first in the ordering if you are anticipating Index Skip Scans. But I'm not really sure if that is correct.

http://use-the-index-luke.com/sql/myth-directory/most-selective-first — , Jan 24 '17 at 12:19

atokpas · Accepted Answer · 2018-07-20T08:16:06.963

From AskTom

(in 9i, there is a new "index skip scan" -- search for that there to read about that. It makes the index (a,b) OR (b,a) useful in both of the above cases sometimes!)

So, the order of columns in your index depends on HOW YOUR QUERIES are written. You want to be able to use the index for as many queries as you can (so as to cut down on the over all number of indexes you have) -- that will drive the order of the columns. Nothing else (selectivity of a or b does not count at all).

One of the arguments for arranging columns in the composite index in order from the least discriminating (fewer distinct values) to the most discriminating (more distinct values) is index key compression.

SQL> create table t as select * from all_objects;

Table created.

SQL> create index t_idx_1 on t(owner,object_type,object_name);

Index created.

SQL> create index t_idx_2 on t(object_name,object_type,owner);

Index created.

SQL> select count(distinct owner), count(distinct object_type), count(distinct object_name ), count(*)  from t;

COUNT(DISTINCTOWNER) COUNT(DISTINCTOBJECT_TYPE) COUNT(DISTINCTOBJECT_NAME)   COUNT(*)
-------------------- -------------------------- -------------------------- ----------
                  30                         45                      52205      89807

SQL> analyze index t_idx_1 validate structure; 

Index analyzed.

SQL> select btree_space, pct_used, opt_cmpr_count, opt_cmpr_pctsave from index_stats;

BTREE_SPACE   PCT_USED OPT_CMPR_COUNT OPT_CMPR_PCTSAVE
----------- ---------- -------------- ----------------
    5085584         90              2               28

SQL> analyze index t_idx_2 validate structure; 

Index analyzed.

SQL> select btree_space, pct_used, opt_cmpr_count, opt_cmpr_pctsave  from index_stats; 

BTREE_SPACE   PCT_USED OPT_CMPR_COUNT OPT_CMPR_PCTSAVE
----------- ---------- -------------- ----------------
    5085584         90              1               14

According to the index statistics, the first index is more compressible.

Another consideration is how the index is used in your queries.

If your queries mostly use col1, for example:

select * from t where col1 = :a and col2 = :b;

select * from t where col1 = :a;

-then index(col1,col2) would perform better.

If your queries mostly use col2,

select * from t where col1 = :a and col2 = :b;

select * from t where col2 = :b;

-then index(col2,col1) would perform better. If all of your queries always specify both columns then it doesn't matter which column comes first in the composite index.
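The rule of thumb above can be sketched as a small Python toy (not Oracle code). It assumes a simplified model in which a plain range scan can only use the leading index columns that have equality predicates, and it ignores skip scans and full index scans. The function name `usable_prefix` is made up for this illustration.

```python
def usable_prefix(index_columns, equality_columns):
    """Return the leading index columns a plain range scan could use.

    Toy rule (an assumption of this sketch): walk the index columns left
    to right and stop at the first one without an equality predicate.
    """
    prefix = []
    for col in index_columns:
        if col not in equality_columns:
            break
        prefix.append(col)
    return prefix


# Hypothetical query shapes, described by the columns in their WHERE clause.
queries = {
    "col1 = :a and col2 = :b": {"col1", "col2"},
    "col1 = :a": {"col1"},
    "col2 = :b": {"col2"},
}

for index in (("col1", "col2"), ("col2", "col1")):
    for label, predicates in queries.items():
        print(index, "|", label, "->", usable_prefix(index, predicates))
```

In this model the query on both columns can use either index fully, while each single-column query can only use the index that leads with its column, which is the point the answer makes.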

In conclusion, the key considerations in column ordering of a composite index are index key compression and how you are going to use this index in your queries.

References:

Column order in Index

It’s Less Efficient To Have Low Cardinality Leading Columns In An Index (Right) ?

Index Skip Scan – Does Index Column Order Matter Any More ? (Warning Sign)

Chris Saxon · Answer 2 · 2024-02-01T18:21:31.110

When choosing index column order, the overriding concern is:

Are there (equality) predicates against this column in my queries?

If a column never appears in a where clause, it's not worth indexing (1).

OK, so you've got a table and queries against each column. Sometimes more than one.

How do you decide what to index?

Let's look at an example. Here's a table with three columns. One holds 10 values, another 1,000, the last 10,000:

create table t(
  few_vals  varchar2(10),
  many_vals varchar2(10),
  lots_vals varchar2(10)
);
insert into t 
with rws as (
  select lpad(mod(rownum, 10), 10, '0'), 
         lpad(mod(rownum, 1000), 10, '0'), 
         lpad(rownum, 10, '0') 
  from dual connect by level <= 10000
)
  select * from rws;
commit;
select count(distinct few_vals),
       count(distinct many_vals) ,
       count(distinct lots_vals) 
from   t;
COUNT(DISTINCTFEW_VALS)  COUNT(DISTINCTMANY_VALS)  COUNT(DISTINCTLOTS_VALS)
10                       1,000                     10,000

These are numbers left padded with zeros. This will help make the point about compression later.

So you've got three common queries:

select count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
from   t
where  few_vals = '0000000001';
select count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
from   t
where  lots_vals = '0000000001';
select count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
from   t
where  lots_vals = '0000000001'
and    few_vals = '0000000001';

What do you index?

An index on just few_vals is only marginally better than a full table scan:

select count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
from   t
where  few_vals = '0000000001';
select * 
from table(dbms_xplan.display_cursor(null, null, 'IOSTATS LAST -PREDICATE'));

| Id  | Operation            | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
|   0 | SELECT STATEMENT     |          |      1 |        |      1 |00:00:00.01 |      61 |
|   1 |  SORT AGGREGATE      |          |      1 |      1 |      1 |00:00:00.01 |      61 |
|   2 |   VIEW               | VW_DAG_0 |      1 |   1000 |   1000 |00:00:00.01 |      61 |
|   3 |    HASH GROUP BY     |          |      1 |   1000 |   1000 |00:00:00.01 |      61 |
|   4 |     TABLE ACCESS FULL| T        |      1 |   1000 |   1000 |00:00:00.01 |      61 |

select /*+ index (t (few_vals)) */
       count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
from   t
where  few_vals = '0000000001';
select * 
from   table(dbms_xplan.display_cursor(null, null, 'IOSTATS LAST -PREDICATE'));

| Id  | Operation                              | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
|   0 | SELECT STATEMENT                       |          |      1 |        |      1 |00:00:00.01 |      58 |
|   1 |  SORT AGGREGATE                        |          |      1 |      1 |      1 |00:00:00.01 |      58 |
|   2 |   VIEW                                 | VW_DAG_0 |      1 |   1000 |   1000 |00:00:00.01 |      58 |
|   3 |    HASH GROUP BY                       |          |      1 |   1000 |   1000 |00:00:00.01 |      58 |
|   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| T        |      1 |   1000 |   1000 |00:00:00.01 |      58 |
|   5 |      INDEX RANGE SCAN                  | FEW      |      1 |   1000 |   1000 |00:00:00.01 |       5 |

So it's unlikely to be worth indexing on its own. Queries on lots_vals return few rows (just 1 in this case). So this is definitely worth indexing.

But what about the queries against both columns?

Should you index:

( few_vals, lots_vals )

OR

( lots_vals, few_vals )

Trick question!

The answer is neither.

Sure, few_vals is a long string, so you can get good compression out of it. And with an index on (few_vals, lots_vals) you (might) get an index skip scan for queries that only have predicates on lots_vals. But I don't get one here, even though the skip scan performs markedly better than a full scan:

create index few_lots on t(few_vals, lots_vals);
select count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
from   t
where  lots_vals = '0000000001';
select * 
from   table(dbms_xplan.display_cursor(null, null, 'IOSTATS LAST -PREDICATE'));

| Id  | Operation            | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
|   0 | SELECT STATEMENT     |          |      1 |        |      1 |00:00:00.01 |      61 |
|   1 |  SORT AGGREGATE      |          |      1 |      1 |      1 |00:00:00.01 |      61 |
|   2 |   VIEW               | VW_DAG_0 |      1 |      1 |      1 |00:00:00.01 |      61 |
|   3 |    HASH GROUP BY     |          |      1 |      1 |      1 |00:00:00.01 |      61 |
|   4 |     TABLE ACCESS FULL| T        |      1 |      1 |      1 |00:00:00.01 |      61 |

select /*+ index_ss (t few_lots) */ count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
from   t
where  lots_vals = '0000000001';
select * 
from   table(dbms_xplan.display_cursor(null, null, 'IOSTATS LAST -PREDICATE'));

| Id  | Operation                              | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
|   0 | SELECT STATEMENT                       |          |      1 |        |      1 |00:00:00.01 |      13 |     11 |
|   1 |  SORT AGGREGATE                        |          |      1 |      1 |      1 |00:00:00.01 |      13 |     11 |
|   2 |   VIEW                                 | VW_DAG_0 |      1 |      1 |      1 |00:00:00.01 |      13 |     11 |
|   3 |    HASH GROUP BY                       |          |      1 |      1 |      1 |00:00:00.01 |      13 |     11 |
|   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| T        |      1 |      1 |      1 |00:00:00.01 |      13 |     11 |
|   5 |      INDEX SKIP SCAN                   | FEW_LOTS |      1 |     40 |      1 |00:00:00.01 |      12 |     11 |

Do you like gambling? (2)
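To picture why a skip scan only pays off when the leading column has few distinct values, here is a rough Python sketch. It models the index as a sorted list of tuples and the skip scan as one probe per distinct leading value. That is a simplification for illustration, not how Oracle implements it, and the names `skip_scan`, `few_lots` and `lots_few` are just local variables of this toy.

```python
import bisect

# Toy data mirroring the demo table: few_vals = rownum mod 10, lots_vals = rownum.
rows = [(r % 10, r) for r in range(1, 10001)]

few_lots = sorted(rows)                           # index on (few_vals, lots_vals)
lots_few = sorted((lots, few) for few, lots in rows)  # index on (lots_vals, few_vals)


def skip_scan(index, target):
    """Probe the index once per distinct leading value, looking for `target`
    in the second column. Returns (number of probes, matching entries)."""
    probes, hits = 0, []
    # The toy finds the distinct leading values by brute force; a real
    # B-tree would discover them while navigating the tree.
    for lead in sorted({entry[0] for entry in index}):
        probes += 1
        pos = bisect.bisect_left(index, (lead, target))
        while pos < len(index) and index[pos] == (lead, target):
            hits.append(index[pos])
            pos += 1
    return probes, hits


probes_few_leading, hits_few_leading = skip_scan(few_lots, 1)    # lots_vals = 1
probes_lots_leading, hits_lots_leading = skip_scan(lots_few, 1)  # few_vals = 1

print(probes_few_leading, len(hits_few_leading))
print(probes_lots_leading, len(hits_lots_leading))
```

With few_vals leading, the toy needs 10 probes to find the single matching row. With lots_vals leading it needs 10,000 probes, at which point reading the whole index or table is the more sensible plan.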

So you still need an index with lots_vals as the leading column. And at least in this case the compound index (few, lots) does the same amount of work as one on just (lots):

select count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
from   t
where  lots_vals = '0000000001'
and    few_vals = '0000000001';
select * 
from   table(dbms_xplan.display_cursor(null, null, 'IOSTATS LAST -PREDICATE'));

| Id  | Operation                              | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
|   0 | SELECT STATEMENT                       |          |      1 |        |      1 |00:00:00.01 |       3 |
|   1 |  SORT AGGREGATE                        |          |      1 |      1 |      1 |00:00:00.01 |       3 |
|   2 |   VIEW                                 | VW_DAG_0 |      1 |      1 |      1 |00:00:00.01 |       3 |
|   3 |    HASH GROUP BY                       |          |      1 |      1 |      1 |00:00:00.01 |       3 |
|   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| T        |      1 |      1 |      1 |00:00:00.01 |       3 |
|   5 |      INDEX RANGE SCAN                  | FEW_LOTS |      1 |      1 |      1 |00:00:00.01 |       2 |

create index lots on t(lots_vals);
select /*+ index (t (lots_vals)) */ count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
from   t
where  lots_vals = '0000000001'
and    few_vals = '0000000001';
select * 
from   table(dbms_xplan.display_cursor(null, null, 'IOSTATS LAST -PREDICATE'));

| Id  | Operation                              | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
|   0 | SELECT STATEMENT                       |          |      1 |        |      1 |00:00:00.01 |       3 |      1 |
|   1 |  SORT AGGREGATE                        |          |      1 |      1 |      1 |00:00:00.01 |       3 |      1 |
|   2 |   VIEW                                 | VW_DAG_0 |      1 |      1 |      1 |00:00:00.01 |       3 |      1 |
|   3 |    HASH GROUP BY                       |          |      1 |      1 |      1 |00:00:00.01 |       3 |      1 |
|   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| T        |      1 |      1 |      1 |00:00:00.01 |       3 |      1 |
|   5 |      INDEX RANGE SCAN                  | LOTS     |      1 |      1 |      1 |00:00:00.01 |       2 |      1 |

There will be cases where the compound index saves you 1-2 IOs. But is it worth having two indexes for this saving?

And there's another problem with the composite index. Compare the clustering factor for the three indexes including LOTS_VALS:

create index lots on t(lots_vals);
create index lots_few on t(lots_vals, few_vals);
create index few_lots on t(few_vals, lots_vals);
select index_name, leaf_blocks, distinct_keys, clustering_factor
from   user_indexes
where  table_name = 'T';
INDEX_NAME  LEAF_BLOCKS  DISTINCT_KEYS  CLUSTERING_FACTOR
FEW_LOTS    47           10,000         530
LOTS_FEW    47           10,000         53
LOTS        31           10,000         53
FEW         31           10             530

Notice that the clustering factor for few_lots is 10x higher than for lots and lots_few! And this is in a demo table with perfect clustering to begin with. In real world databases the effect is likely to be worse.

So what's so bad about that?

The clustering factor is one of the key drivers determining how "attractive" an index is. The higher it is, the less likely the optimizer is to choose it. Particularly if lots_vals aren't actually unique, but still normally have few rows per value. If you're unlucky this could be enough to make the optimizer think a full scan is cheaper...
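As a rough illustration of where a clustering factor comes from, the Python sketch below walks the rows in index key order and counts how often the table block changes. It is a simplified model, not Oracle's exact algorithm. `ROWS_PER_BLOCK = 189` is an assumption picked so that the 10,000 toy rows occupy 53 blocks; the real block layout of the demo table is not stated in the answer.

```python
ROWS_PER_BLOCK = 189  # assumption: 10,000 rows then fill 53 table blocks


def block_of(rownum):
    """Table block holding a row, assuming rows are stored in insert order."""
    return (rownum - 1) // ROWS_PER_BLOCK


# (rownum, few_vals, lots_vals), generated like the demo table.
table = [(r, f"{r % 10:010d}", f"{r:010d}") for r in range(1, 10001)]


def clustering_factor(key):
    """Walk rows in index order (key, then row position) and count how
    often consecutive index entries point to a different table block."""
    ordered = sorted(table, key=lambda row: (key(row), row[0]))
    changes, previous = 0, None
    for row in ordered:
        block = block_of(row[0])
        if block != previous:
            changes += 1
            previous = block
    return changes


cf_lots = clustering_factor(lambda row: (row[2],))
cf_lots_few = clustering_factor(lambda row: (row[2], row[1]))
cf_few_lots = clustering_factor(lambda row: (row[1], row[2]))
cf_few = clustering_factor(lambda row: (row[1],))

print("lots     ", cf_lots)
print("lots_few ", cf_lots_few)
print("few_lots ", cf_few_lots)
print("few      ", cf_few)
```

Under that assumption the toy gives 53 for the indexes leading with lots_vals and 530 for those leading with few_vals: each of the 10 few_vals values forces another pass over all 53 blocks.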

OK, so composite indexes with few_vals and lots_vals only have edge case benefits.

What about queries filtering few_vals and many_vals?

Single-column indexes only give small benefits. But combined they return few values. So a composite index is a good idea. But which way round?

If you place few first, compressing the leading column will make that smaller:

create index few_many on t(many_vals, few_vals);
create index many_few on t(few_vals, many_vals);
select index_name, leaf_blocks, distinct_keys, clustering_factor 
from   user_indexes
where  index_name in ('FEW_MANY', 'MANY_FEW');
INDEX_NAME  LEAF_BLOCKS  DISTINCT_KEYS  CLUSTERING_FACTOR
FEW_MANY    47           1,000          10,000
MANY_FEW    47           1,000          10,000
alter index few_many rebuild compress 1;
alter index many_few rebuild compress 1;
select index_name, leaf_blocks, distinct_keys, clustering_factor 
from   user_indexes
where  index_name in ('FEW_MANY', 'MANY_FEW');
INDEX_NAME  LEAF_BLOCKS  DISTINCT_KEYS  CLUSTERING_FACTOR
MANY_FEW    31           1,000          10,000
FEW_MANY    34           1,000          10,000

The index with fewer different values in the leading column (many_few, which leads with few_vals) compresses better. So there's marginally less work to read this index. But only slightly. And both are already a good chunk smaller than the original (25% size decrease).

And you can go further and compress the whole index!

alter index few_many rebuild compress 2;
alter index many_few rebuild compress 2;
select index_name, leaf_blocks, distinct_keys, clustering_factor 
from   user_indexes
where  index_name in ('FEW_MANY', 'MANY_FEW');
INDEX_NAME  LEAF_BLOCKS  DISTINCT_KEYS  CLUSTERING_FACTOR
FEW_MANY    20           1,000          10,000
MANY_FEW    20           1,000          10,000

Now both indexes are back to the same size. Note this takes advantage of the fact there's a relationship between few and many. Again it's unlikely you'll see this kind of benefit in the real world.
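The pattern in these compression results can be approximated with a back-of-the-envelope Python model. It assumes each distinct prefix is stored once for the whole index and the remaining columns are stored per row. Real prefix compression works per leaf block and there is rowid and block overhead, so treat the byte counts as illustrative only. `compressed_bytes` is a helper invented for this sketch.

```python
WIDTH = 10  # every column value in the demo is a 10-character string

# (few_vals, many_vals), generated like the demo table.
rows = [(f"{r % 10:010d}", f"{r % 1000:010d}") for r in range(1, 10001)]

few_leading = [(few, many) for few, many in rows]
many_leading = [(many, few) for few, many in rows]


def compressed_bytes(entries, prefix_columns):
    """Toy size of the key data when the first `prefix_columns` columns
    are stored once per distinct prefix instead of once per row."""
    n_columns = len(entries[0])
    distinct_prefixes = {entry[:prefix_columns] for entry in entries}
    prefix_bytes = len(distinct_prefixes) * prefix_columns * WIDTH
    suffix_bytes = len(entries) * (n_columns - prefix_columns) * WIDTH
    return prefix_bytes + suffix_bytes


for n in (0, 1, 2):
    print("compress", n,
          "few leading:", compressed_bytes(few_leading, n),
          "many leading:", compressed_bytes(many_leading, n))
```

In this model, compressing one column favours the index that leads with few_vals only slightly (10 stored prefixes versus 1,000), and compressing both columns gives the same size either way because there are only 1,000 distinct (few, many) pairs in this data.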

So far we've only talked about equality checks. Often with composite indexes you'll have an inequality against one of the columns. e.g. queries such as "get the orders/shipments/invoices for a customer in the past N days".

If you have these kinds of queries, you want the equality against the first column of the index:

select count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
from   t
where  few_vals < '0000000002'
and    many_vals = '0000000001';
select * 
from   table(dbms_xplan.display_cursor(null, null, 'IOSTATS LAST -PREDICATE'));

| Id  | Operation                              | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
|   0 | SELECT STATEMENT                       |          |      1 |        |      1 |00:00:00.01 |      12 |
|   1 |  SORT AGGREGATE                        |          |      1 |      1 |      1 |00:00:00.01 |      12 |
|   2 |   VIEW                                 | VW_DAG_0 |      1 |     10 |     10 |00:00:00.01 |      12 |
|   3 |    HASH GROUP BY                       |          |      1 |     10 |     10 |00:00:00.01 |      12 |
|   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| T        |      1 |     10 |     10 |00:00:00.01 |      12 |
|   5 |      INDEX RANGE SCAN                  | FEW_MANY |      1 |     10 |     10 |00:00:00.01 |       2 |

select count (distinct few_vals || ':' || many_vals || ':' || lots_vals )
from   t
where  few_vals = '0000000001'
and    many_vals < '0000000002';
select * 
from   table(dbms_xplan.display_cursor(null, null, 'IOSTATS LAST -PREDICATE'));

| Id  | Operation                              | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
|   0 | SELECT STATEMENT                       |          |      1 |        |      1 |00:00:00.01 |      12 |      1 |
|   1 |  SORT AGGREGATE                        |          |      1 |      1 |      1 |00:00:00.01 |      12 |      1 |
|   2 |   VIEW                                 | VW_DAG_0 |      1 |      2 |     10 |00:00:00.01 |      12 |      1 |
|   3 |    HASH GROUP BY                       |          |      1 |      2 |     10 |00:00:00.01 |      12 |      1 |
|   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| T        |      1 |      2 |     10 |00:00:00.01 |      12 |      1 |
|   5 |      INDEX RANGE SCAN                  | MANY_FEW |      1 |      1 |     10 |00:00:00.01 |       2 |      1 |

Notice they're using the opposite index.
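Why the equality column wants to lead can be sketched in Python by treating each index as a sorted list of tuples and counting how many entries a range scan has to walk. This is a toy model with integer keys standing in for the padded strings, not a measurement of Oracle.

```python
import bisect

# (few_vals, many_vals, rownum) with integers standing in for the padded strings.
rows = [(r % 10, r % 1000, r) for r in range(1, 10001)]

many_first = sorted((many, few, r) for few, many, r in rows)  # equality column leads
few_first = sorted((few, many, r) for few, many, r in rows)   # range column leads

# Query: many_vals = 1 and few_vals < 2

# Equality column leading: both predicates bound one contiguous slice.
start = bisect.bisect_left(many_first, (1,))
stop = bisect.bisect_left(many_first, (1, 2))
scanned_equality_first = stop - start

# Range column leading: every entry with few_vals < 2 is walked,
# and many_vals = 1 can only be applied as a filter afterwards.
stop = bisect.bisect_left(few_first, (2,))
candidates = few_first[:stop]
scanned_range_first = len(candidates)
matches = [entry for entry in candidates if entry[1] == 1]

print(scanned_equality_first, scanned_range_first, len(matches))
```

Both approaches find the same 10 rows, but the toy walks 10 index entries when the equality column leads and 2,000 when the range column leads.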

TL;DR

Columns with equality conditions should go first in the index.
If you have multiple columns with equalities in your query, placing the one with the fewest different values first will give the best compression advantage.
While index skip scans are possible, you need to be confident this will remain a viable option for the foreseeable future.
Composite indexes including near-unique columns give minimal benefits. Be sure you really need to save the 1-2 IOs.

1: In some cases it may be worth including a column in an index if this means all the columns in your query are in the index. This enables an index only scan, so you don't need to access the table.

2: If you're licensed for Diagnostics and Tuning, you could force the plan to a skip scan with SQL Plan Management

ADDENDA

PS - the docs you've quoted there are from 9i. That's reeeeeeally old. I'd stick with something more recent

Is a query with select count (distinct few_vals || ':' || many_vals || ':' || lots_vals ) really common? Doesn't Oracle allow the syntax select count (distinct few_vals, many_vals, lots_vals ) - which doesn't do any string concatenation, doesn't need the columns to be text types and doesn't rely on the absence of : character? — ypercubeᵀᴹ, Jan 31 '17 at 20:24
@ypercubeᵀᴹ you can't do a count ( distinct x, y, z ) in Oracle. So you need to do a distinct subquery and count the results or a concatenation like the above. I just did it here to force a table access (rather than index only scan) and just have one row in the result — Chris Saxon, Feb 01 '17 at 11:07

score 3 · Answer 3 · answered Jan 14 '17 at 17:19

Most selective first is useful only when this column is in the actual WHERE clause.

When the SELECT is by a larger group (less selective), and then possibly by other, non-indexed values, an index with less selective columns may still be useful (if there's a reason not to create another one).

If there's a table ADDRESS, with

COUNTRY CITY STREET, something else...

Indexing STREET, CITY, COUNTRY will yield the fastest queries with a street name. But when querying all streets of a city, the index will be useless, and the query will likely make a full table scan.

Indexing COUNTRY, CITY, STREET may be a bit slower for individual streets, but the index can be used for other queries, only selecting by country and/or city.
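A small Python sketch of the ADDRESS example, with made-up rows, shows the difference: in key order, the entries for one city form a single contiguous run under (COUNTRY, CITY, STREET) but are scattered under (STREET, CITY, COUNTRY). The data and the `runs` helper are hypothetical and only serve this illustration.

```python
# Hypothetical (country, city, street) rows.
addresses = [
    ("FR", "Paris", "Rue A"),
    ("FR", "Paris", "Rue B"),
    ("FR", "Lyon", "Rue A"),
    ("US", "Paris", "Main St"),
    ("US", "Austin", "Main St"),
    ("US", "Austin", "Rue B"),
]


def runs(index_order, predicate):
    """Count contiguous runs of matching entries when walking in key order.
    One run means a single range scan can fetch every match."""
    count, inside = 0, False
    for entry in index_order:
        match = predicate(entry)
        if match and not inside:
            count += 1
        inside = match
    return count


def in_paris_fr(address):
    return address[0] == "FR" and address[1] == "Paris"


by_country_city_street = sorted(addresses)
by_street_city_country = sorted(addresses, key=lambda a: (a[2], a[1], a[0]))

runs_country_first = runs(by_country_city_street, in_paris_fr)
runs_street_first = runs(by_street_city_country, in_paris_fr)

print(runs_country_first, runs_street_first)
```

With country and city leading, the matching rows sit next to each other in the index; with street leading, they are split across the index, which is why the database would likely fall back to reading everything.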

score 1 · Answer 4 · answered Jan 30 '17 at 23:06

Besides column selectivity, several other aspects of your queries contribute to the decision on what a composite index should start with and/or contain.

For example:

What type of query operators are being used: whether queries have operators like ">, >=, <, <=".
How many rows are expected in the query result: whether the result is going to be most of the rows in the table.
Whether any functions are applied to the table column in the WHERE clause: for example UPPER, LOWER, TRIM, or SUBSTRING used on the column in the WHERE condition.

Yet, to keep the conversation relevant, my answer below applies to the following situation:

"90% of the queries on a given table have a WHERE clause with the operator ="
"a query returns at most 10% of the total rows in the table"
"no functions are applied to the table column in the WHERE clause"
"the columns used in the WHERE clause are mostly of type number or string"

In my experience, a DBA should be mindful of both usage and selectivity.

Let's imagine only one of the two rules is applied:

1) If I create an index with the most selective column first, but that column is not actually used by most queries on that table, then it is of no use to the db engine.

2) If I create an index with the most widely used column first, but that column has low selectivity, then my query performance is not going to be good either.

I would list the columns that are used in 90% of the table's queries, then put only those in order from highest cardinality to lowest.

We use indexes to improve read query performance, and that workflow (the types of read queries) alone should drive index creation. In fact, as the data grows (billions of rows), a compressed index may save storage but is sure to hurt read query performance.

score 1 · Answer 5 · answered Jun 20 '18 at 10:23

In theory the most selective column yields the fastest search. But at work I just stumbled on a situation where we have a composite index of 3 parts with the most selective part first (date, author, publishing company, let's say, in that order; the table tracks thumbs up on posts), and I have a query that uses all 3 parts. MySQL defaults to using the author-only index, skipping the composite index containing company and date despite them being present in my query. I used force index to use the composite and the query actually ran slower. Why did that happen? I shall tell you:

I was selecting a range on the date, so despite the date being highly selective, the fact that we are using it for range scans (even though the range is relatively short, 6 months out of 6 years of data) made the composite harmful for MySQL. To use the composite in that particular case, MySQL has to grab all articles written since New Year's and then dive into who the author is, and given that the author has not written that many articles compared to other authors, MySQL preferred to just find that author.

In another case the query ran much, much faster on the composite: an author was hugely popular and owned most of the records, so sorting by date made sense. But MySQL did not auto-detect that case, I had to force the index... So you know, it varies. Range scans could render your selective column useless. The distribution of the data could make cases where columns are more selective for different records...

What I would do differently is shift the date (which again, in theory is the most selective) to the right, since I know I will be performing a range scan on it now and that makes a difference.

If your query had something like WHERE (date BETWEEN @x AND @y) AND (author = @a) AND (publishing company = @p) then an index on (author, publishing_company, date) or on (publishing_company, author, date) would be better and would be used - without forcing it. — ypercubeᵀᴹ, Jul 20 '18 at 09:07

score -2 · Answer 6 · answered Jan 14 '17 at 04:47

Different cases for different situations. Know your goal; then create your indexes and run explain plans for each and you will have your best answer for your situation.

RMPJ

score -2 · Answer 7 · edited Jan 30 '18 at 05:55

From Column order in Index on Ask Tom:

So, the order of columns in your index depends on HOW YOUR QUERIES are written. You want to be able to use the index for as many queries as you can (so as to cut down on the over all number of indexes you have) -- that will drive the order of the columns. Nothing else (selectivity of a or b does not count at all).

I agree that we have to order columns based on the where clause, but the statement "(selectivity of a or b does not count at all)" is not correct. The most selective columns should lead, provided the first rule ("where clause") is satisfied.
