4

In my PostgreSQL 12.8 database, I have a relatively simple table the_table with a column value with type varchar:

CREATE TABLE public.the_table (
    id uuid DEFAULT gen_random_uuid() NOT NULL,
    label character varying,
    value character varying,
    created_at timestamp without time zone NOT NULL,
    updated_at timestamp without time zone NOT NULL,
);

I would like to query for all rows with a value that is formatted as an email address. The query looks like this: SELECT * FROM the_table WHERE value ~ '^[a-zA-Z0-9.$%&*+/=?^_{|}~-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9]+$'.

As there are a several million rows in that table, I try to speed up this query by adding a matching expression index with CREATE INDEX index_the_table_on_email_values ON the_table ((value ~ '^[a-zA-Z0-9.$%&*+/=?^_{|}~-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9]+$'));

Unfortunately, the query planner does not utilize the index and performs a full scan on the table instead, which is very slow.

Can anybody help me fixing the index or tell me what other options I have? I already considered a generated boolean column is_email instead. I could add an index to that generated column and query it directly. But this seems like a weird workaround for the original problem, which should be solvable with a matching index, correct?

1 Answers1

6

To be fair, your index on a boolean expression basically works, too.

The point is this: if there is a large percentage of "email" rows, no index is going to help (much) - except for special cases. Postgres will typically chose a faster sequential scan instead. (I suspect that's your case.)
And if there are only few "email" rows, a partial index instead is much more efficient as it excludes most rows to begin with:

CREATE INDEX the_table_email_idx ON the_table ((true))
WHERE value ~ '^[a-zA-Z0-9.$%&*+/=?^_{|}~-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9]+$';

(true) is just an arbitrary constant, since there is no obvious index column. Typically, you have a useful index column on top of the "email filter" that can replace that constant - to make the index even more useful. Related:

Of course, the idea with a generated column is_email isn't that bad either. You would then create a partial index with the condition on that generated column instead. There are pros and cons to this.

Erwin Brandstetter
  • 175,982
  • 27
  • 439
  • 600