How can I generate all trailing substrings following a delimiter?

Question

Given a string that may contain multiple instances of a delimiter, I want to generate all substrings starting after that character.

For example, given a string like 'a.b.c.d.e' (or array {a,b,c,d,e}, I suppose), I want to generate an array like:

{a.b.c.d.e, b.c.d.e, c.d.e, d.e, e}

The intended use is a trigger that fills a column whenever another column is written to, so that domain name parts are easier to query (e.g. find every q.x.t.com for the query t.com).

It seems like an awkward way to solve this (and it very well may be), but now I'm curious how a function like this could be written in (Postgres') SQL.

These are email domain names so it's hard to say what the maximum possible number of elements is, but certainly the vast majority would be < 5.

@ErwinBrandstetter yes. Sorry for the delay (holidays etc). I picked the trigram index answer because it actually solved my real problem the best. However I am sensitive to the fact that my question was specifically about how I could break apart a string in this way (for curiosity's sake) so I'm not sure if I've used the best metric to choose the accepted answer. — Bo Jeanes, Jan 16 '17 at 03:59
The best answer should be the one best answering the given question. Ultimately, it's your choice. And the chosen one seems like a valid candidate to me. — Erwin Brandstetter, Jan 24 '17 at 03:34

David דודו Markovitz · Answer 1 · 2016-12-08T09:24:56.063

I think this is my favorite.

create table t (id int,str varchar(100));
insert into t (id,str) values (1,'a.b.c.d.e'),(2,'xxx.yyy.zzz');

ROWS

select      id
           ,array_to_string((string_to_array(str,'.'))[i:],'.')

from        t,unnest(string_to_array(str,'.')) with ordinality u(token,i)
;

+----+-----------------+
| id | array_to_string |
+----+-----------------+
|  1 | a.b.c.d.e       |
|  1 | b.c.d.e         |
|  1 | c.d.e           |
|  1 | d.e             |
|  1 | e               |
|  2 | xxx.yyy.zzz     |
|  2 | yyy.zzz         |
|  2 | zzz             |
+----+-----------------+

ARRAYS

select      id
           ,array_agg(array_to_string((string_to_array(str,'.'))[i:],'.') order by i)

from        t,unnest(string_to_array(str,'.')) with ordinality u(token,i)

group by    id
;

+----+-------------------------------------------+
| id |                 array_agg                 |
+----+-------------------------------------------+
|  1 | {"a.b.c.d.e","b.c.d.e","c.d.e","d.e","e"} |
|  2 | {"xxx.yyy.zzz","yyy.zzz","zzz"}           |
+----+-------------------------------------------+

score 4 · Answer 2 · edited Jun 15 '20 at 09:05

create table t (id int,str varchar(100));
insert into t (id,str) values (1,'a.b.c.d.e'),(2,'xxx.yyy.zzz');

ROWS

select  id
       ,regexp_replace(str,'^([^\.]+\.?){' || gs.i || '}','') as suffix
from    t,generate_series(0,cardinality(string_to_array(str,'.'))-1) gs(i)
;

OR

select  id
       ,substring(str from '(([^.]*?\.?){' || gs.i+1 || '})$') as suffix
from    t,generate_series(0,cardinality(string_to_array(str,'.'))-1) gs(i)
;

+----+-------------+
| id | suffix      |
+----+-------------+
| 1  | a.b.c.d.e   |
| 1  | b.c.d.e     |
| 1  | c.d.e       |
| 1  | d.e         |
| 1  | e           |
| 2  | xxx.yyy.zzz |
| 2  | yyy.zzz     |
| 2  | zzz         |
+----+-------------+

ARRAYS

select      id
           ,array_agg(regexp_replace(str,'^([^\.]+\.?){' || gs.i || '}','')) as suffixes
from        t,generate_series(0,cardinality(string_to_array(str,'.'))-1) gs(i)
group by    id
;

OR

select      id
           ,array_agg(substring(str from '(([^.]*?\.?){' || gs.i+1 || '})$')) as suffixes
from        t,generate_series(0,cardinality(string_to_array(str,'.'))-1) gs(i)
group by    id
;

+----+-------------------------------------------+
| id |                 suffixes                  |
+----+-------------------------------------------+
|  1 | {"a.b.c.d.e","b.c.d.e","c.d.e","d.e","e"} |
|  2 | {"xxx.yyy.zzz","yyy.zzz","zzz"}           |
+----+-------------------------------------------+

jpmc26 · Accepted Answer · 2017-01-16T08:51:00.047

I don't think you need a separate column here; this is an XY-problem. You're just trying to do a suffix search. There are two main ways to optimize that.

Turn the suffix query into a prefix query

You basically do this by reversing everything.

First create an index on the reverse of your column:

CREATE INDEX ON yourtable (reverse(yourcolumn) text_pattern_ops);

Then query using the same:

SELECT * FROM yourtable WHERE reverse(yourcolumn) LIKE reverse('%t.com');

You can throw in an UPPER call if you want to make it case insensitive:

CREATE INDEX ON yourtable (reverse(UPPER(yourcolumn)) text_pattern_ops);
SELECT * FROM yourtable WHERE reverse(UPPER(yourcolumn)) LIKE reverse(UPPER('%t.com'));
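One caveat with the pattern above: '%t.com' also matches values such as 'xt.com', because LIKE knows nothing about domain labels. A minimal sketch, assuming you only want t.com itself and its subdomains (yourtable and yourcolumn are the same placeholders as above):

```sql
SELECT *
FROM   yourtable
WHERE  reverse(yourcolumn) LIKE reverse('%.t.com')  -- subdomains such as 'q.x.t.com'
   OR  reverse(yourcolumn) = reverse('t.com');      -- the bare domain itself
```

Both predicates are anchored at the start of the reversed string, so the planner should be able to use the index on reverse(yourcolumn) for each of them. Check with EXPLAIN on your own data to confirm.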

Trigram Indexes

The other option is trigram indexes. You should definitely use this if you need infix queries (LIKE 'something%something' or LIKE '%something%' type queries).

First enable the trigram index extension:

CREATE EXTENSION pg_trgm;

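(pg_trgm is one of the additional modules supplied with PostgreSQL, so it is normally available without a separate download. Some distributions package these modules separately, e.g. as postgresql-contrib.)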

Then create a trigram index on your column:

CREATE INDEX ON yourtable USING GIST(yourcolumn gist_trgm_ops);

Then just select:

SELECT * FROM yourtable WHERE yourcolumn LIKE '%t.com';

Again, you can throw in an UPPER to make it case insensitive if you like:

CREATE INDEX ON yourtable USING GIST(UPPER(yourcolumn) gist_trgm_ops);
SELECT * FROM yourtable WHERE UPPER(yourcolumn) LIKE UPPER('%t.com');
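pg_trgm also provides a GIN operator class, gin_trgm_ops, and trigram indexes support ILIKE directly, which avoids wrapping both sides in UPPER. A sketch of that variant, where the index name is an arbitrary choice and yourtable / yourcolumn are the same placeholders as above:

```sql
CREATE INDEX yourtable_yourcolumn_trgm_idx
    ON yourtable USING GIN (yourcolumn gin_trgm_ops);

-- Case-insensitive suffix match without an expression index
SELECT * FROM yourtable WHERE yourcolumn ILIKE '%t.com';
```

As a rule of thumb, GIN tends to be faster to search and slower to update than GiST, so it is worth testing both against your own workload.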

Your question as written

Under the hood, trigram indexes use a somewhat more general form of what you're asking for. They break the string down into pieces (trigrams) and build an index based on those. The index can then be used to find matches much faster than a sequential scan, for infix as well as suffix and prefix queries. Always try to avoid reinventing what someone else has developed when you can.
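To see the pieces such an index works with, pg_trgm exposes show_trgm(), which returns the trigrams extracted from a string. This assumes the extension has been created as shown earlier; the input value is just the example from the question:

```sql
SELECT show_trgm('q.x.t.com');
```

pg_trgm ignores non-alphanumeric characters such as the dot when extracting trigrams, so the trigrams come from the individual labels rather than from the string as a whole.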

Credits

The two solutions are pretty much verbatim from Choosing a PostgreSQL text search method. I highly recommend giving it a read for a detailed analysis of the available text search options in PostgreSQL.

I didn't come back to this until after Christmas, so apologies for the delay in choosing an answer. Trigram indexes ended up being the easiest thing in my case and helped me the most, though it's the least literal answer to the question asked, so I'm not sure what SE's policy there is for choosing appropriate answers. Either way, thank you all for your help. — Bo Jeanes, Jan 16 '17 at 03:57

Erwin Brandstetter · Answer 4 · 2022-10-29T04:44:15.537

Question asked

Test table:

CREATE TABLE tbl (id int, str text);
INSERT INTO tbl VALUES
  (1, 'a.b.c.d.e')
, (2, 'x1.yy2.zzz3')     -- different number & length of elements for testing
, (3, '')                -- empty string
, (4, NULL)              -- NULL
;

Recursive CTE in a LATERAL subquery

SELECT *
FROM   tbl, LATERAL (
   WITH RECURSIVE cte AS (
      SELECT str
      UNION ALL
      SELECT right(str, strpos(str, '.') * -1)  -- trim leading name
      FROM   cte
      WHERE  str LIKE '%.%'  -- stop after last dot removed
      )
   SELECT ARRAY(TABLE cte) AS result
   ) r;

The CROSS JOIN LATERAL (, LATERAL for short) is safe, because the aggregate result of the subquery always returns a row. You get ...

... an array with an empty string element for str = '' in the base table
... an array with a NULL element for str IS NULL in the base table

Wrapped up with a cheap array constructor in the subquery, so no aggregation in the outer query.

A showpiece of SQL features, but the rCTE overhead may prevent top performance.
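The recursive step relies on right() with a negative length, which returns everything except the first |n| characters. A small standalone illustration of that single step (the sample values are made up for the demonstration):

```sql
SELECT s                              AS input
     , strpos(s, '.')                 AS dot_position
     , right(s, strpos(s, '.') * -1)  AS remainder
FROM  (VALUES ('a.b.c.d.e'), ('d.e'), ('e')) v(s);
```

For 'a.b.c.d.e' the dot is at position 2, so the remainder is 'b.c.d.e'. For 'e' there is no dot, strpos() returns 0, and the remainder is an empty string, which is why the rCTE stops on the LIKE '%.%' condition instead.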

Brute force for trivial number of elements

For your case with a trivially small number of elements, a simple approach without subquery may be faster:

SELECT id, array_remove(ARRAY[substring(str, '(?:[^.]+\.){4}[^.]+$')
                            , substring(str, '(?:[^.]+\.){3}[^.]+$')
                            , substring(str, '(?:[^.]+\.){2}[^.]+$')
                            , substring(str,        '[^.]+\.[^.]+$')
                            , substring(str,               '[^.]+$')], NULL)
FROM   tbl;

Assuming a maximum of 5 elements like you commented. You can easily expand for more.

If a given domain has fewer elements, excess substring() expressions return NULL and are removed by array_remove().

Actually, the expression from above (right(str, strpos(str, '.') * -1)), nested several times, may be faster (though awkward to read), since regular expression functions are more expensive.
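A sketch of that idea, written with chained LATERAL subqueries instead of literally nesting the expression, so that each step can reuse the previous result. This is one possible arrangement under the same assumption of at most 5 elements, not necessarily what was intended above; the aliases s1 to s4 and a to d are arbitrary:

```sql
SELECT t.id
     , array_remove(ARRAY[t.str
                        , NULLIF(s1, '')   -- '' means: no more dots left
                        , NULLIF(s2, '')
                        , NULLIF(s3, '')
                        , NULLIF(s4, '')], NULL) AS result
FROM   tbl t
     , LATERAL (SELECT right(t.str, strpos(t.str, '.') * -1) AS s1) a
     , LATERAL (SELECT right(s1,    strpos(s1,    '.') * -1) AS s2) b
     , LATERAL (SELECT right(s2,    strpos(s2,    '.') * -1) AS s3) c
     , LATERAL (SELECT right(s3,    strpos(s3,    '.') * -1) AS s4) d;
```

Each LATERAL subquery returns exactly one row, so no rows from tbl are lost. Once the dots run out, right() yields an empty string, which NULLIF() turns into NULL for array_remove() to drop.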

A fork of @Dudu's query

@Dudu's smart query might be improved with generate_subscripts():

SELECT id, array_agg(array_to_string(arr[i:], '.')) AS result
FROM  (SELECT id, string_to_array(str,'.') AS arr FROM tbl) t
LEFT   JOIN LATERAL generate_subscripts(arr, 1) i ON true
GROUP  BY id;

Also using LEFT JOIN LATERAL ... ON true to preserve possible rows with NULL values.

See: What is the difference between LATERAL and a subquery in PostgreSQL?

PL/pgSQL function

Similar logic as the rCTE. Substantially simpler and faster than what you have:

CREATE OR REPLACE FUNCTION string_part_seq(input text, OUT result text[])
  LANGUAGE plpgsql IMMUTABLE STRICT AS
$func$
BEGIN
   LOOP
      result := result || input;  -- text[] || text array concatenation
      input  := right(input, strpos(input, '.') * -1);
      EXIT WHEN input = '';
   END LOOP;
END
$func$;

The OUT parameter is returned at the end of the function automatically.

There is no need to initialize result, because NULL::text[] || text 'a' = '{a}'::text[].
(This only works with 'a' being properly typed. NULL::text[] || 'a' (string literal) would raise an error because Postgres picks the array || array operator.)

strpos() returns 0 if no dot is found, so right() returns an empty string and the loop ends.

This is probably the fastest of all solutions here.
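For the use case described in the question (a trigger that fills a column whenever another column is written), the function could be wired up roughly as follows. Only string_part_seq() comes from this answer; the table, column, trigger and index names are hypothetical and chosen purely for illustration:

```sql
-- Hypothetical table: "domain" is written by the application,
-- "suffixes" is maintained by the trigger.
CREATE TABLE email_domain (
   domain   text
 , suffixes text[]
);

CREATE OR REPLACE FUNCTION email_domain_fill_suffixes()
  RETURNS trigger
  LANGUAGE plpgsql AS
$func$
BEGIN
   NEW.suffixes := string_part_seq(NEW.domain);  -- NULL in, NULL out (STRICT)
   RETURN NEW;
END
$func$;

CREATE TRIGGER email_domain_suffixes
BEFORE INSERT OR UPDATE OF domain ON email_domain
FOR EACH ROW EXECUTE PROCEDURE email_domain_fill_suffixes();

-- A GIN index on the array supports the containment operator @>
CREATE INDEX email_domain_suffixes_idx ON email_domain USING GIN (suffixes);

INSERT INTO email_domain (domain) VALUES ('q.x.t.com');

SELECT domain FROM email_domain WHERE suffixes @> ARRAY['t.com'];  -- finds 'q.x.t.com'
```

This is a sketch of the approach the question asks about. The index-only alternatives discussed elsewhere on this page avoid the extra column altogether.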

All of them work in Postgres 9.3+
(except for the short array slice notation arr[3:]. I added an upper bound in the fiddle to make it work in pg 9.3: arr[3:999].)

fiddle
sqlfiddle

Different approach to optimize search

I am with @jpmc26 (and yourself): a completely different approach will be preferable. I like jpmc26's combination of reverse() and a text_pattern_ops index.

A trigram index would be superior for partial or fuzzy matches. But since you are only interested in whole words, Full Text Search is another option. I expect a substantially smaller index size and thus better performance.

pg_trgm as well as FTS support case insensitive queries, btw.

Host names like q.x.t.com or t.com (words with inline dots) are identified as type "host" and treated as one word. But there is also prefix matching in FTS (which seems to be overlooked sometimes). The manual:

Also, * can be attached to a lexeme to specify prefix matching:

Using @jpmc26's smart idea with reverse(), we can make this work:

SELECT *
FROM   tbl
WHERE  to_tsvector('simple', reverse(str))
    @@ to_tsquery ('simple', reverse('c.d.e') || ':*');
-- or with reversed prefix:  reverse('*:c.d.e')

Which is supported by an index:

CREATE INDEX tbl_host_idx ON tbl USING GIN (to_tsvector('simple', reverse(str)));

Note the 'simple' configuration: We do not want the stemming or thesaurus used with the default 'english' configuration.
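To check how the text search parser actually classifies a given value (for example, whether a reversed host name is still treated as a single token), ts_debug() can be used. A small sketch using the example value from the question:

```sql
SELECT alias, token, lexemes
FROM   ts_debug('simple', reverse('q.x.t.com'));
```

The alias column shows the token type assigned by the parser, which makes it easy to verify the behavior for your own data before building the index.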

Alternatively (with a bigger variety of possible queries) we could use the new phrase search capability of text search in Postgres 9.6. The release notes:

A phrase-search query can be specified in tsquery input using the new operators <-> and <N>. The former means that the lexemes before and after it must appear adjacent to each other in that order. The latter means they must be exactly N lexemes apart.

Query:

SELECT *
FROM   tbl
WHERE  to_tsvector     ('simple', replace(str, '.', ' '))
    @@ phraseto_tsquery('simple', 'c d e');

Replace dot ('.') with space (' ') to keep the parser from classifying 't.com' as host name and instead use each word as separate lexeme.

And a matching index to go with it:

CREATE INDEX tbl_phrase_idx ON tbl USING GIN (to_tsvector('simple', replace(str, '.', ' ')));
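The search term can be converted the same way as the column, instead of writing 'c d e' by hand. Note that a phrase matches anywhere in the document, so 'c.d.e.f' would qualify as well. A sketch that adds a plain string recheck to keep only true suffixes, assuming Postgres 9.6 or later and the test table tbl from above:

```sql
-- Inspect the generated query: 'c' <-> 'd' <-> 'e'
SELECT phraseto_tsquery('simple', replace('c.d.e', '.', ' '));

SELECT *
FROM   tbl
WHERE  to_tsvector     ('simple', replace(str,     '.', ' '))
    @@ phraseto_tsquery('simple', replace('c.d.e', '.', ' '))
AND   (str = 'c.d.e' OR str LIKE '%.c.d.e');  -- keep only true suffixes
```

The first predicate matches the expression of the index above, and the second one only filters the rows that the phrase search has already narrowed down.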

score 2 · Answer 5 · answered Dec 08 '16 at 06:50

I came up with something semi-workable, but I'd love feedback on the approach. I have written very little PL/pgSQL so feel like everything I do is quite hacky and I'm surprised when it works.

Nonetheless, this is where I got to:

CREATE OR REPLACE FUNCTION string_part_sequences(input text, separator text)
RETURNS text[]
LANGUAGE plpgsql
AS $$
  DECLARE
    parts text[] := string_to_array(input, separator);
    result text[] := '{}';
    i int;
  BEGIN
    FOR i IN SELECT generate_subscripts(parts, 1) - 1
    LOOP
      SELECT array_append(result, (
          SELECT array_to_string(array_agg(x), separator)
          FROM (
            SELECT *
            FROM unnest(parts)
            OFFSET i
          ) p(x)
        )
      )
      INTO result;
    END LOOP;
    RETURN result;
  END;
$$
STRICT IMMUTABLE;

This works like so:

# SELECT string_part_sequences('mymail.unisa.edu.au', '.');
┌──────────────────────────────────────────────┐
│            string_part_sequences             │
├──────────────────────────────────────────────┤
│ {mymail.unisa.edu.au,unisa.edu.au,edu.au,au} │
└──────────────────────────────────────────────┘
(1 row)

Time: 1.168 ms

I added a simpler plpgsql function to my answer. — Erwin Brandstetter, Dec 09 '16 at 07:03

score 1 · Answer 6 · answered Dec 08 '16 at 07:42

I use a window function:

with t1 as (select regexp_split_to_table('ab.ac.xy.yx.md','\.') as str),
     t2 as (select string_agg(str,'.') over ( rows between current row and unbounded following) as str from t1 ),
     t3 as (select array_agg(str) from t2)
     select * from t3 ;

Result:

postgres=# with t1 as (select regexp_split_to_table('ab.ac.xy.yx.md','\.') as str),
postgres-#      t2 as (select string_agg(str,'.') over ( rows between current row and unbounded following) as str from t1 ),
postgres-#      t3 as (select array_agg(str) from t2)
postgres-#      select * from t3 ;
                   array_agg
------------------------------------------------
 {ab.ac.xy.yx.md,ac.xy.yx.md,xy.yx.md,yx.md,md}
(1 row)

Time: 0.422 ms
postgres=# with t1 as (select regexp_split_to_table('mymail.unisa.edu.au','\.') as str),
postgres-#      t2 as (select string_agg(str,'.') over ( rows between current row and unbounded following) as str from t1 ),
postgres-#      t3 as (select array_agg(str) from t2)
postgres-#      select * from t3 ;
                  array_agg
----------------------------------------------
 {mymail.unisa.edu.au,unisa.edu.au,edu.au,au}
(1 row)

Time: 0.328 ms

joanolo · Answer 7 · 2016-12-08T21:39:01.623

A variant of the solution by @Dudu Markovitz, that also works with versions of PostgreSQL that do not (yet) recognize [i:]:

create table t (id int,str varchar(100));
insert into t (id,str) values (1,'a.b.c.d.e'),(2,'xxx.yyy.zzz');

SELECT    
    id, array_to_string(the_array[i:upper_bound], '.')
FROM     
    (
    SELECT
        id, 
        string_to_array(str, '.') the_array, 
        array_upper(string_to_array(str, '.'), 1) AS upper_bound
    FROM
        t
    ) AS s0, 
    generate_series(1, upper_bound) AS s1(i);
