Question asked
Test table:
CREATE TABLE tbl (id int, str text);
INSERT INTO tbl VALUES
(1, 'a.b.c.d.e')
, (2, 'x1.yy2.zzz3') -- different number & length of elements for testing
, (3, '') -- empty string
, (4, NULL) -- NULL
;
Recursive CTE in a LATERAL subquery
SELECT *
FROM tbl, LATERAL (
WITH RECURSIVE cte AS (
SELECT str
UNION ALL
SELECT right(str, strpos(str, '.') * -1) -- trim leading name
FROM cte
WHERE str LIKE '%.%' -- stop after last dot removed
)
SELECT ARRAY(TABLE cte) AS result
) r;
The CROSS JOIN LATERAL (, LATERAL for short) is safe, because the aggregate result of the subquery always returns a row. You get ...
- ... an array with an empty string element for
str = '' in the base table
- ... an array with a NULL element for
str IS NULL in the base table
Wrapped up with a cheap array constructor in the subquery, so no aggregation in the outer query.
A showpiece of SQL features, but the rCTE overhead may prevent top performance.
Brute force for trivial number of elements
For your case with a trivially small number of elements, a simple approach without subquery may be faster:
SELECT id, array_remove(ARRAY[substring(str, '(?:[^.]+\.){4}[^.]+$')
, substring(str, '(?:[^.]+\.){3}[^.]+$')
, substring(str, '(?:[^.]+\.){2}[^.]+$')
, substring(str, '[^.]+\.[^.]+$')
, substring(str, '[^.]+$')], NULL)
FROM tbl;
Assuming a maximum of 5 elements like you commented. You can easily expand for more.
If a given domain has fewer elements, excess substring() expressions return NULL and are removed by array_remove().
Actually, the expression from above(right(str, strpos(str, '.')), nested several times may be faster (though awkward to read) since regular expression functions are more expensive.
A fork of @Dudu's query
@Dudu's smart query might be improved with generate_subscripts():
SELECT id, array_agg(array_to_string(arr[i:], '.')) AS result
FROM (SELECT id, string_to_array(str,'.') AS arr FROM tbl) t
LEFT JOIN LATERAL generate_subscripts(arr, 1) i ON true
GROUP BY id;
Also using LEFT JOIN LATERAL ... ON true to preserve possible rows with NULL values.
PL/pgSQL function
Similar logic as the rCTE. Substantially simpler and faster than what you have:
CREATE OR REPLACE FUNCTION string_part_seq(input text, OUT result text[])
LANGUAGE plpgsql IMMUTABLE STRICT AS
$func$
BEGIN
LOOP
result := result || input; -- text[] || text array concatenation
input := right(input, strpos(input, '.') * -1);
EXIT WHEN input = '';
END LOOP;
END
$func$;
The OUT parameter is returned at the end of the function automatically.
There is no need to initialize result, because NULL::text[] || text 'a' = '{a}'::text[].
This only works with 'a' being properly typed. NULL::text[] || 'a' (string literal) would raise an error because Postgres picks the array || array operator.
strpos() returns 0 if no dot is found, so right() returns an empty string and the loop ends.
This is probably the fastest of all solutions here.
All of them work in Postgres 9.3+
(except for the short array slice notation arr[3:]. I added an upper bound in the fiddle to make it work in pg 9.3: arr[3:999].)
fiddle
sqlfiddle
Different approach to optimize search
I am with @jpmc26 (and yourself): a completely different approach will be preferable. I like jpmc26's combination of reverse() and a text_pattern_ops.
A trigram index would be superior for partial or fuzzy matches. But since you are only interested in whole words, Full Text Search is another option. I expect a substantially smaller index size and thus better performance.
pg_trgm as well as FTS support case insensitive queries, btw.
Host names like q.x.t.com or t.com (words with inline dots) are identified as type "host" and treated as one word. But there is also prefix matching in FTS (which seems to be overlooked sometimes). The manual:
Also, * can be attached to a lexeme to specify prefix matching:
Using @jpmc26's smart idea with reverse(), we can make this work:
SELECT *
FROM tbl
WHERE to_tsvector('simple', reverse(str))
@@ to_tsquery ('simple', reverse('c.d.e') || ':*');
-- or with reversed prefix: reverse('*:c.d.e')
Which is supported by an index:
CREATE INDEX tbl_host_idx ON tbl USING GIN (to_tsvector('simple', reverse(str)));
Note the 'simple' configuration: We do not want the stemming or thesaurus used with the default 'english' configuration.
Alternatively (with a bigger variety of possible queries) we could use the new phrase search capability of text search in Postgres 9.6. The release notes:
A phrase-search query can be specified in tsquery input using the new
operators <-> and <N>. The former means that the lexemes before and
after it must appear adjacent to each other in that order. The latter
means they must be exactly N lexemes apart.
Query:
SELECT *
FROM tbl
WHERE to_tsvector ('simple', replace(str, '.', ' '))
@@ phraseto_tsquery('simple', 'c d e');
Replace dot ('.') with space (' ') to keep the parser from classifying 't.com' as host name and instead use each word as separate lexeme.
And a matching index to go with it:
CREATE INDEX tbl_phrase_idx ON tbl USING GIN (to_tsvector('simple', replace(str, '.', ' ')));