Postgres: Dictionary to remove leading zeroes from alpha-numerical words

Question

I am running a full text search in Postgres 10 over texts which include a variety of alphanumerical IDs (as well as 'real' text). My clients want to be able to find those IDs without the need to enter the leading zeroes. As the number of digits vary for different IDs I do not want to go down the path of adding extra zero-padded search queries.

I was wondering if there is a way to remove leading zeroes using a dictionary. But this would require some wildcards to catch an alphanumeric expression with leading zeroes. My understanding so far is that this is not supported by the ispell/hunspell dictionaries supported by Postgres 10.

I am currently thinking about using regexes like /\W[0]+(:alnum:)+/ and adding the matches to the text before feeding the text to the search index. This way I would catch both the zero-padded original and the variant without leading zeroes. But it does feel a little clumsy...

Does anybody have a better idea on how to do this using a full text search dictionary (or something similar)?

The texts I am search look like this:

'Max Mustermann 00023131021 Wunderstr. 17 Trallerhausen 002ed2323'

Example - they are typically much longer in reality and contain up to 50 different IDs.
Here, I would like to hit on '00023131021', '23131021', '2ed2323 & Wunder:*', etc.

Your current index definition and an example query would be helpful, too. And are leading zeros always insignificant? So we could remove them from the index? What about words like 0 or 000? — Erwin Brandstetter, Apr 12 '18 at 13:51
If you're willing to implement this with a bit of C code, then it's a simple task for a custom dictionary. You may use dict_int as an example and skeleton. — Daniel Vérité, Apr 13 '18 at 10:24

Erwin Brandstetter · Answer 1 · 2018-04-30T01:56:02.773

Postgres text search only supports prefix matching, not postfix matching like you could use:

Get partial match from GIN indexed TSVECTOR column

Depending on exact requirements a number of workarounds come to mind.

If all leading zeros are insignificant, you could remove those from the indexed column in an expression index like:

CREATE INDEX foo_idx ON tbl
USING GIN (to_tsvector('simple', regexp_replace(col, '\m0+', '', 'g')));

Or use a regular expression fitting your actual requirements:

dbfiddle here

Then match the index in queries like:

SELECT *
FROM   tbl
WHERE  to_tsvector('simple', regexp_replace(col, '\m0+', '', 'g'))
    @@ to_tsquery ('simple', '2ed2323 & Wunder:*');

Remove leading zeroes from the tsquery value as well - if those can be contained:

   ...
    @@ to_tsquery( 'simple', regexp_replace(my_querystring, '\m0+', '', 'g'));

PostgreSQL GIN index not used when ts_query language is fetched from a column

Postgres: Dictionary to remove leading zeroes from alpha-numerical words

1 Answers1