2

I have two tables. Table logfile contains whole text file in one attribute; table rawtext should contain one line of text from the text file in each row:

create table LOGFILE ( LOGF_PK serial primary key, TEXT_CONTENT text ) ;
create table LOGTEXT ( LT_PK serial primary key, 
  LOGF_PK integer not null references LOGFILE( LOGF_PK ),
  LINENUM  integer,
  LINETEXT varchar(255)
);

I can split TEXT_CONTENT into a set of rows, but I am unable (don't know how) to generate line numbers in table LOGTEXT, e.g. the first line from LOGFILE.TEXT_CONTENT would be in a row <LT_PK>, <LOGF_PK>, 1, 'the first line of text', etc.

This query returns what I expect to obtain, but without line numbers:

select LOGF_PK, regexp_split_to_table( TEXT_CONTENT, '\r?\n' ) as LINETEXT ;

However, I cannot obtain any line numbers. I tried variant 1:

select LOGF_PK
  , row_number() over (partition by LOGFILE.LOGF_PK) as LINENUM
  , regexp_split_to_table( TEXT_CONTENT, '\r?\n' )) as LINETEXT
from LOGFILE

In the above example, LINENUM is always 1...

And when I try another possibility, variant 2:

select LOGF_PK
  , regexp_split_to_table( TEXT_CONTENT, '\r?\n' )) with ORDINALITY
from LOGFILE

I get error

ERROR:  syntax error at or near "with"
LINE 3:   , regexp_split_to_array( content, '\r?\n' ) with ordinalit...
                                                      ^

Unfortunately, there are very few examples of WITH ORDINALITY in PostgreSQL documentation and the few that are there are very general and unexplaining. And I suspect that some examples from documentation have never worked on a real system.

Database is PostgreSQL 12.8

  • 1
    "And I suspect that some examples from documentation has never worked on a real system" - all examples from the manual will work on a real system. –  Oct 10 '21 at 20:54
  • I see. So it's just lack of complete, illustrative examples of non-trivial features such as this one. After reading the section in the manual up and down, I was unable to make a reasonably working test query. I think it's because in showing the example, they assume some kind of additional knowledge or experience not covered by the manual. – Jindrich Vavruska Oct 11 '21 at 06:58
  • This old answer stills covers everything: https://dba.stackexchange.com/a/27287/3684 – Erwin Brandstetter Oct 11 '21 at 11:44
  • @ErwinBrandstetter I am sorry, but it doesn't. It is identical with the examples in the docs. It shows an example how to handle array as table, but it shows nothing as to how to expand SELECT from table with array-type columns. The key is using FROM x, LATERAL <function returning set> and explaining how use this. There is no such example in the current docs (ie. up to v14), or it is hidden so well that I was unable to find one comprehensive example after several hours of intensive, focused search. IMHO, this feature should be documented better, with a clear example. – Jindrich Vavruska Oct 11 '21 at 19:41
  • @ErwinBrandstetter ... and the other example of LATERAL does not show the main power of the clause because it refers to a different table. How much explanatory is SELECT * FROM t, LATERAL (SELECT * from b ... etc )? It does not explain anything, it just kind of shows that JOIN and LATERAL are more or less the same and does not show any specific benefit of LATERAL compared to JOIN. Such benefit is, however, available when you need to expand array-type column into a set of rows. That's why this example does not help understanding the feature. – Jindrich Vavruska Oct 11 '21 at 19:58
  • LATERAL is not an alternative to JOIN, but a modifier. There is no LATERAL without JOIN. Don't get confused by the syntax shorthand , LATERAL (or even just , before table-functions in the FROM list, both are just short for CROSS JOIN LATERAL. There is a link in my answer to detailed explanation. – Erwin Brandstetter Oct 11 '21 at 21:22
  • The second query in my answer is exactly what you need. SELECT logf_pk, x.* FROM logfile l, regexp_split_to_table(l.text_content, '\r?\n') WITH ORDINALITY x(linetext, linenum);. There is no "different" table. jkavalik and a_horse posted queries to that effect in their answers - one which should be the accepted answer. – Erwin Brandstetter Oct 11 '21 at 21:23
  • 1
    Sorry, you are right. It is so much at the top, and the rest is a different kind of text, that I missed it. I chose the first answer, it is descriptive and explaining. But all three of them are excellent. – Jindrich Vavruska Oct 12 '21 at 08:51

3 Answers3

5

WITH ORDINALITY can only be used when the generating function is used in place of a table, a lateral subquery can allow to use it that way for example:

select
  LOGF_PK,
  LINENUM,
  LINETEXT
from LOGFILE,
lateral regexp_split_to_table( TEXT_CONTENT, '\r?\n' )
            WITH ORDINALITY as t(LINETEXT,LINENUM) ;

another possibility is to move the row_number() call into an outer query:

select *, row_number() over (partition by tmp.LOGF_PK) as LINENUM
from (select LOGF_PK
  , regexp_split_to_table( TEXT_CONTENT, '\r?\n' ) as LINETEXT 
  from LOGFILE
) tmp;

http://sqlfiddle.com/#!17/93ee4/18

jkavalik
  • 5,080
  • 1
  • 12
  • 20
  • Thank you so much! Right now, I am not at the place where I can test it, but I will ASAP. The feature description in PostgreSQL docs is kind of incomplete and confusing. And, of course, I was more confused (or rather misled) when regexp_split_to_table gave more or less the result I expected. Does PostgreSQL documentation lack good examples or did I just miss the right place in the docs? – Jindrich Vavruska Oct 11 '21 at 06:47
  • Imho the documentation does not really give actual examples on the page where the feature is documented. I found a (blog)post instead that gave some and allowed me to compare with your code - https://paquier.xyz/postgresql-2/postgres-9-4-feature-highlight-with-ordinality/ – jkavalik Oct 11 '21 at 06:50
  • Works perfectly, thank you! In my system, I used relatively complex typescript code to handle the input text (split to lines, then split certain lines into array, then test specific pattern at certain array indices, then reorganize the data into a database record). By using regexp_split...() I can do some of the work directly in SQL and avoid some of the complexity, while keeping all data in the database (ie. no need to reference external files). – Jindrich Vavruska Oct 11 '21 at 19:46
3

Move the function to the FROM clause (where it belongs), then you can use with ordinality

select l.LOGF_PK,
       t.linetext,
       t.linenum
from LOGFILE l
  cross join regexp_split_to_table(l.TEXT_CONTENT, '\r?\n' ) with ordinality as t(linetext, linenum)
3

The problem is that you are using a rather confusing feature of PostgreSQL, which is to place a set-returning function in the SELECT, rather than in the FROM. The documentation on it is difficult to find, so it's not surprising most don't understand how it works.

What it will do is apply that set to each row of the resultset, which means that the row-numbering is calculated before the set is laterally applied to the original resultset.

Effectively, your code is transformed

select LOGF_PK
  , row_number() over (partition by LOGFILE.LOGF_PK) as LINENUM
  , regexp_split_to_table( TEXT_CONTENT, '\r?\n' )) as LINETEXT
from LOGFILE

to this

SELECT *
FROM (
    select LOGF_PK
      , row_number() over (partition by LOGFILE.LOGF_PK) as LINENUM
    FROM LOGFILE
) LOGFILE
LATERAL regexp_split_to_table( TEXT_CONTENT, '\r?\n' )) as LINETEXT

So it's quite obvious now why the row-number doesn't work.

Instead place the lateral join in the correct position in the FROM

select LOGF_PK
  , row_number() over (partition by LOGFILE.LOGF_PK) as LINENUM
  , LINETEXT.LINETEXT
from LOGFILE,
LATERAL regexp_split_to_table( TEXT_CONTENT, '\r?\n' )) as LINETEXT

If you want to ensure the row-numbering is in order of the results from regexp_split_to_table, that is when you use WITH ORDINALITY, which returns an extra column to sort by

select LOGF_PK
  , row_number() over (partition by LOGFILE.LOGF_PK ORDER BY LINETEXT.ordinality) as LINENUM
  , LINETEXT.LINETEXT
from LOGFILE,
LATERAL regexp_split_to_table( TEXT_CONTENT, '\r?\n' )) WITH ORDINALITY as LINETEXT
Charlieface
  • 12,780
  • 13
  • 35
  • I agree it's a bit confusing, but it's not undocumented. Quote from the manual "Output expressions that contain set-returning functions are effectively evaluated after sorting and before limiting, so that LIMIT will act to cut off the output from a set-returning function" –  Oct 11 '21 at 05:26
  • 1
    I did give a link, and I meant "somewhat undocumented" because it's buried in the Extending SQL section of the manual, rather than in the section on table functions https://www.postgresql.org/docs/current/queries-table-expressions.html#QUERIES-TABLEFUNCTIONS, it's not something the average tutorial is going to go into detail on. Although I have clarified that a bit better now – Charlieface Oct 11 '21 at 05:52
  • Thank you! I did not understand LATERAL at all, had no idea how to use it. Maybe the examples are not descriptive enough, or maybe I am just dumb :) – Jindrich Vavruska Oct 11 '21 at 06:50
  • Well, in your last example, ORDINALITY (LINETEXT.ordinality) does exactly the same thing I wanted to achieve by means of ROW_NUMBER(). – Jindrich Vavruska Oct 11 '21 at 06:52
  • 1
    Not quite, because ORDINALITY guarantees that that ordinal will be the same order as the rows that come out of the set function, whereas just using ROW_NUMBER does not – Charlieface Oct 11 '21 at 08:20
  • I see, @Charlieface, so ORDINALITY is more deterministic than ROW_NUMBER and it is exactly the purpose I wanted to achieve. – Jindrich Vavruska Oct 11 '21 at 10:39