While trying to find the fastest way to do a bulk update of a large table, I came up with the following plan:
I have main_table (potentially above 1,000 million rows) and an update table of up to a few million rows. Rows in update are keyed the same way as main_table (they have corresponding id columns) and hold only the data that has actually changed.
- copy rows from `main_table` into `temp_table` whose `id` is not in the `update` id table (i.e. rows whose data has not actually changed)
- simple append of the `update` table onto `temp_table`
- drop `main_table`
- rename `temp_table` to `main_table`
- renumber the `main_table` pk
- create the next `update` with Python and repeat.
The above is to be done with the following sequence of queries:
-- create an empty clone of main_table (structure, indexes, defaults)
CREATE TABLE temp_table (like main_table including all);

-- copy over the rows that are not being replaced
INSERT INTO temp_table (listing_id,date,available,price,timestamp)
    (SELECT listing_id,date,available,price,timestamp
     FROM main_table c
     WHERE NOT EXISTS (SELECT
                       FROM update
                       WHERE id = c.id));

-- append the changed rows from update
INSERT INTO temp_table (listing_id,date,available,price,parsing_timestamp)
    (SELECT listing_id,date,available,price,timestamp
     FROM update c);

-- swap the tables
DROP TABLE main_table cascade;
ALTER TABLE temp_table RENAME TO main_table;

-- renumber the primary key
ALTER TABLE main_table DROP COLUMN id;
ALTER TABLE main_table ADD COLUMN id SERIAL PRIMARY KEY;

-- discard the consumed update table
DROP TABLE update cascade;
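As a side note on the swap: since DDL is transactional in PostgreSQL, the drop and rename could presumably be wrapped in a single transaction so that concurrent readers never see main_table missing. A minimal sketch, assuming a brief ACCESS EXCLUSIVE lock on both tables is acceptable:

-- sketch: swap the tables atomically (assumes short exclusive locks are OK)
BEGIN;
DROP TABLE main_table CASCADE;
ALTER TABLE temp_table RENAME TO main_table;
COMMIT;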
While debugging, I found the slow statement using EXPLAIN (the plan is for the INSERT, run after the CREATE TABLE):

CREATE TABLE temp_table (like main_table including all);

EXPLAIN (ANALYZE, BUFFERS)
INSERT INTO temp_table (listing_id,date,available,price,timestamp)
    (SELECT listing_id,date,available,price,timestamp
     FROM main_table c
     WHERE NOT EXISTS (SELECT
                       FROM update
                       WHERE id = c.id));
This is the result of EXPLAIN (ANALYZE, BUFFERS), as requested in the comments:
QUERY PLAN
Merge Anti Join  (cost=513077.42..4789463.48 rows=109800833 width=40) (actual time=1216.018..48520.564 rows=112757269 loops=1)
  Merge Cond: (c.id = update.id)
  Buffers: shared hit=4 read=1359891 written=2838, temp read=8701 written=15077
  ->  Index Scan using cals_pkey on cals c  (cost=0.57..3958576.01 rows=113119496 width=44) (actual time=0.857..29573.798 rows=113119497 loops=1)
        Buffers: shared hit=1 read=1330192 written=2838
  ->  Sort  (cost=513076.85..521373.51 rows=3318663 width=8) (actual time=1215.147..1260.191 rows=362229 loops=1)
        Sort Key: update.id
        Sort Method: external merge  Disk: 58480kB
        Buffers: shared hit=3 read=29699, temp read=8701 written=15077
        ->  Seq Scan on update  (cost=0.00..62881.63 rows=3318663 width=8) (actual time=0.179..423.100 rows=3318663 loops=1)
              Buffers: shared read=29695
Planning Time: 0.259 ms
Execution Time: 52757.349 ms
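One detail that stands out is that the Sort on update spills to disk (Sort Method: external merge, Disk: 58480kB). Presumably a larger per-session work_mem would let that sort stay in memory; a minimal sketch, where the 128MB figure is only an illustration and assumes the server has RAM to spare:

-- sketch: raise the per-sort memory budget for this session only (value is illustrative)
SET work_mem = '128MB';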
Beyond that, how can I optimize these queries?
Comments:

- "Sorts then?" – Dmitriy Grankin, Jun 07 '20 at 12:55
- "`EXPLAIN (ANALYZE, BUFFERS)`? If not, could you please replace your plan with the result of that command?" – Vérace, Jun 07 '20 at 13:59
- "`CREATE TABLE temp_table (like main_table including all);` So you should disclose the table definition to give the full picture. It's unclear why you need to drop and recreate the `id` column at the end. Do you need gapless numbers? Would your use case benefit from physically sorted rows after the `UPDATE` operation?" – Erwin Brandstetter, Jun 08 '20 at 00:20
- "`parsing_timestamp` <> `timestamp`?" – Erwin Brandstetter, Jun 08 '20 at 00:26