While trying to find the fastest way to do a bulk update of a large table, I came up with the following plan:
I have main_table (potentially above 1,000 million rows) and an update table of up to a few million rows. Rows in update are keyed the same way as main_table (they have corresponding id columns) and hold only the data that has actually changed.
- copy rows from `main_table` into `temp_table` whose `id` is not in the `update` id table (i.e. rows whose data has not actually changed)
- simple append of the `update` table onto `temp_table`
- drop `main_table`
- rename `temp_table` to `main_table`
- renumber the `main_table` pk
- create the next `update` with Python and repeat.
The above is to be done with the following sequence of queries:
-- create an empty clone of main_table (structure, indexes, defaults)
CREATE TABLE temp_table (like main_table including all);

-- copy over the rows that are not being replaced
INSERT INTO temp_table (listing_id,date,available,price,timestamp)
    (SELECT listing_id,date,available,price,timestamp
     FROM main_table c
     WHERE NOT EXISTS (SELECT
                       FROM update
                       WHERE id = c.id));

-- append the changed rows from update
INSERT INTO temp_table (listing_id,date,available,price,parsing_timestamp)
    (SELECT listing_id,date,available,price,timestamp
     FROM update c);

-- swap the tables
DROP TABLE main_table cascade;
ALTER TABLE temp_table RENAME TO main_table;

-- renumber the primary key
ALTER TABLE main_table DROP COLUMN id;
ALTER TABLE main_table ADD COLUMN id SERIAL PRIMARY KEY;

-- discard the consumed update table
DROP TABLE update cascade;
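As a side note on the swap: since DDL is transactional in PostgreSQL, the drop and rename could presumably be wrapped in a single transaction so that concurrent readers never see main_table missing. A minimal sketch, assuming a brief ACCESS EXCLUSIVE lock on both tables is acceptable:

-- sketch: swap the tables atomically (assumes short exclusive locks are OK)
BEGIN;
DROP TABLE main_table CASCADE;
ALTER TABLE temp_table RENAME TO main_table;
COMMIT;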
While debugging, I found the slow statement using EXPLAIN (the plan is for the INSERT, run after the CREATE TABLE):

CREATE TABLE temp_table (like main_table including all);

EXPLAIN (ANALYZE, BUFFERS)
INSERT INTO temp_table (listing_id,date,available,price,timestamp)
    (SELECT listing_id,date,available,price,timestamp
     FROM main_table c
     WHERE NOT EXISTS (SELECT
                       FROM update
                       WHERE id = c.id));
This is the result of EXPLAIN (ANALYZE, BUFFERS), as requested in the comments:
QUERY PLAN
Merge Anti Join  (cost=513077.42..4789463.48 rows=109800833 width=40) (actual time=1216.018..48520.564 rows=112757269 loops=1)
  Merge Cond: (c.id = update.id)
  Buffers: shared hit=4 read=1359891 written=2838, temp read=8701 written=15077
  ->  Index Scan using cals_pkey on cals c  (cost=0.57..3958576.01 rows=113119496 width=44) (actual time=0.857..29573.798 rows=113119497 loops=1)
        Buffers: shared hit=1 read=1330192 written=2838
  ->  Sort  (cost=513076.85..521373.51 rows=3318663 width=8) (actual time=1215.147..1260.191 rows=362229 loops=1)
        Sort Key: update.id
        Sort Method: external merge  Disk: 58480kB
        Buffers: shared hit=3 read=29699, temp read=8701 written=15077
        ->  Seq Scan on update  (cost=0.00..62881.63 rows=3318663 width=8) (actual time=0.179..423.100 rows=3318663 loops=1)
              Buffers: shared read=29695
Planning Time: 0.259 ms
Execution Time: 52757.349 ms
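One detail that stands out is that the Sort on update spills to disk (Sort Method: external merge, Disk: 58480kB). Presumably a larger per-session work_mem would let that sort stay in memory; a minimal sketch, where the 128MB figure is only an illustration and assumes the server has RAM to spare:

-- sketch: raise the per-sort memory budget for this session only (value is illustrative)
SET work_mem = '128MB';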
Beyond that, how can I optimize these queries?
Comments:

- "Sorts then?" – Dmitriy Grankin, Jun 07 '20 at 12:55
- "`EXPLAIN (ANALYZE, BUFFERS)`? If not, could you please replace your plan with the result of that command?" – Vérace, Jun 07 '20 at 13:59
- "`CREATE TABLE temp_table (like main_table including all);` So you should disclose the table definition to give the full picture. It's unclear why you need to drop and recreate the `id` column at the end. Do you need gapless numbers? Would your use case benefit from physically sorted rows after the `UPDATE` operation?" – Erwin Brandstetter, Jun 08 '20 at 00:20
- "`parsing_timestamp` <> `timestamp`?" – Erwin Brandstetter, Jun 08 '20 at 00:26