In Postgres 9.3.5, I'm importing records from an external source where duplicates are VERY rare, but they do happen. Given a readings table with a unique compound key on (real_time_device_id, recorded_at), the following will fail once in a blue moon:
INSERT INTO readings (real_time_device_id, recorded_at, duration) VALUES
('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:33 -0700', 10.0),
... many more records ...
('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:43 -0700', 10.0);
(FWIW, the above fails 'properly' with a duplicate key violation.)
I know that handling exceptions is expensive, but as I said, duplicate entries are very rare. So to keep the code simple, I followed an example given in "Optimal way to ignore duplicate inserts?":
BEGIN
INSERT INTO readings (real_time_device_id, recorded_at, duration) VALUES
('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:33 -0700', 10.0),
... many more records ...
('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:43 -0700', 10.0);
EXCEPTION WHEN unique_violation THEN
-- silently ignore inserts
END;
Running the above produces two errors:
psql:sketches/t15.sql:11: ERROR: syntax error at or near "INSERT"
LINE 2: INSERT INTO readings (real_time_device_id, recorded_...
^
psql:sketches/t15.sql:14: ERROR: syntax error at or near "EXCEPTION"
LINE 1: EXCEPTION WHEN unique_violation THEN
^
Can anyone set me straight on the correct syntax? Or is my error deeper than mere syntax? (For example, will all of the inserted rows be ignored if there is a single duplicate?)
Generally, what's a good way to do bulk inserts where very few (< 0.1%) are duplicates?
That BEGIN ... EXCEPTION ... END block is pl/pgsql syntax, so it only works inside a pl/pgsql function. You can either create a real function using it, or use a DO block. However, be careful: this way you won't silently skip just the failing row. The whole thing won't throw an error, but you won't have any lines inserted, either. – András Váczi Dec 02 '14 at 09:16
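For illustration, a minimal sketch of the DO-block form the comment describes, using the table and sample rows from the question (this is Postgres 9.3, which predates the INSERT ... ON CONFLICT clause added in 9.5):

DO $$
BEGIN
    INSERT INTO readings (real_time_device_id, recorded_at, duration) VALUES
        ('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:33 -0700', 10.0),
        ('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:43 -0700', 10.0);
EXCEPTION WHEN unique_violation THEN
    NULL;  -- swallow the error; as noted above, the whole INSERT is rolled back, not just the duplicate row
END;
$$;

As the comment warns, the exception handler discards the entire statement, so a single duplicate loses the whole batch. If the goal is instead to keep the non-duplicate rows, one 9.3-era alternative is to filter the batch with an anti-join before inserting; a sketch under the same assumed schema:

-- depending on the actual column types, the VALUES rows may need explicit casts
INSERT INTO readings (real_time_device_id, recorded_at, duration)
SELECT v.real_time_device_id, v.recorded_at, v.duration
FROM (VALUES
    ('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:33 -0700', 10.0),
    ('150', TIMESTAMP WITH TIME ZONE '2014-11-01 23:06:43 -0700', 10.0)
) AS v (real_time_device_id, recorded_at, duration)
WHERE NOT EXISTS (
    SELECT 1
    FROM readings r
    WHERE r.real_time_device_id = v.real_time_device_id
      AND r.recorded_at = v.recorded_at
);

Note that the anti-join only guards against rows already in the table; duplicates within the VALUES list itself, or rows inserted concurrently, can still raise a unique_violation.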