6

I'm trying to insert data from one PostgreSQL 9.3 table to another while removing duplicate entries.

I have two tables containing email addresses.

The main table contains emails and a tag for each email address. The combination (email, tag) must be unique, and for this there is a constraint Unique(email, tag) on it.

The second table get created on the fly from text files which contain only the email address. There are many duplicates there.

I need to import the data from the temp table into the main one without breaking the constraint above. For a certain file containing email addresses the tag is constant.

Table structures:

CREATE TABLE emails (   
  email character varying(128),
  tag bigint,
  CONSTRAINT "unique-tag-email" UNIQUE (email, tag) )

and

CREATE TABLE emails_temp (email character varying(128)

Here is my query:

insert into emails(tag,email) 
select 
 655,t.email 
from 
 emails_temp as t 
where 
 not exists ( select email from emails where email = t.email )

Note: 655 is just a tag of a certain group of email addresses.

This is the error I get:

ERROR: duplicate key value violates unique constraint "unique-tag-email" SQL state: 23505 Detail: Key (email, tag)=(user@hotmail.com, 655) already exists.

There are indeed, two email addresses user@hotmail.com in the file.

Useless to say, because of this error, there is nothing added in the main table (email).

What am I doing wrong?

Sorin
  • 63
  • 1
  • 1
  • 4
  • You are not doing anything wrong. You are getting what you coded. 1st There is a unique key. 2nd The where condition in the correlated query select results that violate the unique key....Since user@hotmail.com, 655 already exists, the constraint is violated and therefore the DML won't succeed – Jason Krs Jan 09 '17 at 21:58
  • 2
    well, to insert distinct values you may want to try selecting distinct values: insert ... select distinct 655, t.email ... – mustaccio Jan 09 '17 at 22:18
  • Thanks for the comments.However, I was under the assumption that the clause where not exists ( select email from emails where email = t.email ) is there exactly to prevent email addresses to be duplicated.

    Select distinct doesn't work either. I get the same error as before.

    – Sorin Jan 10 '17 at 09:14

1 Answers1

5

There are 3 possible kinds of duplicates:

  1. Duplicates within the rows of the bulk insert.

  2. Duplicates between inserted rows and existing rows.

  3. Duplicates between inserted rows and concurrently inserted / updated rows from other transactions.

Just like I explained in this closely related answer:

But things have become easier for 2. and 3. since Postgres 9.5 introduced UPSERT (INSERT .. ON CONFLICT DO NOTHING).

INSERT INTO emails(tag,email)
SELECT DISTINCT 655, email
FROM   emails_temp
ON CONFLICT (email) DO NOTHING;

If your duplicates only stem from duplicate entries in the source (1.), like you indicated, then all you need is DISTINCT. Works in any version of Postgres:

INSERT INTO emails(tag,email)
SELECT DISTINCT 655, email
FROM   emails_temp;
Erwin Brandstetter
  • 175,982
  • 27
  • 439
  • 600
  • Erwin, thanks for the idea. I read about it too, but I can't upgrade to PSQL 9.5. I am stuck with 9.3 for some other reasons which I can't control. – Sorin Jan 10 '17 at 09:21
  • @Sorin: Seems like all you need is DISTINCT, really. I added a bit above. – Erwin Brandstetter Jan 10 '17 at 14:09
  • 1
    For previous versions without ON CONFLICT: INSERT INTO emails (tag,email) SELECT DISTINCT 655, email FROM emails_temp EXCEPT SELECT tag, email FROM emails WHERE tag = 655 ; – ypercubeᵀᴹ Jan 10 '17 at 15:25