I have a huge table - 36 million rows - in SQLite3. In this very large table, there are two columns:

  • hash - text
  • d - real

Some of the rows are duplicates. That is, both hash and d have the same values. If two hashes are identical, then so are the values of d. However, two identical d's do not imply two identical hashes.

I want to delete the duplicate rows. I don't have a primary key column.

What's the fastest way to do this?

jww · Patches
  • Please place answers in Answer blocks. Later, you can accept your own Answer. Also see [How does accepting an answer work?](https://meta.stackexchange.com/q/5234/173448) – jww Nov 02 '19 at 02:51

3 Answers


You need a way to distinguish the rows. Based on your comment, you could use the special rowid column for that.

To delete duplicates by keeping the lowest rowid per (hash,d):

DELETE FROM YourTable
WHERE rowid NOT IN
(
    SELECT MIN(rowid)
    FROM   YourTable
    GROUP BY hash, d
);
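As a quick sanity check, here is the same query run through Python's built-in `sqlite3` module against a tiny in-memory table. The table name matches the question; the sample rows are made up for illustration:

```python
import sqlite3

# Hypothetical sample data on the (hash, d) schema from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (hash TEXT, d REAL)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?)",
    [("a", 1.0), ("a", 1.0), ("b", 2.0), ("a", 1.0), ("c", 3.0)],
)

# Keep the lowest rowid per (hash, d) group; delete everything else.
conn.execute("""
    DELETE FROM YourTable
    WHERE rowid NOT IN (
        SELECT MIN(rowid) FROM YourTable GROUP BY hash, d
    )
""")
conn.commit()

print(conn.execute("SELECT hash, d FROM YourTable ORDER BY hash").fetchall())
# → [('a', 1.0), ('b', 2.0), ('c', 3.0)]
```

On a 36-million-row table this runs in one statement, so no rows need to be pulled into the client.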
lnafziger · Andomar
  • SQLite doesn't let you add a primary key column, does it? – Patches Nov 19 '11 at 00:10
  • `sqlite> alter table dist add id integer primary key autoincrement; Error: Cannot add a PRIMARY KEY column` – Patches Nov 19 '11 at 00:18
  • Interesting! The part you need is the `autoincrement` though, does it work if you omit the `primary key` part? – Andomar Nov 19 '11 at 00:42
  • `sqlite> alter table dist add id integer autoincrement;` `Error: near "autoincrement": syntax error` Edit: SQLite does have a "rowid" pseudo column type thingy that is automatically there, could I use this? – Patches Nov 19 '11 at 01:09
  • 1
    `delete from dist where rowid not in (select max(rowid) from dist group by hash);` Appears to do the trick! Thanks. – Patches Nov 19 '11 at 01:19
  • 1
    Interesting again, didn't know about rowid! It would meet the requirement since the column has to be unique and comparable. I've edited the answer. – Andomar Nov 19 '11 at 01:19
  • Can it work for min(timestamp) as well? If I cannot add rowId in BQ? – Elad Benda Sep 22 '21 at 00:31

I guess the fastest would be to use the database itself for this: create a new table with the same columns, but with proper constraints (a unique index on the (hash, d) pair), iterate through the original table and try to insert each record into the new table, ignoring constraint-violation errors (i.e. continue iterating when an exception is raised).

Then drop the old table and rename the new one to take its place.
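In SQLite specifically you don't even need to catch exceptions: `INSERT OR IGNORE` silently skips any row that would violate the unique constraint. A minimal sketch of the rebuild, using the table name `dist` from the comments above (`dist_new` is a name I made up, and the sample rows are invented):

```python
import sqlite3

# Hypothetical sample data with one duplicate (hash, d) pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dist (hash TEXT, d REAL)")
conn.executemany("INSERT INTO dist VALUES (?, ?)",
                 [("a", 1.0), ("a", 1.0), ("b", 2.0)])

# New table with a unique constraint; INSERT OR IGNORE drops duplicates.
conn.execute("CREATE TABLE dist_new (hash TEXT, d REAL, UNIQUE (hash, d))")
conn.execute("INSERT OR IGNORE INTO dist_new SELECT hash, d FROM dist")

# Swap the new table in for the old one.
conn.execute("DROP TABLE dist")
conn.execute("ALTER TABLE dist_new RENAME TO dist")
conn.commit()

print(conn.execute("SELECT hash, d FROM dist ORDER BY hash").fetchall())
# → [('a', 1.0), ('b', 2.0)]
```

Because the original table is untouched until the final `DROP`, you can inspect `dist_new` first and abandon the operation if anything looks wrong.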

MaDa
  • Not as elegant as simply altering the table, I guess, BUT one really good thing about your approach is that you can re-run it as many times as you like without touching/destroying the source data until you're absolutely happy with the results. – Adrian K May 21 '16 at 20:13

If adding a primary key is not an option, then one approach would be to store one distinct copy of each duplicated row in a temp table, delete all of the duplicated records from the existing table, and then insert the rows from the temp table back into the original table.

For example (written for SQL Server 2008, but the technique is the same for any database):

DECLARE @original AS TABLE([hash] varchar(20), [d] float)
INSERT INTO @original VALUES('A', 1)
INSERT INTO @original VALUES('A', 2)
INSERT INTO @original VALUES('A', 1)
INSERT INTO @original VALUES('B', 1)
INSERT INTO @original VALUES('C', 1)
INSERT INTO @original VALUES('C', 1)

DECLARE @temp AS TABLE([hash] varchar(20), [d] float)
INSERT INTO @temp
SELECT [hash], [d] FROM @original 
GROUP BY [hash], [d]
HAVING COUNT(*) > 1

DELETE O
FROM @original O
JOIN @temp T ON T.[hash] = O.[hash] AND T.[d] = O.[d]

INSERT INTO @original
SELECT [hash], [d] FROM @temp

SELECT * FROM @original

I'm not sure if sqlite has a ROW_NUMBER() type function, but if it does you could also try some of the approaches listed here: Delete duplicate records from a SQL table without a primary key
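For what it's worth, SQLite did later gain window functions, including ROW_NUMBER(), in version 3.25.0 (2018), so on a modern build that approach works too. A sketch on the same (hash, d) schema (sample rows invented; requires SQLite ≥ 3.25 underneath):

```python
import sqlite3

# Hypothetical sample data with one duplicate (hash, d) pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dist (hash TEXT, d REAL)")
conn.executemany("INSERT INTO dist VALUES (?, ?)",
                 [("a", 1.0), ("a", 1.0), ("b", 2.0)])

# Number the rows within each (hash, d) group and delete every row
# after the first one. Needs SQLite 3.25+ for ROW_NUMBER().
conn.execute("""
    DELETE FROM dist
    WHERE rowid IN (
        SELECT rowid FROM (
            SELECT rowid,
                   ROW_NUMBER() OVER (PARTITION BY hash, d) AS rn
            FROM dist
        )
        WHERE rn > 1
    )
""")
conn.commit()

print(conn.execute("SELECT hash, d FROM dist ORDER BY hash").fetchall())
# → [('a', 1.0), ('b', 2.0)]
```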

rsbarro