0

I'm trying to remove duplicate rows from a table with millions of rows. The field I want to check for duplicates on is too long (it's storing URLs) to put a UNIQUE index on. Is there any way to remove duplicates quickly?

The recommended method for removing duplicates:

DELETE t1 FROM table1 AS t1 JOIN table1 AS t2 ON t1.id>t2.id AND t1.name=t2.name;

Never seems to finish its job, though I suppose it might just require a lot of time to do.

One idea I've heard here is to create an MD5 hash column for indexing and comparison. Is this the recommended route? If so, should I be truncating this column for space/speed considerations?

Ben G
  • 25,319
  • 32
  • 97
  • 163

2 Answers2

0

The hash would give you a column you could put and index on so then t1.Name = t2.Name would be a far less expensive t1.Hash = t2.Hash. Adding the hash to 1,000,000 records would take a while though.

Another option if this is a one off would be to page the clear out e.g. something like

Where T1 >= 0 and T1 < 10000
Where T1 >= 10001 and T1 < 20000

etc.

Tony Hopkinson
  • 19,851
  • 3
  • 30
  • 38
0

I may not have thought this through, but it's worth a try:

  • Create a column, md5url, and fill it with the md5 of the url (UPDATE table1 SET md5url = MD5(url)
  • Make a (non-unique) index on that column md5url - md5 should be short enough to do so
  • change your statement to:

    DELETE t1 
    FROM table1 AS t1 
         JOIN table1 AS t2 
              ON t1.md5url = t2.md5url 
                 AND t1.name=t2.name 
                 AND t1.id>t2.id;
    

That way, the JOIN condition works primarily on an index. If that indexed column, md5url, fits, then we actually check the URL - because, depending on how many urls you have, it becomes possible at some point, that two URLs may have the same MD5. The third condition is clear - ensuring that only one of the two identical columns is deleted.

I would love to hear whether this works - it makes perfect sense in my head right now ;-)

Omesh
  • 26,034
  • 6
  • 40
  • 50
Lukx
  • 1,273
  • 2
  • 11
  • 20
  • It did work, but I can't say whether it went any faster than it would have w/o md5.. since it took like a whole day. Also, I wonder whether your approach actually saved time-- question is whether MySQL evaluated in a "lazy" way. See http://stackoverflow.com/questions/10747047/is-mysql-logic-evaluation-lazy-short-circuiting-in-join-clause and lemme know what you think – Ben G May 28 '12 at 22:13