0

I have a table where I need to delete any rows that have the same url and parent_url. The table has the following columns:

id, time, url, parent_url, source

with duplicates as follows:

1,12-24-2014,asdf.com, http://repeat.com, email
1,12-24-2014,zxzy.com, http://secondrepeat.com, email
2,12-26-2014,asdf.com, http://repeat.com, email
3,12-26-2014,zxzy.com, http://secondrepeat.com, email
4,12-27-2014,asdf.com, http://repeat.com, email

These would need to be reduced to the following 2 rows (having deleted 3 rows):

1,12-24-2014,asdf.com, http://repeat.com, email
1,12-24-2014,zxzy.com, http://secondrepeat.com, email

I have beeing trying to use a for-loop and have been looking for queries that follow this format, and have not found anything that has shed much light on how this count be done.

For example, I tried doing

DELETE FROM scraper WHERE (url, parent_url) IN (SELECT(url, parent_url) FROM scraper GROUP BY url, parent_url having count(*) > 1);
maudulus
  • 9,711
  • 9
  • 72
  • 108

1 Answers1

3

You can use not exists, for instance:

delete from scraper
     where exists (select 1
                   from scraper s2
                   where s2.url = scraper.url and s2.prent_url = s.parent_url and
                         s2.time < s.time
                  );

Or, you can adapt the referenced solutions for something like this:

delete from scraper
     where id not in (select min(id)
                      from scraper s2
                      group by s2.url, s2.prent_url 
                     );

It is unclear from the question whether time or id provides a better choice for the min().

Gordon Linoff
  • 1,198,228
  • 53
  • 572
  • 709