I have a table that currently holds 100M+ rows of telemetry data, and it regularly receives duplicate records.

I can identify the duplicate records using ROW_NUMBER() partitioned by the key columns, and I estimate there are approximately 12M duplicate rows.
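
For illustration, this is roughly how I identify them (dbo.Telemetry and its key columns DeviceId, ReadingTime, Value are placeholders for the real schema):

```sql
-- Count the duplicate rows: every row beyond the first per key is a duplicate.
;WITH numbered AS
(
    SELECT Id,
           ROW_NUMBER() OVER (PARTITION BY DeviceId, ReadingTime, Value
                              ORDER BY Id) AS rn
    FROM dbo.Telemetry
)
SELECT COUNT(*) AS duplicate_rows
FROM numbered
WHERE rn > 1;
```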

What is the best option for deleting the ~12M duplicate rows?

  • Insert the non-duplicates into a new table, drop the existing table, and recreate the indexes? (a rough sketch of what I mean follows the list)
  • Use a join to delete the duplicates in place?
  • Delete in batches? If so, what batch size?
  • Should I drop the indexes first?
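
To make the first option concrete, here is a rough sketch of what I have in mind (dbo.Telemetry, its columns, and the Telemetry_clean name are placeholders for the real schema):

```sql
-- Option 1 sketch: copy the de-duplicated rows into a new table, then swap.
SELECT Id, DeviceId, ReadingTime, Value
INTO dbo.Telemetry_clean
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY DeviceId, ReadingTime, Value
                              ORDER BY Id) AS rn
    FROM dbo.Telemetry
) AS x
WHERE x.rn = 1;

-- After verifying row counts:
-- DROP TABLE dbo.Telemetry;
-- EXEC sp_rename 'dbo.Telemetry_clean', 'Telemetry';
-- ...then recreate the indexes and constraints.
```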

There is no production requirement to keep the database online while this runs.

Thanks for your help.

  • This might be a better question for https://dba.stackexchange.com/ – DeanOC May 13 '22 at 21:32
  • The best way would be to restore a copy of the database and test the methods you suggest. Regardless of what anyone tells you here you certainly need to try it first, so you might as well start off trying all your options. – Dale K May 13 '22 at 21:42
  • Also see [Break large delete operations into chunks](https://sqlperformance.com/2013/03/io-subsystem/chunk-deletes) (including the comments). – Aaron Bertrand May 13 '22 at 21:46
  • As far as the "using a join", take a look at https://stackoverflow.com/questions/6025367/t-sql-deleting-all-duplicate-rows-but-keeping-one/6025929#6025929. It's not obvious that you can delete from a CTE, but you can. That said, it might make sense to determine the IDs of the dupes in a separate select operation, stash those in a temp table, and use that to drive your delete. As @DaleK suggests, try it out on a copy of your db. – Ben Thul May 16 '22 at 16:32
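
Putting the suggestions from the last two comments together, a minimal sketch of a staged, batched delete might look like this (dbo.Telemetry, its Id and key columns, and the 50,000-row batch size are placeholders to test and tune on a restored copy first):

```sql
SET NOCOUNT ON;

-- 1) Stage the Ids of the duplicate rows once, so the 100M-row table
--    is scanned only a single time for identification.
SELECT Id
INTO #dupe_ids
FROM (
    SELECT Id,
           ROW_NUMBER() OVER (PARTITION BY DeviceId, ReadingTime, Value
                              ORDER BY Id) AS rn
    FROM dbo.Telemetry
) AS x
WHERE x.rn > 1;

CREATE CLUSTERED INDEX cx_dupe_ids ON #dupe_ids (Id);

-- 2) Delete in batches so each transaction (and its log usage) stays small.
DECLARE @batch int = 50000;

WHILE 1 = 1
BEGIN
    DELETE TOP (@batch) t
    FROM dbo.Telemetry AS t
    JOIN #dupe_ids AS d
        ON d.Id = t.Id;

    IF @@ROWCOUNT = 0 BREAK;
END

DROP TABLE #dupe_ids;
```

With SIMPLE recovery (or regular log backups under FULL), batching like this keeps the transaction log from having to hold the entire 12M-row delete at once.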
