
I am no DBA, and that is becoming more apparent daily: I am having trouble keeping my query times reasonable. I have about 3.57 million records (about 3.43 GB), and simple queries take about 2 minutes to run. I have done the simple things I know to do, like indexing, but I don't need to keep most of the data around, so I was thinking I could add a field called 'days2keep' so that entries I need to keep for longer could be marked in the record itself. A cron job would then run daily or weekly to cull any records older than their 'days2keep' value. I have looked around and have not seen any references to a similar idea. Does it make sense to do it this way, or are there pitfalls I am unaware of?
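For concreteness, the cron job would run something like this (a minimal sketch only; it assumes a created_at timestamp column alongside the days2keep field, and the table/column names are just placeholders):

```sql
-- Daily/weekly cull: delete anything older than its own retention window.
-- MySQL syntax; assumes my_table(created_at TIMESTAMP, days2keep INT).
DELETE FROM my_table
WHERE created_at < NOW() - INTERVAL days2keep DAY;

-- Rough PostgreSQL equivalent:
-- DELETE FROM my_table
-- WHERE created_at < now() - make_interval(days => days2keep);
```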

FWIW: I don't think it matters, but I am considering switching from MySQL to PostgreSQL.

Update

After looking around, it seems the more common approach is horizontal partitioning of the data, but since I only need to keep some records for much longer, my original solution still seems best. I am just a little wary of spending the time implementing it without some feedback from more experienced users.

TODO

Thanks to everyone for the insights below. The biggest takeaway here is that I still have a LOT to learn and understand, so here are the steps I plan to take:

1) Analyze my queries to make sure I understand why they are taking so long.
2) Look into multi-column indexing to see if that might solve my problem (see the sketch after this list).
3) Figure out the type of partitioning I might consider (horiz. not an option).
4) Figure out which solution keeps the database management and complexity to a minimum (esp. with regard to backup/restore and replication to dev space). I am using Django, so up until now it has been fairly easy to manage (for a non-DBA person).
5) Pick a solution and implement it
6) Ask for a raise
7) drink beer
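A minimal sketch of what steps 1 and 2 might look like (MySQL syntax; the table and column names here are made up for illustration, and PostgreSQL has an equivalent EXPLAIN / EXPLAIN ANALYZE):

```sql
-- Step 1: check whether the slow query is doing a full table scan
-- or using an index badly.
EXPLAIN
SELECT *
FROM my_table
WHERE source = 'web01'
  AND created_at >= '2015-07-01';

-- Step 2: a multi-column index covering both filter columns, so the
-- query above can be answered with a single index range scan.
CREATE INDEX idx_my_table_source_created_at
    ON my_table (source, created_at);
```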

stephenmm
  • Rather than days to keep, store an expiration timestamp/date. – Craig Ringer Jul 13 '15 at 23:23
  • If you don't need to read all the rows every time, you could have cron jobs use an expiration timestamp (as @CraigRinger suggested) to move the information to another table with the same structure, and then use UNIONs to make reports or other SELECTs. – oNare Jul 13 '15 at 23:32
  • It surely depends on your queries, but 2 minutes seems like a table scan or a badly used index scan / big range. Are you sure your indexes are right? Do you use multi-column indexes? – jkavalik Jul 14 '15 at 06:25
  • You say "I have looked around and I have not seen any references to a similar idea". Take a look here (posted only 2 hours after your own question) for an idea of how it might be done. That example is for PostgreSQL, but the principle is the same. Take a look at Erwin Brandstetter's rep - if he considers it a valid solution to a problem, then it probably is. This concept comes up all the time here - you are not alone :-). BTW, if I were you, I'd go with PostgreSQL - a far better database. – Vérace Jul 14 '15 at 09:02
  • @Craig: Why? One reason I can see is that if you are looking at the raw record it would be easier to see when it is going to be deleted, but are there other reasons why this would be better? oNare: I will read up on that. jkavalik: Thanks for the insight; I need to start analyzing my queries to see why they are non-performant. Since I do not know what multi-column indexes are, it seems I still have a bit of homework to do. Vérace: Partitioning seems like it could be a good solution, but it seems a little more complex than a cron job to DELETE. – stephenmm Jul 14 '15 at 17:05
  • Honestly, 3.5M rows should not be a problem if properly indexed. – timpone Jul 14 '15 at 18:58
  • @stephenmm If you use a timestamp, then figuring out what to delete is a simple index scan. If you store days to retain (and presumably a creation timestamp), then more processing is needed to determine what to delete. – Craig Ringer Jul 14 '15 at 23:23
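For comparison, a sketch of the expiration-timestamp variant Craig Ringer describes above (table and column names are illustrative; existing rows would need their expires_at backfilled):

```sql
-- Store the expiry itself instead of days2keep, and index it,
-- so the cull is a plain index range scan.
ALTER TABLE my_table ADD COLUMN expires_at TIMESTAMP;
CREATE INDEX idx_my_table_expires_at ON my_table (expires_at);

-- The cron job then becomes:
DELETE FROM my_table WHERE expires_at < NOW();
```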
