We had an issue relating to duplicated data in our database, with a date field having multiple values where we were meant to have 1. I thought I'd add the way we resolved the issue for reference.
We have a collection called "data" with a numeric "value" field and a date "date" field. We had a process which we thought was idempotent, but ended up adding 2 x values per day on second run:
{ "_id" : "1", "type":"x", "value":1.23, date : ISODate("2013-05-21T08:00:00Z")}
{ "_id" : "2", "type":"x", "value":1.23, date : ISODate("2013-05-21T17:00:00Z")}
We only need 1 of the 2 records, so had to resort the javascript to clean up the db. Our initial approach was going to be to iterate through the results and remove any field with a time of between 6am and 11am (all duplicates were in the morning), but during implementation, made a change. Here's the script used to fix it:
var data = db.data.find({"type" : "x"})
var found = [];
while (data.hasNext()){
var datum = data.next();
var rdate = datum.date;
// instead of the next set of conditions, we could have just used rdate.getHour() and checked if it was in the morning, but this approach was slightly better...
if (typeof found[rdate.getDate()+"-"+rdate.getMonth() + "-" + rdate.getFullYear()] !== "undefined") {
if (datum.value != found[rdate.getDate()+"-"+rdate.getMonth() + "-" + rdate.getFullYear()]) {
print("DISCREPENCY!!!: " + datum._id + " for date " + datum.date);
}
else {
print("Removing " + datum._id);
db.data.remove({ "_id": datum._id});
}
}
else {
found[rdate.getDate()+"-"+rdate.getMonth() + "-" + rdate.getFullYear()] = datum.value;
}
}
and then ran it with mongo thedatabase fixer_script.js