Example: I have a list of 20,000 dates, such as "X was modified on", or "X was created on". But it's possible one of the dates is being used as a "default" or "null" value.

How would I find the default/null value without already knowing it?

To frame the question another way:

  • How do I determine if one of the dates is occurring unusually often?

  • Or, how do I determine that the quantity of one date is unusual?

It seems like the probability density function should fit in here somewhere. Is there a way to determine the distribution of the dates without knowing beforehand if they're "normal" or "uniform"?

Sam Porch
  • If you know that null corresponds to 01.01.1990, then simply filter it out. If you don't know which date it is, one simple way (which assumes null is the most frequent date) would be to compute a histogram of the dates (that is, an occurrence vector with T entries, where T is the number of unique dates) and choose the bin with the highest count. Or you can run a clustering algorithm and select the largest cluster, which will probably correspond to more or less the same thing. – jeff Oct 22 '15 at 01:31
  • I'd also start with exploratory techniques. What are the frequencies of the 20 most popular dates? Look at the top 4 or 5 and see if they look weird in some obvious way. – Glen_b Oct 22 '15 at 03:23
  • Definitely, filling in default dates is a common practice. How many unique dates do you have? If there are not too many, e.g. fewer than your screen width in pixels, you should be able to spot anomalies by eye on a frequency plot. Of course, this will work only if the underlying data is regular enough ;) Also, it may be obvious, but plot the frequencies with the dates in sorted order, to keep the time-series structure, which should be visible (e.g. weekends). If you are lucky, the default date is 1 Jan or another holiday, when very few people work. – Nicola Mingotti Sep 16 '19 at 10:45
  • A relevant example is at https://stats.stackexchange.com/questions/80738/what-is-the-probability-that-a-person-will-die-on-their-birthday/336676#336676, but there the outlier (default) is known. The plot there hints at one possibility: groups of clustered outliers might be real, while a totally isolated outlier hints at something else ... – kjetil b halvorsen Jun 27 '22 at 16:30
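
The exploratory approach suggested in the comments above (count how often each date occurs, look at the most frequent ones, and check whether the top count is far out of line with the rest) could be sketched roughly as follows. This is only a minimal illustration, assuming the dates are held in a Python list of hashable values such as strings or `datetime.date` objects. The function names `top_dates` and `flag_suspicious`, the robust z-score based on the median and MAD, and the threshold of 10 are all assumptions made for this example, not something proposed in the thread.

```python
from collections import Counter
from statistics import median


def top_dates(dates, k=20):
    """Return the k most frequent dates as (date, count), most frequent first."""
    return Counter(dates).most_common(k)


def flag_suspicious(dates, threshold=10.0):
    """Return (date, count, score) for dates whose count is far above typical.

    The score is a robust z-score: (count - median) / (1.4826 * MAD).
    The threshold is an arbitrary assumption and should be tuned by eye.
    """
    counts = Counter(dates)
    values = list(counts.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    # Fall back to a scale of 1 when more than half the counts are identical.
    scale = 1.4826 * mad if mad > 0 else 1.0
    flagged = []
    for date, count in counts.items():
        score = (count - med) / scale
        if score > threshold:
            flagged.append((date, count, score))
    # Most frequent candidates first.
    flagged.sort(key=lambda item: item[1], reverse=True)
    return flagged
```

A date returned by `flag_suspicious` is only a candidate for a default value. As the comments note, real data can have genuine spikes (and dips on weekends or holidays), so the flagged dates still need to be inspected by hand or on a frequency plot.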

0 Answers