5

I've read these SO questions and this is not a duplicate of them.


I've collected statistical data about people

The data hierarchy is like this:

  • Region
    • Street
      • Building number
      • Entrance number
        • [Statistical package]

A [Statistical package] contains (in this example):

  • floor (storey) number
  • UUID (identifying the flat)
  • Religion
  • Presence of a toilet

What algorithm or procedure should I use to discover anomalies like the ones below? Or, what statistical programming framework should I use?
(Including what the best underlying technology is: SQL or a document-oriented DB, an interpreted or compiled language, and so on.)

1-a :: Only one floor (of all floors in a building) has no toilets
1-b :: One flat (UUID) has no toilet although all other flats in the entrance/building have at least one
2-a :: There is one flat claiming Religion X although the whole Region has Religions Y and Z
2-b :: There is one building claiming Religion X although the whole Region has Religions Y and Z

But this is only an example using a limited number of Statistical package attributes; I need to find many types of anomalies across around 15 attributes in every Statistical package.

Note: this question is not about how to find anomalies for the provided examples; those examples are just illustrative. I'm looking for a common solution/algorithm.

Thanks in advance for any response.

  • If you want your question to be migrated, flag it for moderator attention and write it into the text field. (This is preferable to reposting it.) –  Aug 31 '11 at 01:49
  • I second that: I recommend moving to stats.SE. – Iterator Sep 02 '11 at 20:35

1 Answer

5

I would use a relational database that has OLAP features, arranging the data in a star schema like so:

Fact: UUID
Dimensions: Region, Street, Building number, Entrance number, Floor (storey) number, Religion, Presence of a toilet

Then I would make a view over it with a large number of features: the predominant religion per region and per building, toilet presence per floor/building, etc.

Vector: UUID; Dimensions: Region, Street, ...; Features: average per X, max per Y, etc.
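Such a feature view can be sketched outside the database as well. Here is a minimal illustration with pandas instead of SQL; all column names and the toy records are my own assumptions, not part of the question's actual schema:

```python
import pandas as pd

# Hypothetical flat-level fact table; columns and values are illustrative.
flats = pd.DataFrame({
    "uuid":       ["a", "b", "c", "d"],
    "region":     ["R1", "R1", "R1", "R1"],
    "building":   [1, 1, 2, 2],
    "floor":      [1, 2, 1, 2],
    "has_toilet": [1, 1, 1, 0],
    "religion":   ["X", "X", "X", "Y"],
})

# Per-flat feature: share of flats in the same building that have a toilet.
flats["toilet_rate_building"] = (
    flats.groupby("building")["has_toilet"].transform("mean")
)

# Per-flat feature: how common the flat's religion is within its region.
religion_count = flats.groupby(["region", "religion"])["uuid"].transform("count")
region_count = flats.groupby("region")["uuid"].transform("count")
flats["religion_share_region"] = religion_count / region_count
```

Each row (flat) now carries aggregate context from its building and region, so a flat that is unusual relative to its surroundings (e.g. the only flat without a toilet, or the only one with a rare religion) gets extreme feature values.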

Now I have a big vector space to which I can easily apply common anomaly-detection algorithms.

For example, let's say that the training-set size m is comfortably larger than the number of features n (a common rule of thumb is m > 10n, so that the covariance matrix is invertible and well estimated) and that we have a reasonably powered computer, so we can apply multivariate Gaussian probability density estimation.

For our training vectors

\begin{align*} {x^{(i)}} \in \mathbb{R}^n, \quad i = 1, \dots, m \end{align*}

Our probability function is: \begin{align*} p(x, \mu, \Sigma)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\bigg(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\bigg) \end{align*}

So we need to fit the parameters:

\begin{align*} \mu=\frac{1}{m}\sum_{i=1}^mx^{(i)} \space , \space \Sigma=\frac{1}{m}\sum_{i=1}^m(x^{(i)}-\mu)(x^{(i)}-\mu)^T \end{align*}

Now, that we can compute $p(x, \mu, \Sigma)$ we can flag a fact as anomalous if:

\begin{align*} p(x, \mu, \Sigma)<\epsilon \end{align*}

By varying $\epsilon$ we enlarge or restrict the class of facts flagged as anomalous, and for really small values of $\epsilon$ we will find only the farthest outliers (assuming there are any).

All there is to do now is vary $\epsilon$ and analyze different results.
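The fitting and flagging steps above can be sketched with NumPy (function names are my own; in practice `scipy.stats.multivariate_normal` does the density computation for you):

```python
import numpy as np

def fit_gaussian(X):
    """Fit the MLE mean vector and covariance matrix to training rows X (m x n)."""
    mu = X.mean(axis=0)
    centered = X - mu
    sigma = centered.T @ centered / X.shape[0]
    return mu, sigma

def gaussian_density(X, mu, sigma):
    """Multivariate Gaussian density p(x; mu, sigma) for each row of X."""
    n = mu.shape[0]
    inv = np.linalg.inv(sigma)
    det = np.linalg.det(sigma)
    diff = X - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), computed row-wise.
    exponent = -0.5 * np.sum((diff @ inv) * diff, axis=1)
    return np.exp(exponent) / ((2 * np.pi) ** (n / 2) * np.sqrt(det))

def flag_anomalies(X, mu, sigma, eps):
    """Flag rows whose estimated density falls below the threshold epsilon."""
    return gaussian_density(X, mu, sigma) < eps
```

A quick sanity check: fit on a few hundred normal vectors, then score a point far from the mean; its density will be orders of magnitude below anything in the training set, so even a tiny $\epsilon$ flags it.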

clyfe
  • 790