5

I've read these SO questions and this is not a duplicate of them.


I've collected statistical data about people

The data hierarchy is like this:

  • Region
    • Street
      • Building number
      • Entrance number
        • [Statistical package]

A [Statistical package] contains (in this example):

  • floor (storey) number
  • UUID (identifying the flat)
  • Religion
  • Presence of a toilet

What algorithm or procedure should I use to discover anomalies like the ones below? Or, what statistical programming framework should I use?
(Including what the best underlying technology is: SQL or a document-oriented DB, an interpreted or compiled language, and so on.)

1-a :: Only one floor (of all floors in a building) has no toilets
1-b :: One flat (UUID) has no toilet although all other flats in the entrance/building have at least one
2-a :: There is one flat claiming Religion X although the whole Region has Religions Y and Z
2-b :: There is one building claiming Religion X although the whole Region has Religions Y and Z

But this is only an example using a limited number of Statistical package attributes; I need to find many types of anomalies across around 15 attributes in every Statistical package.

Note: this question is not about how to find anomalies for the provided examples; those examples are just illustrative. I'm looking for a common solution/algorithm.

Thanks in advance for any response.

  • If you want your question to be migrated, flag it for moderator attention and write it into the text field. (This is preferable to reposting it.) –  Aug 31 '11 at 01:49
  • I second that: I recommend moving to stats.SE. – Iterator Sep 02 '11 at 20:35

1 Answer

5

I would use a relational database that has OLAP features, arranging the data in a star schema like so:

Fact: UUID
Dimensions: Region, Street, Building number, Entrance number, Floor (storey) number, Religion, Presence of a toilet

Then I would make a view over it with a large number of features: the predominant religion per region and per building, toilet presence per floor/building, etc.

Vector: UUID; Dimensions: Region, Street, ...; Features: average per X, max per Y, etc.
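Such a feature view can be sketched outside the database as well. Here is a minimal illustration with pandas instead of SQL; all column names and the toy records are my own assumptions, not part of the question's actual schema:

```python
import pandas as pd

# Hypothetical flat-level fact table; columns and values are illustrative.
flats = pd.DataFrame({
    "uuid":       ["a", "b", "c", "d"],
    "region":     ["R1", "R1", "R1", "R1"],
    "building":   [1, 1, 2, 2],
    "floor":      [1, 2, 1, 2],
    "has_toilet": [1, 1, 1, 0],
    "religion":   ["X", "X", "X", "Y"],
})

# Per-flat feature: share of flats in the same building that have a toilet.
flats["toilet_rate_building"] = (
    flats.groupby("building")["has_toilet"].transform("mean")
)

# Per-flat feature: how common the flat's religion is within its region.
religion_count = flats.groupby(["region", "religion"])["uuid"].transform("count")
region_count = flats.groupby("region")["uuid"].transform("count")
flats["religion_share_region"] = religion_count / region_count
```

Each row (flat) now carries aggregate context from its building and region, so a flat that is unusual relative to its surroundings (e.g. the only flat without a toilet, or the only one with a rare religion) gets extreme feature values.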

Now I have a big vector space to which I can easily apply common anomaly-detection algorithms.

For example, let's say that the training-set size m is comfortably larger than the number of features n (a common rule of thumb is m > 10n, so that the covariance matrix is invertible and well estimated) and that we have a reasonably powered computer, so we can apply multivariate Gaussian probability density estimation.

For our training vectors

\begin{align*} {x^{(i)}} \in \mathbb{R}^n, \quad i = 1, \dots, m \end{align*}

Our probability function is: \begin{align*} p(x, \mu, \Sigma)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\bigg(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\bigg) \end{align*}

So we need to fit the parameters:

\begin{align*} \mu=\frac{1}{m}\sum_{i=1}^mx^{(i)} \space , \space \Sigma=\frac{1}{m}\sum_{i=1}^m(x^{(i)}-\mu)(x^{(i)}-\mu)^T \end{align*}

Now, that we can compute $p(x, \mu, \Sigma)$ we can flag a fact as anomalous if:

\begin{align*} p(x, \mu, \Sigma)<\epsilon \end{align*}

By varying $\epsilon$ we enlarge or restrict the class of facts flagged as anomalous, and for really small values of $\epsilon$ we will find only the farthest outliers (assuming there are any).

All there is to do now is vary $\epsilon$ and analyze different results.
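The fitting and flagging steps above can be sketched with NumPy (function names are my own; in practice `scipy.stats.multivariate_normal` does the density computation for you):

```python
import numpy as np

def fit_gaussian(X):
    """Fit the MLE mean vector and covariance matrix to training rows X (m x n)."""
    mu = X.mean(axis=0)
    centered = X - mu
    sigma = centered.T @ centered / X.shape[0]
    return mu, sigma

def gaussian_density(X, mu, sigma):
    """Multivariate Gaussian density p(x; mu, sigma) for each row of X."""
    n = mu.shape[0]
    inv = np.linalg.inv(sigma)
    det = np.linalg.det(sigma)
    diff = X - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), computed row-wise.
    exponent = -0.5 * np.sum((diff @ inv) * diff, axis=1)
    return np.exp(exponent) / ((2 * np.pi) ** (n / 2) * np.sqrt(det))

def flag_anomalies(X, mu, sigma, eps):
    """Flag rows whose estimated density falls below the threshold epsilon."""
    return gaussian_density(X, mu, sigma) < eps
```

A quick sanity check: fit on a few hundred normal vectors, then score a point far from the mean; its density will be orders of magnitude below anything in the training set, so even a tiny $\epsilon$ flags it.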

clyfe
  • 790