7

Here is a block diagram of the procedure I use when I want to verify that my model is real.

[Block diagram: 12-fold cross-validation procedure for checking model stability]

In each round, a fold consisting of 11/12 of the data is used to build the model (e.g. the eigenvectors of the PCA).

After 12 rounds I check that the models (e.g. the eigenvectors of the PCA) from each round are not statistically different, and if they indeed aren't, I declare the model stable. The idea is that if a model is stable in a k-fold sense, that is some indication that the model is real.
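The fold-stability check described above can be sketched in code. This is a minimal illustration, not the original poster's exact procedure: it fits the first principal component on each of 12 leave-one-fold-out training sets and compares the resulting eigenvectors by absolute cosine similarity (absolute, because eigenvectors are only defined up to sign).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with one dominant direction of variance.
X = rng.normal(size=(600, 5))
X[:, 0] += 3 * X[:, 1]  # correlate two features to create a strong first PC

def first_pc(data):
    """Leading eigenvector of the covariance matrix (first PCA component)."""
    cov = np.cov(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, -1]  # eigh returns eigenvalues in ascending order

k = 12
folds = np.array_split(np.arange(len(X)), k)
components = []
for i in range(k):
    # Train on everything except fold i (the 11/12 "fold" from the question).
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    components.append(first_pc(X[train_idx]))

# Compare each round's component against the full-data component.
reference = first_pc(X)
similarities = [abs(np.dot(c, reference)) for c in components]
print(min(similarities))  # close to 1.0 suggests the component is stable
```

A real statistical test of "not statistically different" would need more than a similarity threshold, but this shows the mechanics of the round-by-round comparison.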

I more or less came up with this myself (mainly the last verification step), so I would like to know: what do you think about it? Are you aware of other ways to "verify that my model is real"?

Dov
  • 1,810

1 Answer

5

The stability of your results depends on the number of data points you use to estimate your model's parameters, not so much on whether your model captures reality.

Take, for example, a simple univariate Gaussian distribution with a fixed variance and varying mean (a statistical model). The variance of the estimated mean will go down as 1/N, where N is the number of data points in your training set.
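The 1/N scaling above is easy to check empirically. The following sketch (my illustration, with arbitrary sample sizes) repeatedly draws N points from a fixed Gaussian and measures the variance of the sample mean:

```python
import numpy as np

rng = np.random.default_rng(1)
true_mean, sigma = 0.0, 1.0

# Empirical variance of the sample mean for increasing N.
# It should shrink roughly as sigma**2 / N, regardless of model "realness".
for n in (10, 100, 1000):
    means = [rng.normal(true_mean, sigma, size=n).mean() for _ in range(2000)]
    print(n, np.var(means))  # approximately sigma**2 / n
```

This is the answer's point: the stability you observe across folds is driven largely by N, so stable estimates alone do not show that the model captures reality.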

For statistical models, a distance measure of great interest is the Kullback-Leibler divergence between your model's distribution and the true distribution of the data. Unfortunately, the KL divergence requires knowledge of the true distribution and is therefore not very practical. An alternative would be the differential log-likelihood (see Mixture density modeling, Kullback–Leibler divergence, and differential log-likelihood, van Hulle, 2004). But there are infinitely many possible distance measures you could use; which one you should choose depends on what you are going to use the model for.
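To make the impracticality concrete: the KL divergence is only computable in closed form here because both distributions are assumed to be known univariate Gaussians, which is exactly the knowledge of the true distribution you lack in practice. A small sketch using the standard closed-form expression:

```python
import numpy as np

def kl_gauss(mu_p, s_p, mu_q, s_q):
    """Closed-form KL(p || q) for univariate Gaussians p = N(mu_p, s_p**2)
    and q = N(mu_q, s_q**2)."""
    return np.log(s_q / s_p) + (s_p**2 + (mu_p - mu_q)**2) / (2 * s_q**2) - 0.5

# Divergence from a hypothetical true distribution N(0, 1)
# to a model distribution N(0.5, 1): only the means differ,
# so KL reduces to (mu_p - mu_q)**2 / 2 = 0.125.
print(kl_gauss(0.0, 1.0, 0.5, 1.0))
```

With real data you would have to estimate this from samples instead, which is what motivates likelihood-based alternatives such as the differential log-likelihood mentioned above.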

Lucas
  • 6,162