In Andrew Ng's Coursera class on Machine Learning, we learned to use a Gaussian distribution

$p(x)=\prod_{j=1}^{n} p(x_j;\mu_j,\sigma^2_j)$

to detect anomalous examples when $p(x)<\epsilon$, where $x_j$ are the features of some test example (assumed independent), $\mu_j$ and $\sigma^2_j$ are the mean and variance of feature $j$ estimated from a training set consisting of non-anomalous examples, and $\epsilon$ is a threshold chosen from a cross-validation set.
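For concreteness, here is a minimal sketch of that detector in plain Python. The function names (`fit_gaussians`, `log_density`, `is_anomaly`) are illustrative, not from the course. It assumes the variance is estimated with a $1/m$ divisor, and it compares $\log p(x)$ with $\log\epsilon$, which is equivalent to $p(x)<\epsilon$ but avoids underflow when many small densities are multiplied.

```python
import math

def fit_gaussians(train):
    """Per-feature mean and variance from rows of non-anomalous examples."""
    m = len(train)
    cols = list(zip(*train))  # one tuple per feature
    mu = [sum(c) / m for c in cols]
    # Assumption: maximum-likelihood variance (divide by m, not m - 1).
    var = [sum((v - mj) ** 2 for v in c) / m for c, mj in zip(cols, mu)]
    return mu, var

def log_density(x, mu, var):
    """log p(x) = sum over j of log N(x_j; mu_j, var_j), features treated as independent."""
    return sum(
        -0.5 * math.log(2 * math.pi * v) - (xj - mj) ** 2 / (2 * v)
        for xj, mj, v in zip(x, mu, var)
    )

def is_anomaly(x, mu, var, epsilon):
    """Flag x when p(x) < epsilon, tested in log space."""
    return log_density(x, mu, var) < math.log(epsilon)
```

A part would be screened with `is_anomaly(x, mu, var, epsilon)`, where `mu`, `var`, and `epsilon` are the reported values.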

I have a data set where I know all of the anomalous examples. I randomly assigned 60% of the good examples to the training set to calculate $\mu_j$ and $\sigma^2_j$, and 40% of the good examples to a cross-validation set to calculate the value of $\epsilon$ that maximizes the $F$ score. All of the bad examples went into the cross-validation set. I was able to determine the particular features $x_j$ to use based on my knowledge of the problem and the results on the cross-validation set.
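The threshold search on a single cross-validation set can be sketched as below. This is only an illustration: `f1_score` and `best_threshold` are hypothetical helper names, the $F$ score is taken to be $F_1$, and thresholds are expressed as $\log\epsilon$ so they can be compared with log-densities directly.

```python
def f1_score(y_true, y_pred):
    """F1 with 'anomalous' as the positive class; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(log_p, y_true, candidates):
    """Return (log_epsilon, F1) for the candidate with the highest F1.

    log_p      : log p(x) for each cross-validation example
    y_true     : True for known anomalies, False for good examples
    candidates : candidate values of log(epsilon) to try
    """
    best = (None, -1.0)
    for log_eps in candidates:
        y_pred = [lp < log_eps for lp in log_p]  # anomaly when p(x) < epsilon
        score = f1_score(y_true, y_pred)
        if score > best[1]:
            best = (log_eps, score)
    return best
```

Ties are broken in favor of the first candidate tried, which is an arbitrary choice of this sketch.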

Now, I need to provide $\mu_j$ and $\sigma^2_j$ and $\epsilon$ as official values for screening all incoming parts in the manufacturing process at our company.

The problem is there is some variability in $\epsilon$ (and the other statistics) based on how I randomly divide between the training and cross-validation set. I have read about K-fold cross-validation and bootstrap cross-validation.

If I use either of these methods to generate a set of training sets and cross-validation sets, what parameters $\mu_j$ and $\sigma^2_j$ and $\epsilon$ should I report as the official numbers? The average over all partitions?

Should I even generate a set of training sets and cross-validation sets? Should I just pick one with the highest $F$ score and be done?

Például

1 Answer

Cross-validation is a good method. Your final $\mu_j$ and $\sigma_j^2$ should be based on the whole data set.

  • So, I generate a bunch of training set, cross-validation set pairs, and use the cross-validation sets to choose an $\epsilon$, and then base my final $\mu_j$ and $\sigma^2_j$ on the entire dataset? – Például Jun 12 '15 at 00:14
  • Can you expand on this a little? – gung - Reinstate Monica Jun 12 '15 at 00:58
  • @gung I am guessing from further reading that I will split my data set into k folds. For each of the k folds, I will compute $\mu_j$ and $\sigma^2_j$ from the remaining k-1 folds. For each value of $\epsilon$ in a given range, I will compute the average of the F scores on the cross-validation sets, and choose the threshold $\epsilon$ which gives the highest average F score. That will be the threshold that I report. Then I will recompute $\mu_j$ and $\sigma^2_j$ on the entire data set and use these as my official $\mu_j$ and $\sigma^2_j$. – Például Jun 12 '15 at 01:26
  • @Például Yes, that's the typical method. – Danica Jun 12 '15 at 04:11
  • @Kristofersen, why don't you expand your answer? As it stands now it is correct (at least the second sentence -- the first sentence is a bit off topic) but not very useful. – Richard Hardy Jun 12 '15 at 08:25
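
The k-fold procedure described in the comments above could look roughly like the following sketch. It rests on several assumptions not stated in the thread: only the good examples are split into folds, every known anomaly is scored in every fold (mirroring the original setup, where all bad examples went into the cross-validation set), the $F$ score is $F_1$, and thresholds are searched on a grid of $\log\epsilon$ values. All function names are hypothetical.

```python
import math
import random

def fit(rows):
    """Per-feature mean and variance (1/m divisor) from a list of rows."""
    m = len(rows)
    cols = list(zip(*rows))
    mu = [sum(c) / m for c in cols]
    var = [sum((v - mj) ** 2 for v in c) / m for c, mj in zip(cols, mu)]
    return mu, var

def log_p(x, mu, var):
    """Sum of per-feature Gaussian log-densities."""
    return sum(-0.5 * math.log(2 * math.pi * v) - (xj - mj) ** 2 / (2 * v)
               for xj, mj, v in zip(x, mu, var))

def f1(y_true, y_pred):
    """F1 with 'anomalous' as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

def kfold_threshold(good, bad, log_eps_grid, k=5, seed=0):
    """Choose log(epsilon) by mean F1 over k folds, then refit on all good rows."""
    idx = list(range(len(good)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]      # k disjoint index sets
    mean_f1 = [0.0] * len(log_eps_grid)
    for held_out in folds:
        held = set(held_out)
        mu, var = fit([good[i] for i in idx if i not in held])
        # Validation set: held-out good rows plus all known anomalies (assumption).
        val_x = [good[i] for i in held_out] + list(bad)
        val_y = [False] * len(held_out) + [True] * len(bad)
        scores = [log_p(x, mu, var) for x in val_x]
        for g, log_eps in enumerate(log_eps_grid):
            pred = [s < log_eps for s in scores]
            mean_f1[g] += f1(val_y, pred) / k
    best = max(range(len(log_eps_grid)), key=mean_f1.__getitem__)
    mu, var = fit(good)                        # final parameters from all good data
    return log_eps_grid[best], mean_f1[best], mu, var
```

The returned threshold is on the log scale, so the reported $\epsilon$ would be `math.exp(log_eps)`. Splitting the anomalies across folds instead is an equally defensible variant; the thread does not say which is intended.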