4

I am trying to implement C4.5, and I am using Quinlan's book, which is also called "C4.5: Programs for Machine Learning". The pruning method proposed by C4.5 uses a pessimistic, confidence-interval-based estimate of the error at a node. Given a confidence level ($\alpha$), the total weight of the cases considered, $N$, and the total weight of the misclassified cases, $E$, the proposed estimator is the upper limit of the confidence interval for a binomial distribution with $N$ trials and $E$ errors.
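For integer counts, one common reading of this pessimistic upper limit (my interpretation, stated here only to fix notation) is the value $U$ at which the binomial tail probability equals the chosen confidence level:

$$\sum_{i=0}^{E} \binom{N}{i}\, U^{i} (1-U)^{N-i} = \alpha,$$

where whether the tail is set to $\alpha$ or $1-\alpha$ is a matter of convention.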

I fully understand that we are not really talking about a sample here, that the estimated value is more or less a heuristic, and so on. I am not interested in that; I am trying to meet all of the original specifications.

The problem I do not understand is the following: because of how C4.5 handles missing values (cases are split with fractional weights), the quantities $N$ and $E$ are often non-integer. If that is the case, how can one compute a confidence interval for a binomial?

Note: I am currently studying this article, http://www.sigmazone.com/binomial_confidence_interval.htm, to shed some light on the topic. Still, I believe it will not solve my problem.


1 Answer

3

From a Bayesian point of view, the distribution of $p$ given $k$ empirical successes in $n$ trials is a Beta distribution, specifically $p\sim \mathrm{Beta}(a,b)$ with $a=k+1$ and $b=n-k+1$. It represents the posterior density $\mathrm{prob}(p\mid \text{data})$, i.e. the probability density of the unknown parameter $p$ given the data (successes and trials) you have seen so far, under a uniform prior. (Copied from my answer to this question.)

Since the parameters $a, b$ can be any positive reals, this also applies in your special case: just substitute $E$ for $k$ and $N$ for $n$ (assuming that $E\le N$ is guaranteed).

The upper bound of the two-sided credible interval (as it is called in the Bayesian setting) is calculated as $\operatorname{qbeta}(1-\alpha/2,\,a,\,b)=1-\operatorname{qbeta}(\alpha/2,\,b,\,a)$ (source), where $\operatorname{qbeta}$ denotes the quantile function (inverse cumulative distribution function) of the Beta distribution.
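To make the formula concrete, here is a minimal sketch in Python, where SciPy's `beta.ppf` plays the role of R's `qbeta`; the function name, default arguments and example values are illustrative only:

```python
from scipy.stats import beta

def upper_credible_bound(E, N, alpha):
    """Upper bound of the two-sided (1 - alpha) credible interval for the
    error rate, using the Beta(E + 1, N - E + 1) posterior.  E and N may be
    fractional, as produced by C4.5's weighted treatment of missing values."""
    a, b = E + 1.0, N - E + 1.0
    return beta.ppf(1.0 - alpha / 2.0, a, b)  # == 1 - beta.ppf(alpha / 2, b, a)

# e.g. upper_credible_bound(E=2.4, N=14.7, alpha=0.25)
```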

steffen
  • 10,367
  • Thank you for the crystal-clear explanation. Further study of your explanation led me to the Clopper–Pearson method. There is, however, a difference: its definition in terms of the regularized incomplete beta function seems to use $a=k+1$ and $b=n-k$ (not $n-k+1$). I do not know which is correct. However, I have an implementation of the inverse regularized incomplete beta function, and with $b=n-k$ it seems to give the same results as in Quinlan's book. – rapaio Jan 02 '14 at 23:07
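For anyone comparing the two parameterizations discussed in this thread, here is a small sketch in Python/SciPy; the function name, the `plus_one` flag and the example values are mine, and both variants accept fractional $E$ and $N$:

```python
from scipy.stats import beta

def upper_limit(E, N, q, plus_one=True):
    """Quantile q of Beta(E + 1, N - E + 1) (the posterior from the answer)
    or of Beta(E + 1, N - E) (the Clopper-Pearson-style form from the
    comment, reported to reproduce the numbers in Quinlan's book)."""
    b = N - E + 1.0 if plus_one else N - E
    return beta.ppf(q, E + 1.0, b)

# compare the two choices at the same quantile level, e.g.:
# upper_limit(2.4, 14.7, 0.75, plus_one=True)
# upper_limit(2.4, 14.7, 0.75, plus_one=False)
```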