This question is very basic, but I cannot figure out the error in my thinking. According to the author of the book "Pattern Recognition and Machine Learning", we can estimate the probability density function (PDF) of a distribution in the form of a histogram:

"simply divide $n_i$ (the number of observations in bin $i$) by the total number $N$ of observations and by the width $\Delta_i$ of the bins to obtain probability values given by"

$p(x) = \frac{n_i}{N\Delta_i}$

I simply cannot get my head around how this gives me the probability for a specific bin. $N\Delta_i$ is basically the sum of the areas of all the histogram bars. To calculate the relative frequency of one specific bin, why do we leave the width $\Delta_i$ out of the numerator? Including it would give the area of one bin divided by the whole area.

kklaw

  • By the very definition of a PDF, the $p$ are not "probability values:" they estimate probability density, which is probability per unit width. – whuber May 18 '22 at 16:23
  • I see, so basically as soon as I multiply $n_i$ by some width, I get the actual probability, which is the case I described. – kklaw May 18 '22 at 16:30
  • @kklaw Yes. You can further understand this in terms of integration, if you are familiar with calculus. – Galen May 18 '22 at 16:31
  • See https://stats.stackexchange.com/a/296602/35989 – Tim May 18 '22 at 16:50

1 Answer

One style of histogram of a sample has a vertical axis called Density, scaled so that the total area of the histogram bars is unity $(1).$ Thus, suppose you have a large sample from a population with density function $f_X(x).$ Then the histogram will tend to imitate the shape of $f_X(x).$ That is, the area of a histogram bar with base $(a,b]$ of width $\Delta = b-a$ will approximate $P(a < X \le b) = \int_a^b f_X(x)\, dx.$
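In the notation of the question, this can be written out in one line. The following is only a sketch, assuming that every one of the $N$ observations falls in exactly one bin and that bin $i$ has count $n_i$ and width $\Delta_i$:

$$p_i\,\Delta_i = \frac{n_i}{N\Delta_i}\,\Delta_i = \frac{n_i}{N}, \qquad \sum_i p_i\,\Delta_i = \frac{1}{N}\sum_i n_i = 1.$$

So the height $p_i$ is a density (probability per unit width), the bar's area $p_i\,\Delta_i$ is the relative frequency of the bin, and the total area is $1$ rather than $N\Delta_i$.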

For example, suppose x is a sample of size $n = 1000$ from a population distributed $\mathsf{Gamma}(\mathrm{shape}=3, \mathrm{rate}=1/5).$ Then we might have one of the two histograms shown below, each along with the density function $f_X(x)$ of $\mathsf{Gamma}(3,1/5).$ In this example, the population mean is $\mu = 15$ and the population standard deviation is $\sigma = \sqrt{75} \approx 8.660.$ (The code uses R, where the argument prob=T of function hist plots a density histogram, and the argument br, a partial match for breaks, suggests the number of bins.)

set.seed(2022)
x = rgamma(1000, 3, .2)
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.236   8.673  13.495  15.013  19.939  53.914 
sd(x)
[1] 8.488901

par(mfrow=c(1,2))
hist(x, prob=T, br=5, ylim=c(0,.06), col="skyblue2")
 curve(dgamma(x, 3, .2), add=T, lwd=2, col="brown")
hist(x, prob=T, ylim=c(0,.06), col="skyblue2")
 curve(dgamma(x, 3, .2), add=T, lwd=2, col="brown")
par(mfrow=c(1,1))

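[Figure: two density histograms of x, each overlaid with the Gamma(3, 1/5) density curve. Left panel: about 5 bins (width 10). Right panel: default bins (width 5).]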

In the left panel, the bin with base $(10,20]$ of width $\Delta = 10$ contains $432$ observations, has height $.0432,$ and thus area $0.432.$ According to the density function, the probability within this interval is $0.4386.$

diff(pgamma(c(10,20), 3,.2))
[1] 0.4385731

In the right panel, the bin with base $(5,10]$ of width $\Delta = 5$ contains $248$ observations, has height $.0496,$ and thus area $0.248.$ According to the density function, the probability within this interval is $0.243.$

diff(pgamma(c(5,10), 3,.2))
[1] 0.2430222
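Both panels use equal-width bins, so it may help to see why the division by the width matters when widths differ. The sketch below is not part of the original computation: the unequal breaks are hypothetical, chosen only for illustration, and plot=FALSE just suppresses the drawing. It compares the relative frequency $n_i/N$ with the density $n_i/(N\Delta_i)$ for each bin.

```r
set.seed(2022)
x <- rgamma(1000, shape = 3, rate = 0.2)
N <- length(x)

# Hypothetical unequal-width bins; the last break is chosen to cover max(x).
breaks <- c(0, 5, 10, 20, max(40, max(x) + 1))
h <- hist(x, breaks = breaks, plot = FALSE)

widths   <- diff(h$breaks)            # Delta_i
rel_freq <- h$counts / N              # n_i / N: probability estimate for the bin
density  <- h$counts / (N * widths)   # n_i / (N * Delta_i): bar height

print(data.frame(lower = head(h$breaks, -1), upper = h$breaks[-1],
                 width = widths, count = h$counts, rel_freq, density))

sum(rel_freq)                          # relative frequencies add to 1
sum(density * widths)                  # bar areas add to 1
isTRUE(all.equal(density, h$density))  # matches the heights computed by hist
```

A wide bin can collect many observations simply because it is wide. Dividing by $\Delta_i$ removes that effect, which is why the heights, and not the relative frequencies, are comparable to the curve $f_X(x).$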

Note: In R, some details of a particular histogram can be listed by making a non-plotted histogram (parameter plot=F). For the first histogram above, we have the following partial printout:

hist(x, prob=T, br=5, ylim=c(0,.06), plot=F)

$breaks
[1]  0 10 20 30 40 50 60

$counts
[1] 321 432 196  37  12   2

$density
[1] 0.0321 0.0432 0.0196 0.0037 0.0012 0.0002

$mids
[1]  5 15 25 35 45 55

...
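The relationship between the components of this printout can be checked directly. The following is a minimal sketch that regenerates the sample with the same seed (the exact draws may depend on the R version) and uses the full argument name breaks in place of br. It checks that each density is the count divided by $N\Delta,$ that the bar areas add to $1,$ and it lists each bar's area beside the exact Gamma probability of the same interval.

```r
set.seed(2022)
x <- rgamma(1000, shape = 3, rate = 0.2)
h <- hist(x, breaks = 5, plot = FALSE)

N      <- length(x)
widths <- diff(h$breaks)

# Bar height: count divided by (sample size * bin width).
manual_density <- h$counts / (N * widths)
all.equal(manual_density, h$density)

# Bar area: height * width, i.e. the fraction of the sample in each bin.
areas <- h$density * widths
sum(areas)

# Each bar's area next to the Gamma(3, 1/5) probability of the same interval.
exact <- diff(pgamma(h$breaks, shape = 3, rate = 0.2))
round(cbind(lower = head(h$breaks, -1), upper = h$breaks[-1],
            area = areas, exact = exact), 4)
```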

BruceET