
The question I am asking has its reference listed below:

Since I am learning probability properly for the first time, there is a lot of confusion that I want to clear up. The journey from discrete to continuous random variables and distributions is not smooth (no pun intended) at all. Even after reading many posts, I cannot fully grasp it. My main confusion lies in the interpretation of what a probability density function (PDF) at a point $x$ is. Here is my own thought process; it may not be rigorous, but I'd appreciate it if someone could tell me whether I am going in the right direction.


Problem

We begin with an example:

The masses of 200 six-month-old babies are collected. To visualise the distribution of the mass of a six-month-old baby (which is a continuous random variable), let us for a moment treat these 200 data points as the probability distribution$^1$ of all the babies' masses. Then, we can say that $X$ is a random variable that represents the mass of a (randomly drawn) baby.

Density Histogram

The dataset has 200 samples.

We can then group the data into 5 classes with a class width of 1 kg each. The frequency table and the histogram corresponding to the data are shown below.

[Figure: frequency table]

[Figure: histogram]

We now try to connect a density histogram to the PDF and its integral. Let $X$ be the continuous random variable defined earlier.

  1. The area under a density histogram is 1; this follows because the relative frequencies must add up to 1.
  2. The height (density) represents how dense the population is in that particular interval/bin. For bins of equal width, the larger the area, the denser the population (think density).
  3. For each interval/bin, the area inside that rectangle is simply the relative frequency (i.e. $\text{width of interval} \times \text{density}$).
  4. Therefore, we can think of it this way: the area inside each interval/bin is the probability of $X$ being in that interval.
  5. Consequently, when $a$ and $b$ lie in the same bin, we can define $\mathbb{P} [a < X \leq b]$ to be the area of the rectangle between $a$ and $b$, i.e. $\mathbb{P} [a < X \leq b] = (b-a) \times \text{density}$; for an interval spanning several bins, we add up the areas bin by bin.

Note that we defined this without using integrals.
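To make the list above concrete, here is a minimal sketch in plain Python. It assumes the bin counts implied by the data-generating code in the Appendix (20, 48, 80, 36 and 16 points in the 1 kg bins from 5 to 10); the helper `prob_within_bin` is a hypothetical name introduced only for this illustration.

```python
# Bin counts taken from the data-generating code in the Appendix
# (20, 48, 80, 36, 16 points in the 1 kg bins from 5 to 10).
edges = [5, 6, 7, 8, 9, 10]
counts = [20, 48, 80, 36, 16]
n = sum(counts)  # 200 data points in total

# Relative frequency of each bin, and density = relative frequency / bin width.
widths = [hi - lo for lo, hi in zip(edges[:-1], edges[1:])]
rel_freq = [c / n for c in counts]
density = [rf / w for rf, w in zip(rel_freq, widths)]

# The area of each rectangle (width * density) recovers the relative frequency,
# and the areas add up to 1.
areas = [w * d for w, d in zip(widths, density)]
total_area = sum(areas)


def prob_within_bin(a, b):
    """P[a < X <= b] when (a, b] lies inside a single bin: (b - a) * density."""
    for lo, hi, d in zip(edges[:-1], edges[1:], density):
        if lo <= a and b <= hi:
            return (b - a) * d
    raise ValueError("(a, b] must lie inside one bin")


for lo, hi, rf, d in zip(edges[:-1], edges[1:], rel_freq, density):
    print(f"({lo}, {hi}]: relative frequency = {rf:.2f}, density = {d:.2f}")
print("total area =", round(total_area, 10))
print("P[7 < X <= 8]   =", prob_within_bin(7, 8))
print("P[7 < X <= 7.5] =", prob_within_bin(7, 7.5))
```

Because every bin here has width 1, the density of a bin is numerically equal to its relative frequency; the two only differ once the bin width changes.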


Now this definition seems quite coarse, because we are restricted to fixed intervals, i.e. bins of width 1 between 5 and 10, and we can only compute probabilities for these discretized bins.

What if we want to find $\mathbb{P}[5 < X < 5.25]$? Well, we can refine our bins further so that each has a width of 0.25!

[Figure: histogram with bins of width 0.25]

Now the problem is that the bins are still discretized and hence do not fit the definition of continuous: we can keep asking for a smaller interval on the real line (e.g. what is $\mathbb{P}[5.00000001 < X \leq 5.00000002]$?). We then have to keep shrinking the bin width so that the area of the rectangle still recovers the relative frequency (probability).

Define the number of bins to be $k$ and the width of a bin to be $\frac{5}{k}$, where $5 = 10 - 5$ is the length of the range. We can then solve this problem by letting $k \to \infty$: this will eventually smooth out the whole histogram into infinitely many bins and recover a smooth function $f$, which turns out to be our PDF.
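The refinement can be sketched as follows. This is only an illustration under stated assumptions: it recreates a dataset like the one in the Appendix with the standard-library `random` module instead of NumPy, the seed is arbitrary, and `density_histogram` is a hypothetical helper rather than a library function.

```python
import random

# Recreate a dataset like the one in the Appendix with the standard library
# (assumption: the seed and the use of `random` instead of NumPy are my choice).
rng = random.Random(0)
spec = [(5, 6, 20), (6, 7, 48), (7, 8, 80), (8, 9, 36), (9, 10, 16)]
population = [rng.uniform(lo, hi) for lo, hi, size in spec for _ in range(size)]


def density_histogram(data, lo, hi, k):
    """Return (width, densities) for k equal-width bins on [lo, hi]."""
    width = (hi - lo) / k
    counts = [0] * k
    for x in data:
        # Index of the bin containing x; clamp so x == hi falls in the last bin.
        i = min(int((x - lo) / width), k - 1)
        counts[i] += 1
    densities = [c / (len(data) * width) for c in counts]
    return width, densities


# However many bins we use, the total area of the density histogram stays 1.
for k in (5, 20, 80):
    width, dens = density_histogram(population, 5, 10, k)
    area = sum(width * d for d in dens)
    print(f"k = {k:3d}, width = {width:.4f}, total area = {area:.6f}, "
          f"max density = {max(dens):.3f}")

# With k = 20 the first bin covers 5 to 5.25, so its area estimates P[5 < X < 5.25].
width, dens = density_histogram(population, 5, 10, 20)
p_first = width * dens[0]
print("estimated P[5 < X < 5.25] =", p_first)
```

One caveat worth keeping in mind: with only 200 data points, very large $k$ leaves most bins empty, so the smooth limit is a statement about the underlying population rather than about this finite sample.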

We can use seaborn to plot something like this with 20 bins.

[Figure: seaborn histogram with 20 bins]

PDF at a point $x$

Finally, I try to connect back to what the PDF at a point $x$ is, and attempt to explain why the PDF is also the probability per unit length, as well as why it is a "density".

We now ask: what exactly does the PDF at a point $x$ mean? In other words, if $f_X(x)$ is not the probability, then what is it? We established that the probability at a point $x$ is $0$, because integrating over a single point (an interval of zero width) gives 0 (this can also be shown by contradiction, by summing over an infinite number of bins).

But to me it does not make sense if this number has no meaning. The textbook says that $f_X$ at a point $x$ is the "probability mass per unit length$^2$ around $x$ within a small neighbourhood $\delta$".

This means that if we define an infinitesimally small interval of width $\delta$ next to $x$ (here we just use the right-hand side, $x^{+}$), we have an interval $(x, x+\delta]$. We then ask for the probability of this interval, which is $\mathbb{P}[x < X \leq x + \delta]$.

Now recall that we showed earlier that integration can be interpreted approximately as a sum, which connects to the histograms we plotted. If the interval/bin width is very small, then we can say that the probability of this interval is $\mathbb{P}[x < X \leq x + \delta] \approx \delta f_X(x)$. Note carefully that the interval must be small; otherwise the "area" $\delta f_X(x)$ won't be close to the probability of that interval.

Rearranging will get me:

$$ \mathbb{P}[x < X \leq x + \delta] \approx \delta f_X(x) \implies f_X(x) \approx \dfrac{\mathbb{P}[x < X \leq x + \delta]}{\delta} $$

Then we can see that the PDF at a point $x$ is the probability per $\delta$, which translates to the probability mass per unit length around $x$ within the small neighbourhood $\delta$. One can think of it as how densely the probability is packed around the point $x$: if $f_X(x)$ is larger than at other points, then the probability of $X$ falling in a small neighbourhood of $x$ is larger than for neighbourhoods of the same size elsewhere. More concretely, for the same interval length $\delta$, a larger $f_X$ means a higher probability.
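The ratio $\mathbb{P}[x < X \leq x + \delta] / \delta$ can be checked numerically. The sketch below assumes, purely for illustration, that the population density is piecewise constant with the bin proportions from the Appendix (uniform within each 1 kg bin); `pdf`, `cdf` and `prob_per_unit_length` are hypothetical helper names.

```python
# Piecewise-constant density implied by the Appendix proportions
# (assumption for illustration: X is uniform within each 1 kg bin).
edges = [5, 6, 7, 8, 9, 10]
density = [0.10, 0.24, 0.40, 0.18, 0.08]


def pdf(x):
    """f_X(x): height of the bin containing x (0 outside the range 5 to 10)."""
    for lo, hi, d in zip(edges[:-1], edges[1:], density):
        if lo <= x < hi:
            return d
    return 0.0


def cdf(x):
    """F_X(x) = P[X <= x]: accumulated area to the left of x."""
    total = 0.0
    for lo, hi, d in zip(edges[:-1], edges[1:], density):
        # Length of the part of this bin that lies to the left of x.
        total += d * (min(max(x, lo), hi) - lo)
    return total


def prob_per_unit_length(x, delta):
    """P[x < X <= x + delta] / delta."""
    return (cdf(x + delta) - cdf(x)) / delta


x = 7.5
for delta in (2.0, 1.0, 0.5, 0.1, 0.001):
    print(f"delta = {delta:<6} ratio = {prob_per_unit_length(x, delta):.4f}")
print("f_X(7.5) =", pdf(x))
```

For $\delta = 1$ the interval $(7.5, 8.5]$ spills into the next bin, so the ratio is $0.29$ rather than $f_X(7.5) = 0.40$; once $\delta \leq 0.5$ the interval stays inside one bin and the ratio matches the density.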

Furthermore, the probability density function can be greater than 1, as long as it obeys that the area under the curve is 1.
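As a quick check of this claim, here is a sketch using a normal density with a small standard deviation. The values of `mu` and `sigma` are hypothetical and chosen only so that the peak exceeds 1; the area is approximated with a simple midpoint Riemann sum.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


mu, sigma = 7.5, 0.1  # hypothetical values chosen only for illustration

# Height of the density at its peak; this is a density, not a probability.
peak = normal_pdf(mu, mu, sigma)

# Midpoint Riemann sum of the density over mu +/- 10 sigma.
n = 100_000
lo, hi = mu - 10 * sigma, mu + 10 * sigma
dx = (hi - lo) / n
area = sum(normal_pdf(lo + (i + 0.5) * dx, mu, sigma) for i in range(n)) * dx

print("density at the peak:", peak)
print("area under the curve:", area)
```

The peak is well above 1 because the mass is concentrated on a short stretch of the axis, while the area (the total probability) is still 1 up to numerical error.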

Note that the above did not use the definition of $f_X$ in terms of integrals. We complete the intuition as follows:

$$ \mathbb{P}[x < X \leq x+\delta] = \int_x^{x+\delta} f_X(t) \, dt \approx \delta f_X(x) \approx f_X(x) \, dx $$
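To see how good the approximation $\delta f_X(x)$ is, the sketch below compares it with the exact interval probability for shrinking $\delta$. It assumes a normal model for the mass (the values of `mu` and `sigma` are hypothetical) so that the exact probability is available through `math.erf`; `compare` is a helper name introduced only for this example.

```python
import math

mu, sigma = 7.5, 1.0  # hypothetical normal model for the mass in kg


def pdf(x):
    """Normal density f_X(x)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def cdf(x):
    """Normal distribution function F_X(x) = P[X <= x], via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))


def compare(x, delta):
    """Return (exact interval probability, delta * f_X(x), relative error)."""
    exact = cdf(x + delta) - cdf(x)  # equals the integral of f_X over (x, x + delta]
    approx = delta * pdf(x)
    return exact, approx, abs(approx - exact) / exact


x = 8.0
for delta in (1.0, 0.1, 0.01, 0.001):
    exact, approx, rel_err = compare(x, delta)
    print(f"delta = {delta:<6} exact = {exact:.6f}  "
          f"delta*f = {approx:.6f}  rel. error = {rel_err:.4%}")
```

The relative error shrinks roughly in proportion to $\delta$, which is the sense in which $\delta f_X(x)$ approximates the integral only for small intervals.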

Am I heading in the right direction with this interpretation?

Appendix

The code to generate the dataset is:

import numpy as np

# Generate 200 masses (kg): uniform draws inside each 1 kg bin from 5 to 10.
bin_56 = np.random.uniform(5, 6, 20)
bin_67 = np.random.uniform(6, 7, 48)
bin_78 = np.random.uniform(7, 8, 80)
bin_89 = np.random.uniform(8, 9, 36)
bin_910 = np.random.uniform(9, 10, 16)
population = np.concatenate((bin_56, bin_67, bin_78, bin_89, bin_910))

Footnotes

1: Even though this is the empirical distribution (observed, collected data) from the true population (PDF), we will assume that these 200 data points are our true population, since I am trying to understand the PDF.

2: This means per 1 unit.

  • Yes that's the right way to think of it: with continuous variables, we can't get the probability of any individual event, only the probability of intervals $[a,b]$, and we get this probability by integrating the pdf over $[a,b]$ (a slight generalization of what you have here). – John Madden Nov 05 '22 at 13:30
  • All your histograms are (piecewise) continuous density estimates, not discrete. Further, you should be careful not to mentally conflate a histogram of a sample with the population density you were (at least notionally) sampling from. – Glen_b Nov 05 '22 at 15:09
  • https://stats.stackexchange.com/questions/4220 – whuber Nov 05 '22 at 17:59
  • @Glen_b thanks, are you trying to say that I should not confuse a PDF from a histogram since the former is the true population whereas the latter is an empirical one sampled from the PDF? – nan Nov 06 '22 at 03:36
  • In essence, yes. (Note also that a histogram is not the only possible way to estimate a density.) – Glen_b Nov 06 '22 at 11:21
  • Thanks, I will bear that in mind when crafting this example; I should say that "200" is the "true population" so that the definitions match better. Besides this, is my understanding more or less correct as I attempt to interpret what the PDF at a point x is? – nan Nov 06 '22 at 11:52
