
How to compute the log likelihood?

Let's take a simple example using a normal distribution and scipy to do the work. Assume X is the data and the normal distribution is the model: dist = scipy.stats.norm(X.mean(), X.std()).

From what I read, the log likelihood is supposed to be dist.logpdf(X).sum(), in other words the sum of the logs of the probabilities of each of the samples in X.

The part that puzzles me in this is that the PDF is not a probability: the PDF integrates to 1, but pdf(x) for an individual value x can be larger than 1 (for instance, scipy.stats.norm.pdf(x=0, loc=0, scale=0.1) is almost 4).
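Here is a minimal version of what I mean (X is just an illustrative sample):

import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=100)   # illustrative data

dist = scipy.stats.norm(X.mean(), X.std())     # model built from the data
log_likelihood = dist.logpdf(X).sum()          # sum of log *densities*, not log probabilities
print(log_likelihood)

# a density value can exceed 1 even though the pdf integrates to 1
print(scipy.stats.norm.pdf(x=0, loc=0, scale=0.1))   # almost 4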


1 Answer


You are absolutely correct that, strictly speaking, a pdf is not a probability. But this does not matter. All that matters is that the pdf is larger where the data are "more probable", in other words where the likelihood is largest.

If you wish to have a more mathematically motivated reasoning for what a maximum likelihood estimator is, you could consider that a probability is obtained by

$$\mathrm{pdf}(x) \cdot dx$$

where $dx$ is the usual infinitesimal. Now this suggests your probability $\to 0$, which is again correct. But you can factor out all the little $dx$'s, and you just get a constant factor in front of your likelihood expression. Take the log and that factor becomes an additive constant, which you can throw away.
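Written out for $N$ data points $x_1,\dots,x_N$:

$$\prod_{i=1}^{N}\mathrm{pdf}(x_i)\,dx=\left(\prod_{i=1}^{N}\mathrm{pdf}(x_i)\right)(dx)^{N},$$

so the log of the probability is

$$\sum_{i=1}^{N}\ln\mathrm{pdf}(x_i)+N\ln(dx),$$

and the second term does not depend on the model parameters, so it has no effect on where the maximum lies.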

I personally think the concept of a log likelihood is more easily understood in the context of a Maximum Likelihood Estimator (MLE).

Consider your Gaussian example. (Maybe even try this yourself in Python.)

  • Generate a set of numbers drawn from a Gaussian distribution with mean 0 and standard deviation 1.
  • Consider that the most probable values drawn from this random number generating process will be clustered around the mean, at 0. There will probably be some in the tails, but these values will occur less often, statistically. (Unless you are extremely unlucky and your distribution happens to be all tail values. We will discuss this later.)
  • To calculate the MLE, assume some initial values for the mean and standard deviation parameters. A sensible choice would be either to measure them from your data using the usual formulas for the mean and variance, or simply to use 0 and 1 as before.
  • Calculate the likelihood, or log-likelihood, for the particular set of data you have, in the way you describe in your question.
  • Now vary the parameters of your model, the mean and the variance. Re-calculate the likelihood and maximize it in an iterative process, or calculate the log-likelihood and maximize that in an iterative process.

When you have found the parameter values which maximize the likelihood or log-likelihood, these are your best estimates of the mean and standard deviation. They will not be exactly 0 and 1.
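As a compact sketch of this procedure (the full worked code is given further down; scipy.stats.norm.fit is scipy's built-in maximum-likelihood fit, shown here only as a cross-check):

import numpy as np
import scipy.stats
import scipy.optimize

rng = np.random.default_rng(1)
data = rng.normal(loc=0.0, scale=1.0, size=10)

def negative_log_likelihood(params, data):
    mean, stddev = params
    return -scipy.stats.norm.logpdf(data, mean, stddev).sum()

# start from the initial guess (0, 1) and maximize the log-likelihood
# by minimizing its negative
result = scipy.optimize.minimize(negative_log_likelihood, x0=[0.0, 1.0], args=(data,))
print(result.x)                    # optimized (mean, stddev)
print(scipy.stats.norm.fit(data))  # scipy's built-in MLE fit, for comparison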

A Real Example

Let's start by generating a batch of numbers drawn from a Gaussian distribution. Here the batch size is 10.

[Figure: histogram of the generated Gaussian batch]

The values in the histogram look plausibly Gaussian. We can immediately see that the mean of this statistical sample will not be exactly zero.

Let's measure the mean and standard deviation in the usual way.

mean=0.234
stddev=0.978
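(These are just numpy's .mean() and .std(); note that .std() uses the 1/N normalization by default, i.e. ddof=0.)

measured_mean = random_numbers.mean()
measured_stddev = random_numbers.std()   # ddof=0 by default: the 1/N normalization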

Now let's compute some likelihood values.

|   method |  mean |   std | likelihood |
| -------- | ----- | ----- | ---------- |
|    fixed |   0.0 |   1.0 |   6.50e-07 |
| measured | 0.234 | 0.978 |   8.59e-07 |
| optimize | 0.234 | 0.978 |   8.59e-07 |

To explain each of these rows: the first row is the calculation with a fixed mean of 0 and a fixed standard deviation of 1; it produces the worst (smallest) likelihood. The second row uses the measured values of the mean and standard deviation, calculated with numpy's .mean() and .std() functions. The final row is supposed to be an optimization of the MLE, but it doesn't achieve anything here because the minimization procedure fails due to a lack of numerical precision. This shows why minimizing the negative log-likelihood is preferred over minimizing the negative likelihood.

Note: in Python (scipy) we can only minimize functions; there is no routine for maximizing, so we minimize the negative instead. Other libraries in other languages may differ.
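The precision problem is easy to demonstrate directly: with even a moderately large sample, the product of many small pdf values underflows to zero in double precision, while the sum of log-pdf values remains well behaved. A small illustration (the sample size here is chosen only for demonstration):

import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 2000)

pdf_values = scipy.stats.norm.pdf(data, 0.0, 1.0)
print(pdf_values.prod())                              # 0.0 -- the product underflows
print(scipy.stats.norm.logpdf(data, 0.0, 1.0).sum())  # an ordinary finite negative number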

Now, let's replace the likelihood with the log likelihood and repeat the procedure.

|   method |  mean |   std | log-likelihood |
| -------- | ----- | ----- | -------------- |
|    fixed |   0.0 |   1.0 |       -14.2461 |
| measured | 0.234 | 0.978 |       -13.9679 |
| optimize | 0.234 | 0.978 |       -13.9679 |

This time the fit does actually converge (it performs more than just a few iterations), but the resulting parameter values are the same as the measured ones.

Interestingly, the parameter values calculated from the formulas for the mean and standard deviation agree exactly with the optimization result. I must admit I didn't fully anticipate this, since in many applications with real data this does not happen.

We can now plot our results as well.

[Figure: fitted curves (original, measured, optimized) overlaid on the data histogram]

Finally, if you repeat this with a different seed for the random number generator, you will obtain different results.

[Figure: fit results for a different random seed]

Code

#!/usr/bin/env python3

import numpy
import matplotlib.pyplot as plt
import scipy.stats
import scipy.optimize


def calculate_likelihood(data, mean, stddev):
    # likelihood: product of the pdf values of all data points
    pdf_values = scipy.stats.norm.pdf(data, mean, stddev)
    likelihood = pdf_values.prod()
    return likelihood


def calculate_log_likelihood(data, mean, stddev):
    # log-likelihood: sum of the log pdf values of all data points
    pdf_values = scipy.stats.norm.pdf(data, mean, stddev)
    log_pdf_values = numpy.log(pdf_values)
    log_likelihood = log_pdf_values.sum()
    return log_likelihood


def optimize_function(p, data):
    # scipy only minimizes, so return the negative (log-)likelihood
    mean = p[0]
    stddev = p[1]

    #return -calculate_likelihood(data, mean, stddev)
    return -calculate_log_likelihood(data, mean, stddev)


def main():

    # draw a batch of numbers from a Gaussian distribution
    mean = 0.0
    stddev = 1.0
    batch_size = 10
    numpy.random.seed(5)
    random_numbers = numpy.random.normal(mean, stddev, batch_size)

    # plot random numbers, using fine histogram binning
    figure = plt.figure()
    ax = figure.add_subplot(1, 1, 1)
    ax.hist(random_numbers, bins=200, range=(-5 * stddev, 5 * stddev))
    figure.savefig('gaussian_batch.png', dpi=300)

    # measure the mean and standard deviation directly from the data
    measured_mean = random_numbers.mean()
    measured_stddev = random_numbers.std()

    print(f'{measured_mean}')
    print(f'{measured_stddev}')

    # maximize the log-likelihood by minimizing its negative
    #x0 = [measured_mean, measured_stddev]
    x0 = [0.0, 1.0]
    optimize_result = scipy.optimize.minimize(
        optimize_function, x0, method='BFGS', args=(random_numbers,),
        options={'disp': True}, tol=1.0e-10)
    likelihood_value = -optimize_result['fun']   # maximized log-likelihood
    optimized_mean = optimize_result['x'][0]
    optimized_stddev = optimize_result['x'][1]

    print(f'{0.0}, {1.0}, {calculate_log_likelihood(random_numbers, 0.0, 1.0)}')
    print(f'{measured_mean}, {measured_stddev}, {calculate_log_likelihood(random_numbers, measured_mean, measured_stddev)}')
    print(f'{optimized_mean}, {optimized_stddev}, {calculate_log_likelihood(random_numbers, optimized_mean, optimized_stddev)}')

    # plot results: scale each pdf by the batch size to overlay it on the histogram
    figure = plt.figure()
    ax = figure.add_subplot(1, 1, 1)
    x = numpy.linspace(-5 * stddev, 5 * stddev, 1000)
    p1 = ax.plot(x, batch_size * scipy.stats.norm.pdf(x, 0.0, 1.0))
    p2 = ax.plot(x, batch_size * scipy.stats.norm.pdf(x, measured_mean, measured_stddev))
    p3 = ax.plot(x, batch_size * scipy.stats.norm.pdf(x, optimized_mean, optimized_stddev))
    (_, _, p4) = ax.hist(random_numbers, bins=200, range=(-5 * stddev, 5 * stddev))
    ax.legend(handles=[p1[0], p2[0], p3[0], p4[0]],
              labels=['original', 'measured', 'optimized', 'data'])
    figure.savefig('gaussian_batch_fit_results.png', dpi=300)


if __name__ == '__main__':
    main()

Edit: Addendum - why maximum likelihood produces the same values for $\mu$ and $\sigma$ as measuring them with the standard formulas:

I had a further thought about this, but didn't quite get to the end of a mathematical proof.

  • It is perhaps not surprising that the Maximum Likelihood method produces the same values for the mean and standard deviation as measuring using the regular formulas.

$$\mu=\frac{1}{N}\sum_{i}x_i$$

$$\sigma=\sqrt{\frac{1}{N-1}\sum_{i}\left(x_i-\mu\right)^2}$$

  • The reason is that to maximize a function, one finds the points where its derivative is equal to zero.

The likelihood function has this form

$$\prod_i\;\mathrm{pdf}(x_i;\mu,\sigma)$$

which for a Gaussian is essentially a product of exponential functions. The same result is obtained by maximizing the log likelihood, in which case the product becomes a sum. The log-pdf of each point is

$$\ln\mathrm{pdf}(x_i;\mu,\sigma)=-\ln\!\left(\sigma\sqrt{2\pi}\right)-\frac{1}{2}\left(\frac{x_i-\mu}{\sigma}\right)^2,$$

so the log likelihood is

$$\sum_i\left[-\ln\!\left(\sigma\sqrt{2\pi}\right)-\frac{1}{2}\left(\frac{x_i-\mu}{\sigma}\right)^2\right].$$

We should be able to take the partial derivatives of this with respect to $\mu$ and $\sigma$, set them to zero, and get back the usual equations for $\mu$ and $\sigma$. I wasn't quite able to complete this, but I think this is the general principle to follow.

I added this as further information to point someone in the right direction should they wish to follow the issue and try to obtain a concrete proof of this.
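For reference, here is a sketch of how that derivation finishes, by setting the partial derivatives of the log likelihood to zero:

$$\frac{\partial}{\partial\mu}\sum_i\left[-\ln\!\left(\sigma\sqrt{2\pi}\right)-\frac{1}{2}\left(\frac{x_i-\mu}{\sigma}\right)^2\right]=\frac{1}{\sigma^2}\sum_i\left(x_i-\mu\right)=0\quad\Rightarrow\quad\hat\mu=\frac{1}{N}\sum_i x_i$$

$$\frac{\partial}{\partial\sigma}\sum_i\left[-\ln\!\left(\sigma\sqrt{2\pi}\right)-\frac{1}{2}\left(\frac{x_i-\mu}{\sigma}\right)^2\right]=-\frac{N}{\sigma}+\frac{1}{\sigma^3}\sum_i\left(x_i-\hat\mu\right)^2=0\quad\Rightarrow\quad\hat\sigma^2=\frac{1}{N}\sum_i\left(x_i-\hat\mu\right)^2$$

Note that the maximum-likelihood estimate of $\sigma$ comes out with the $1/N$ normalization, which is exactly what numpy's .std() computes by default (ddof=0). That is why the measured and optimized values in the tables above agree to every digit shown.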

  • Thanks for the thorough answer. The key part that I missed is that the dx can be thrown away at the end... This makes perfect sense now. – brice rebsamen Nov 13 '23 at 10:14