What are the main differences between a QQ plot and a probability plot for measuring normality?

Question

I am trying to evaluate the normality of the distribution of my model's residuals.

I have been using statsmodels.api.qqplot and scipy.stats.probplot in Python, but they produce different axes, which gives different impressions when visually inspecting the "closeness" of the distribution to a normal distribution.

The scipy.stats.probplot function plots the ordered residual values against theoretical quantiles, whereas statsmodels.api.qqplot plots the sample quantiles against theoretical quantiles.

I am unsure of the relative merits / deficiencies / uses of both plots, and the literature online seems to use P-P plot, probability plot and Q-Q plot interchangeably. Additionally, there are a number of posts suggesting the use of scipy.stats.probplot for plotting QQ plots.

If I use the SciPy plot, my data seems visually very close to the normal reference line; however, it looks far from close in the statsmodels plot.

What are the relative merits of each for measuring normality? Which should I use?

Many thanks for any help.

Please see the code I used and images attached below:

Statsmodels

import statsmodels.api as sm
import matplotlib.pyplot as plt
sm.qqplot(residuals, line="45")
plt.title("Statsmodels")

SciPy

from scipy import stats
import matplotlib.pyplot as plt
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("SciPy")

The x to y aspect ratio is different in these two plots, so it's hard to compare them visually, but they seem to be the same QQ plot. The red lines are not the same. In the statsmodels qqplot, it is the y = x line; in the SciPy probplot the intercept "looks" to be 0 but the slope is not 1. — dipetkov, Aug 15 '22 at 11:08
The ordered values (order statistics) are the sample quantiles, as the QQ plot is a visualization of the empirical cumulative distribution function (CDF). See "Best way to construct a QQ-plot". — dipetkov, Aug 15 '22 at 11:17
Thanks, I have found this post, that seems to be answering more-or-less the same question.
https://stackoverflow.com/questions/48108582/how-to-interpret-scipy-stats-probplot-results#comment83192909_48108582

It seems that the above two graphs are plotting essentially the same thing, but with different y-axis scales. I suppose my question is: which, if either, of these graphs is valid for determining the normality of the residuals' distribution, given that they give very different visual indications? — Archie, Aug 15 '22 at 12:09
AFAICS, the statsmodels version does not standardize the data by loc and scale, or does not use the estimated mean and variance in the theoretical quantiles. That is, the data looks normally distributed, but not with mean=0 and variance=1. related https://stats.stackexchange.com/questions/585310/calculation-of-quantiles-with-fitted-parameters-in-python — Josef, Aug 15 '22 at 17:12
statsmodels qqplot has other line options, e.g. line="s" or line="r" which adjust for loc and scale in the plot — Josef, Aug 15 '22 at 17:15
If I standardise the quantiles the blue line overlaps the red line almost perfectly. Does this mean that I can say that the distribution is normal? — Archie, Aug 15 '22 at 18:01

dipetkov · Accepted Answer · 2022-08-15T18:18:37.327

These two plots are the same QQ plot. However, the aspect ratios and the reference lines are different.

Aside: In the second QQ plot (with better scaling) we see that the sample has a heavier right tail than the Normal and is somewhat skewed. There are a lot of points in this QQ plot, so this indicates a degree of non-normality. You should look at the residual plot as well, i.e., plot the residuals against the fitted values.

In both QQ plots:

[x-axis] The theoretical quantiles are for the $\operatorname{Normal}(0,1)$ aka the standard Normal with mean (location) = 0 and standard deviation (scale) = 1.
[y-axis] The sample quantiles are the ordered values.
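
To make the two axes concrete, here is a minimal sketch that builds the QQ coordinates by hand and compares them with what scipy.stats.probplot returns. The synthetic `residuals` array and the plotting positions (i - 0.5)/n are assumptions for illustration only. SciPy and statsmodels each use their own plotting-position convention, so the theoretical quantiles can differ slightly in the tails.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the model residuals (assumption, not the asker's data).
rng = np.random.default_rng(0)
residuals = rng.normal(loc=0.0, scale=0.3, size=200)

# y-axis: the sample quantiles are simply the ordered values.
sample_q = np.sort(residuals)

# x-axis: standard Normal quantiles evaluated at plotting positions.
# (i - 0.5) / n is one common convention, chosen here for illustration.
n = sample_q.size
positions = (np.arange(1, n + 1) - 0.5) / n
theoretical_q = stats.norm.ppf(positions)

# probplot returns the same kind of coordinates plus a least-squares line.
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")

# The y-coordinates are the sorted sample in both constructions.
print(np.array_equal(osr, sample_q))
# The x-coordinates differ only through the plotting-position convention.
print(np.max(np.abs(osm - theoretical_q)))
```

Whichever convention is used, the points themselves are the same kind of object in both libraries; only the reference line and the axis scaling change the visual impression.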

The first QQ plot is in 1:1 aspect ratio and the line is $y = x$. This is "wrong" for your residuals because they are on a different scale: their observed range is only [-0.6,0.6].

The second QQ plot is in the observed aspect ratio. The line is the OLS fit for residuals ~ theoretical quantiles. That is, both the center and the scale are estimated to get the best fitting line. You may not want to do the adjustment if the residuals are standardized and you expect them to be N(0,1). Or you may adjust the scale only if you expect the residuals to be $\operatorname{Normal}(0,\hat{\sigma})$.
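
As a rough illustration of the two reference lines, the sketch below compares how far the points sit from $y = x$ and from the least-squares line that scipy.stats.probplot reports. The synthetic `residuals` (drawn from a Normal with standard deviation 0.3) are an assumption standing in for real residuals.

```python
import numpy as np
from scipy import stats

# Synthetic residuals on a small scale (assumption for illustration).
rng = np.random.default_rng(1)
residuals = rng.normal(loc=0.0, scale=0.3, size=500)

(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")

# The probplot line is an ordinary least-squares fit of the ordered
# values on the theoretical quantiles; np.polyfit reproduces it.
ols_slope, ols_intercept = np.polyfit(osm, osr, deg=1)

# Root-mean-square distance of the points from each candidate line.
rms_identity = np.sqrt(np.mean((osr - osm) ** 2))
rms_fitted = np.sqrt(np.mean((osr - (intercept + slope * osm)) ** 2))

print(f"fitted line: slope={slope:.3f}, intercept={intercept:.3f}")
print(f"sample sd={residuals.std(ddof=1):.3f}, mean={residuals.mean():.3f}")
print(f"RMS from y = x: {rms_identity:.3f}, from fitted line: {rms_fitted:.3f}")
```

The fitted slope estimates the scale and the intercept estimates the center, which is why the points hug the fitted line but not $y = x$ when the residuals are not on the N(0,1) scale. In statsmodels the line is chosen with the `line` argument of qqplot: "45" draws $y = x$, "r" a regression line, "s" a standardized line, and "q" a line through the quartiles.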

Note that it's possible to specify the location and scale of the theoretical distribution. That's more interpretable than adjusting the aspect ratio.
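
A minimal sketch of that idea with SciPy, assuming synthetic `residuals` and using the sample mean and standard deviation as the location and scale. For `dist="norm"`, the `sparams` tuple is passed on as (loc, scale), so the x-axis shows Normal($\hat{\mu}$, $\hat{\sigma}$) quantiles on the same scale as the residuals.

```python
import numpy as np
from scipy import stats

# Synthetic residuals (assumption for illustration).
rng = np.random.default_rng(2)
residuals = rng.normal(loc=0.0, scale=0.3, size=500)

mu_hat = residuals.mean()
sigma_hat = residuals.std(ddof=1)

# Theoretical quantiles now come from Normal(mu_hat, sigma_hat), not N(0, 1).
(osm, osr), (slope, intercept, r) = stats.probplot(
    residuals, sparams=(mu_hat, sigma_hat), dist="norm"
)

# With matching scales, y = x becomes a sensible reference line.
print(f"slope={slope:.3f}, intercept={intercept:.3f}")

# To my knowledge the statsmodels counterpart is:
#   sm.qqplot(residuals, loc=mu_hat, scale=sigma_hat, line="45")
```

Because both axes are then in the units of the residuals, a 1:1 aspect ratio with the $y = x$ line is directly interpretable.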

Figure: Four QQ plots of the same $n=20$ sample from $\operatorname{Normal}(0,0.3)$, with and without scaling. Plotting $\operatorname{Normal}(\mu,\sigma)$ on the x-axis is equivalent to plotting $(\text{sample}-\mu)/\sigma$ on the y-axis. But usually we don't know the true mean $\mu$ and standard deviation $\sigma$. Instead we use the sample estimates $\hat{\mu}$ and $\hat{\sigma}$ to shift & rotate the sample quantiles, so that they fit the $\operatorname{N}(0,1)$ quantiles as well as possible.
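
The equivalence described in the caption can be checked numerically. The sketch below uses a synthetic $n=20$ sample (an assumption that mimics the figure, not the data behind it) and compares the two options: raw sample against Normal($\hat{\mu}$, $\hat{\sigma}$) quantiles, and standardized sample against N(0,1) quantiles.

```python
import numpy as np
from scipy import stats

# Synthetic n = 20 sample from Normal(0, 0.3) (assumption for illustration).
rng = np.random.default_rng(3)
sample = rng.normal(loc=0.0, scale=0.3, size=20)

mu_hat = sample.mean()
sigma_hat = sample.std(ddof=1)

# Option A: raw sample against Normal(mu_hat, sigma_hat) quantiles.
(x_a, y_a), (slope_a, _, r_a) = stats.probplot(
    sample, sparams=(mu_hat, sigma_hat), dist="norm"
)

# Option B: standardized sample against Normal(0, 1) quantiles.
z = (sample - mu_hat) / sigma_hat
(x_b, y_b), (slope_b, _, r_b) = stats.probplot(z, dist="norm")

# Deviations from y = x in A are those in B multiplied by sigma_hat,
# so both plots show the same pattern of points.
same_pattern = np.allclose(y_a - x_a, sigma_hat * (y_b - x_b))

# Linear rescaling leaves the shape of the sample untouched.
skew_raw = stats.skew(sample)
skew_std = stats.skew(z)

print(same_pattern, np.isclose(slope_a, slope_b), np.isclose(skew_raw, skew_std))
```

Since shifting and scaling do not change the skewness, standardizing can line the bulk of the points up with the reference line without removing any curvature in the tails.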

Thanks for your answer, that's really useful. I have standardised my residuals and now the actual and theoretical samples align very closely. Can I conclude that my data is normally distributed (but without mean = 0 , and not in the scale (0,1))? — Archie, Aug 15 '22 at 18:10
I'm not convinced it's that close to Normal. As I wrote I think the residuals are skewed (and linear scaling cannot change skewness). I would look at more diagnostic plots. — dipetkov, Aug 15 '22 at 18:11
