
Quantile regression allows one to estimate a conditional quantile of $y$ (e.g., the median of $y$) from data $x$.

I do not see any distributional assumptions about $y$ being made. This seems to be in contrast to maximum likelihood estimation, which starts by making an assumption about the distribution of $y$ (e.g., a Gaussian distribution).

Therefore the question: is quantile regression a maximum likelihood method?

If not, what is the broader term for methods like quantile regression?


Additional rewording: What is the rationale of the quantile loss function in quantile regression (of arbitrarily complex models)? Does it rely on the specification of the distributional form of the response variable? And, specifically, is the quantile loss (somehow) a (log-)likelihood function?

Ggjj11

3 Answers


You seem to confuse two closely related yet very different concepts: a regression model (which is a specification of a statistical model) and a parameter estimation method (which is essentially a data-based objective-function formulation plus the numerical procedure used to solve it).

For simplicity, we restrict ourselves to the linear parametric family. A quantile regression (model) models (i.e., approximates) the $\tau$-th conditional quantile of the response $y$ given predictors $x$ as a linear function of parameters: \begin{align} Q_\tau(y|x) = \alpha + \beta'x. \tag{1} \end{align}

Likewise, a mean regression (model) models the conditional mean of the response $y$ given predictors $x$ as a linear function of parameters: \begin{align} E(y|x) = \alpha + \beta'x. \tag{2} \end{align}

In principle (admittedly a somewhat narrow view), $(1)$ on its own is the heart of "quantile regression" -- we do not need to know how the parameters $\alpha$ and $\beta$ will be estimated in order to specify a quantile regression model.

The parameter estimation problem kicks in when a sample $\{(y_i, x_i): i = 1, \ldots, n\}$ is observed. As you may already know, $\alpha$ and $\beta$ are typically estimated by minimizing the sum of check-function losses: \begin{align} (\hat{\alpha}, \hat{\beta}) = \operatorname{argmin}_{\alpha, \beta}\sum_{i = 1}^n \rho_\tau(y_i - \alpha - \beta'x_i), \tag{3} \end{align} which is implemented numerically via linear programming or interior-point algorithms. This is, of course, just one of many parameter estimation methods (when $\tau = 0.5$, it is usually referred to as least absolute deviation estimation), and, as you stated, it does not require any distributional assumption on $y$.
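
To make $(3)$ concrete, here is a minimal numerical sketch; the simulated data, the use of `scipy.optimize.minimize`, and the Nelder-Mead choice are illustrative assumptions, not the standard implementation (in practice one would use a dedicated linear-programming solver such as statsmodels' `QuantReg`):

```python
# Sketch: fit a linear tau-quantile regression by directly minimizing
# the sum of check losses in (3). Data are made up for illustration.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, tau = 500, 0.9
x = rng.uniform(0.0, 10.0, size=n)
y = 1.0 + 2.0 * x + rng.standard_t(df=3, size=n)  # heavy tails; no Gaussian assumption

def objective(params):
    alpha, beta = params
    t = y - alpha - beta * x
    return np.sum(t * (tau - (t < 0)))  # rho_tau(t) = t * (tau - 1{t < 0})

res = minimize(objective, x0=[0.0, 0.0], method="Nelder-Mead")
print("estimated (alpha, beta):", res.x)
```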

The MLE, on the other hand, is another parameter estimation method, and it can be carried out only if one specifies the complete conditional distribution of $y$. That is, to do maximum likelihood estimation, you need to specify a statistical model that is more granular than regression models such as $(1)$ or $(2)$: the complete distribution function contains much more information than the quantile function or the mean function alone (in fact, both $Q_\tau(y|x)$ and $E(y|x)$ can be derived probabilistically from the distribution of $y|x$). For example, consider a statistical model like \begin{align} y \mid x \sim f(\cdot \mid \alpha + \beta'x; \theta), \tag{4} \end{align} where $f$ is some known density function with location $\alpha + \beta'x$ and additional parameter $\theta$. Model $(4)$ then entails the likelihood function (assuming the observations are i.i.d.) \begin{align} L(\alpha, \beta, \theta) = \prod_{i = 1}^n f(y_i \mid \alpha + \beta'x_i; \theta), \end{align} which can be maximized over the parameter space to obtain the MLE.
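
As a sketch of how $(4)$ is used in practice (assuming a Gaussian $f$ and made-up data), minimizing the negative log-likelihood over $(\alpha, \beta, \sigma)$ recovers the least-squares fit of $(2)$:

```python
# Sketch: Gaussian MLE for model (4) coincides with ordinary least squares.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

def neg_log_lik(params):
    alpha, beta, log_sigma = params       # sigma parametrized on the log scale
    sigma = np.exp(log_sigma)
    resid = y - alpha - beta * x
    # negative Gaussian log-likelihood, dropping the constant n/2 * log(2*pi)
    return n * np.log(sigma) + 0.5 * np.sum((resid / sigma) ** 2)

mle = minimize(neg_log_lik, x0=[0.0, 0.0, 0.0]).x
slope, intercept = np.polyfit(x, y, deg=1)  # OLS fit, for comparison
print("MLE (alpha, beta):", mle[0], mle[1])
print("OLS (alpha, beta):", intercept, slope)
```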

It is well known that when $f$ is Gaussian, model $(4)$ implies the mean regression model $(2)$, and when $f$ is the (asymmetric) Laplace, model $(4)$ implies the quantile regression model $(1)$. For other conditional distributions $f$, $(1)$ or $(2)$ is in general not nested in $(4)$ (that is, neither $Q_\tau(y|x)$ nor $E(y|x)$ admits the simple linear form $\alpha + \beta'x$ under $(4)$).
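
To spell out the Laplace connection (a sketch, using the standard "quantile" parametrization of the asymmetric Laplace density): if \begin{align} f(y \mid x) = \frac{\tau(1-\tau)}{\sigma}\exp\!\left(-\rho_\tau\!\left(\frac{y - \alpha - \beta'x}{\sigma}\right)\right), \end{align} then, using the positive homogeneity $\rho_\tau(t/\sigma) = \rho_\tau(t)/\sigma$, the negative log-likelihood of a sample is \begin{align} -\log L(\alpha, \beta, \sigma) = n\log\frac{\sigma}{\tau(1-\tau)} + \frac{1}{\sigma}\sum_{i=1}^n \rho_\tau(y_i - \alpha - \beta'x_i), \end{align} so for fixed $\sigma$, maximizing this likelihood is exactly the minimization problem $(3)$.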

In summary, the question "Is quantile regression a maximum likelihood method?" is somewhat ill-posed: the former is a statistical model, while the latter is a parameter estimation method that depends on a statistical model more granular than the quantile regression model. In this sense, the two concepts are not comparable. If by "what is the broader term for methods like quantile regression?" you meant "what is the parameter estimation method typically used to estimate $(\alpha, \beta)$ in $(1)$?", then the answer is $(3)$. I am not sure there is a universally accepted term for this minimization problem, but it may be fine to call it "least $L^1$ estimation", since $\rho_\tau(t) = t(\tau - I_{(-\infty, 0)}(t))$ is a piecewise-linear function.

Zhanxiong
  • Do you or does anyone else know what distributions correspond to quantiles other than the median? – BigBendRegion Mar 01 '23 at 02:44
  • @BigBendRegion I couldn't follow your question very well. Can you elaborate? – Zhanxiong Mar 01 '23 at 02:51
  • The asymmetric Laplace distribution has a quantile that is the MLE of a parameter: https://en.wikipedia.org/wiki/Asymmetric_Laplace_distribution#Alternative_parametrization – Thomas Lumley Mar 01 '23 at 04:11
  • Common terms for the loss function used in quantile regression are "pinball loss", or "quantile", "linlin", "hinge", "tick" or "newsvendor" loss. Any of these search terms will likely get more hits than "least $L^1$ estimation" or "asymmetric Laplace loss" or similar. – Stephan Kolassa Mar 01 '23 at 07:14
  • @Zhanxiong my question is if the quantile loss function is a negative (log-)likelihood function which is to be minimized. In the quantile loss derivation https://stats.stackexchange.com/a/252043/298651 no explicit assumption about a probability distribution is made (and therefore no likelihood is computed). Could you please add more information here? If the quantile loss is (implicitly??) assuming the asymmetric Laplace distribution, what is so special about this distribution that it appears in the context of quantiles? – Ggjj11 Mar 01 '23 at 22:01
  • @Ggjj11 As the link in your comment demonstrates, the quantile loss is derived without requiring any specific parametric distributional form of $y$. The key observation in discovering the quantile loss is that in the one-sample problem, the population quantile minimizes the expected check function, i.e., $Q_\tau(Y) = \operatorname{argmin}_{q} E[\rho_\tau(Y - q)]$. – Zhanxiong Mar 01 '23 at 22:12
  • The point of mentioning the asymmetric Laplace distribution in the answer is that when the conditional distribution of $y$ is asymmetric Laplace, the negative log-likelihood coincides with the quantile loss. Your statement "my question is if the quantile loss function is a negative (log) likelihood function which is to be minimized" is again ill-posed -- if you do not specify the conditional distribution, how are you able to even write down the "negative (log) likelihood function"? I hope you have carefully read through my answer (and that you are not the one who downvoted :)). – Zhanxiong Mar 01 '23 at 22:16
  • If I understand your real intention correctly, you may have asked your question like this: "How is the quantile loss function $(3)$ discovered?" or "What is the rationale of the quantile loss function $(3)$ in quantile regression? Does it rely on the specification of the distributional form of the response variable?" Either alternative is much clearer than your original post (which unnecessarily brought in the MLE topic). – Zhanxiong Mar 01 '23 at 22:24
  • If you want, I could rephrase: "What is the rationale of the quantile loss function (3) in quantile regression (of arbitrarily complex models)? Does it rely on the specification of the distributional form of the response variable? And, specifically, is the quantile loss (somehow) a (log-)likelihood function?" In case you find my wording strange, please think about how the mean squared error is derived from a likelihood function assuming a normal distribution in MLE. Surely you understand. – Ggjj11 Mar 01 '23 at 22:37
  • @Ggjj11 No; like the quantile loss function, the squared loss function does not rely on the normality assumption. It is derived from $E[Y] = \operatorname{argmin}_u E[(Y - u)^2]$, which holds for any distribution of $Y$ (with finite variance). The relationship between the normal distribution and the squared loss is precisely analogous to the relationship between the asymmetric Laplace and the quantile loss. – Zhanxiong Mar 01 '23 at 22:40

It depends on the loss function you are trying to minimize. In MLE you are not always assuming that the distribution is Gaussian; there is a direct relation between the distribution you assume and the loss function you minimize:

  • Gaussian distribution $\propto e^{-(x-\mu)^2}$ implies $L^2$ loss
  • Laplace distribution $\propto e^{-|x-\mu|}$ implies $L^1$ loss

In the case of the Gaussian distribution,

$$ P(y|x) = N(y| f(x;\theta), \sigma^2) $$

where $\sigma$ is fixed. Therefore the maximum likelihood estimate is

$$ \theta^* = \operatorname{argmax}_\theta \prod_i P(y_i|x_i) = \operatorname{argmax}_\theta \sum_i \log P(y_i | x_i) $$

and therefore

$$ \theta^* = \operatorname{argmax}_\theta \left( -n \log \sigma - \frac{n}{2} \log (2\pi) - \frac{1}{2} \sum_i \left(\frac{y_i - f(x_i, \theta)}{\sigma}\right)^2 \right) $$

which, after removing the constant terms, is

$$ \theta^* = \operatorname{argmax}_\theta -\sum_i \left(y_i - f(x_i, \theta)\right)^2 $$

which is the standard $L^2$ loss function for regression problems. So, as you can see, by defining a loss function you are also defining which distribution you assume for your data.

The case of quantile regression is the same: depending on your loss function, you will implicitly be specifying an underlying distribution.
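
A small numeric check of this loss/estimator correspondence (a sketch; the skewed sample and the use of `scipy.optimize.minimize_scalar` are illustrative assumptions): the $L^2$ minimizer matches the sample mean, the $L^1$ minimizer matches the sample median, and the pinball-loss minimizer matches the empirical $\tau$-quantile.

```python
# Each loss function's minimizer recovers the corresponding summary
# statistic, with no Gaussian assumption on the data.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(42)
y = rng.exponential(scale=2.0, size=100_000)  # deliberately skewed data
tau = 0.9

def pinball(u):
    t = y - u
    return np.mean(t * (tau - (t < 0)))  # rho_tau(t) = t * (tau - 1{t < 0})

l2_argmin = minimize_scalar(lambda u: np.mean((y - u) ** 2)).x
l1_argmin = minimize_scalar(lambda u: np.mean(np.abs(y - u))).x
q_argmin = minimize_scalar(pinball).x

print(l2_argmin, y.mean())            # L2 minimizer ~ sample mean
print(l1_argmin, np.median(y))        # L1 minimizer ~ sample median
print(q_argmin, np.quantile(y, tau))  # pinball minimizer ~ tau-quantile
```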

alexmolas
  • Thank you! I found that it is the asymmetric Laplace distribution which is used to construct the likelihood function, and therefore the log-likelihood function, used in quantile regression. I am curious about the reasons why the asymmetric Laplace distribution is useful for modeling quantiles. Which property is exploited here? – Ggjj11 Mar 01 '23 at 18:50
  • whuber argued here https://stats.stackexchange.com/questions/251600/quantile-regression-loss-function that the quantile loss arises automatically from basic considerations. Where did he (implicitly) assume an asymmetric Laplace distribution? I cannot see this. – Ggjj11 Mar 01 '23 at 21:53

Quantile regression is not necessarily a maximum likelihood method (although it can be one, when a working likelihood such as the asymmetric Laplace distribution is used).

Performing empirical risk minimization with the quantile loss is not (necessarily) a maximum likelihood estimation method.

You can simply minimize this particular risk and show, for arbitrary distributions, that doing so results in an estimate of the quantile: https://en.m.wikipedia.org/wiki/Empirical_risk_minimization

The proof proceeds by (a worked version of steps 2 to 4 is sketched after the list):

  1. defining the risk as the expected loss: $R_\tau(\hat{y}) = \int_{\mathbb{R}} p(y)\, w_\tau(y, \hat{y})\, |y - \hat{y}|\, \mathrm{d}y$, where $w_\tau(y, \hat{y}) = \tau$ if $y \geq \hat{y}$ and $1 - \tau$ otherwise
  2. splitting the integral at $\hat{y}$ to resolve the absolute value $|y - \hat{y}|$
  3. taking the derivative with respect to $\hat{y}$ and setting it to zero
  4. identifying the definition of the $\tau$-quantile
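
Carrying out steps 2 to 4 explicitly (a sketch, assuming $p$ has cumulative distribution function $F$): \begin{align} R_\tau(\hat{y}) = (1 - \tau)\int_{-\infty}^{\hat{y}} (\hat{y} - y)\, p(y)\, \mathrm{d}y + \tau \int_{\hat{y}}^{\infty} (y - \hat{y})\, p(y)\, \mathrm{d}y, \end{align} so, by the Leibniz rule (the boundary terms vanish), \begin{align} \frac{\mathrm{d}R_\tau}{\mathrm{d}\hat{y}} = (1 - \tau)F(\hat{y}) - \tau\bigl(1 - F(\hat{y})\bigr) = F(\hat{y}) - \tau = 0 \quad\Longrightarrow\quad F(\hat{y}) = \tau, \end{align} i.e., the minimizer is exactly the $\tau$-quantile.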

All of this works without assuming a specific probability density $p$, so it is definitely not a maximum likelihood estimation.

This is also how one shows that minimizing the expected squared error estimates the (conditional) mean, and minimizing the mean absolute error estimates the (conditional) median.

Ggjj11