In Bishop's Pattern Recognition and Machine Learning I read the following, just after the probability density $p(x\in(a,b))=\int_a^bp(x)\textrm{d}x$ was introduced:

Under a nonlinear change of variable, a probability density transforms differently from a simple function, due to the Jacobian factor. For instance, if we consider a change of variables $x = g(y)$, then a function $f(x)$ becomes $\tilde{f}(y) = f(g(y))$. Now consider a probability density $p_x(x)$ that corresponds to a density $p_y(y)$ with respect to the new variable $y$, where the suffices denote the fact that $p_x(x)$ and $p_y(y)$ are different densities. Observations falling in the range $(x, x + \delta x)$ will, for small values of $\delta x$, be transformed into the range $(y, y + \delta y)$ where $p_x(x)\delta x \simeq p_y(y)\delta y$, and hence $p_y(y) = p_x(x) \left|\frac{dx}{dy}\right| = p_x(g(y)) \left|g'(y)\right|$.

What is the Jacobian factor, and what exactly does all of this mean (maybe qualitatively)? Bishop says that a consequence of this property is that the concept of the maximum of a probability density depends on the choice of variable. What does this mean?

To me this all comes a bit out of the blue (considering it's in the introductory chapter). I'd appreciate some hints, thanks!

Akimiya
  • 3
  • 2
ste
  • 514

1 Answer

I suggest reading the solution to Question 1.4, which provides good intuition.

In a nutshell, suppose you have an arbitrary function $f(x)$ and two variables $x$ and $y$ related by $x = g(y)$. You can then find the maximum of the function either by analyzing $f(x)$ directly, $\hat{x} = \arg\max_x f(x)$, or by analyzing the transformed function $f(g(y))$, $\hat{y} = \arg\max_y f(g(y))$. Not surprisingly, $\hat{x}$ and $\hat{y}$ are related by $\hat{x} = g(\hat{y})$ (here I assume that $g'(y) \neq 0$ for all $y$).
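As a quick numerical check of this claim, here is a minimal sketch. The particular $f$ and $g$ are illustrative choices of mine, not taken from Bishop; $g(y) = y^3 + y$ was picked only because $g'(y) = 3y^2 + 1$ is never zero. The maximizers are located by brute-force grid search.

```python
import numpy as np

# Hypothetical example functions (not from the book).
def f(x):
    # An ordinary function with a single maximum at x = 2.
    return -(x - 2.0) ** 2

def g(y):
    # Nonlinear change of variable x = g(y); g'(y) = 3y^2 + 1 > 0 everywhere.
    return y ** 3 + y

# Dense grids for a brute-force search of the maximizers.
x_grid = np.linspace(-10.0, 10.0, 200001)
y_grid = np.linspace(-2.0, 2.0, 200001)

x_hat = x_grid[np.argmax(f(x_grid))]      # maximizer of f(x)
y_hat = y_grid[np.argmax(f(g(y_grid)))]   # maximizer of f(g(y))

# For a plain function the two maximizers map onto each other.
print("x_hat    =", x_hat)
print("y_hat    =", y_hat)
print("g(y_hat) =", g(y_hat))
```

Up to grid resolution, `g(y_hat)` should coincide with `x_hat`: changing variables only relabels the horizontal axis and does not alter the function values.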

This is not the case for probability densities. If you have a density $p_x(x)$ and two random variables related by $x = g(y)$, then in general $\hat{x} = \arg\max_x p_x(x)$ and $\hat{y} = \arg\max_y p_y(y)$ do not satisfy $\hat{x} = g(\hat{y})$. This happens because of the Jacobian factor $|g'(y)|$, which measures how much $g$ locally stretches or compresses volume (length, in one dimension). Since $p_y(y) = p_x(g(y))\,|g'(y)|$, this factor reweights the density and can move its maximum, unless $g$ is linear so that $|g'(y)|$ is constant.
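To see the shift numerically, here is a minimal sketch under assumptions of my own (they are not the setup of Question 1.4): $p_x$ is a Gaussian with `mu = 0` and `sigma = 0.5`, and the change of variable is $x = g(y) = \ln y$, so that $p_y$ is a log-normal density. The modes are found by grid search.

```python
import numpy as np

# Assumed parameters for illustration only.
mu, sigma = 0.0, 0.5

def p_x(x):
    # Gaussian density in the variable x.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def g(y):
    # Change of variable x = g(y), defined for y > 0.
    return np.log(y)

def g_prime(y):
    # Derivative of g; its absolute value is the Jacobian factor.
    return 1.0 / y

def p_y(y):
    # Transformed density: p_y(y) = p_x(g(y)) * |g'(y)|.
    return p_x(g(y)) * np.abs(g_prime(y))

x_grid = np.linspace(-3.0, 3.0, 200001)
y_grid = np.linspace(0.01, 5.0, 200001)

x_hat = x_grid[np.argmax(p_x(x_grid))]          # mode of p_x
y_hat = y_grid[np.argmax(p_y(y_grid))]          # mode of p_y
y_naive = y_grid[np.argmax(p_x(g(y_grid)))]     # maximizer without the Jacobian

print("x_hat      =", x_hat)
print("g(y_hat)   =", g(y_hat))
print("g(y_naive) =", g(y_naive))
```

Here `g(y_naive)` agrees with `x_hat`, as it would for an ordinary function, but `g(y_hat)` does not: the factor $1/y$ pulls the mode of $p_y$ down to roughly $e^{\mu - \sigma^2} \approx 0.78$, the known mode of a log-normal, rather than $e^{\mu} = 1$.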

MajidL
  • 226