
I have to find the answer to the following question, but I am struggling:

Consider a leaf of a decision tree that consists of object-label pairs $(x_{1}, y_{1}), \dots, (x_{n}, y_{n})$.

The prediction $\hat{y}$ of this leaf is defined to minimize the loss on the training samples.

Find the optimal prediction in the leaf, for a regression tree, i.e. $y_{i} \in \mathbb{R}$, and squared percentage error loss $\mathcal{L}(y, \hat{y}) = \cfrac{\left(y - \hat{y} \right)^{2}}{y^2}$.

I tried simply taking the derivative of the loss function and setting it to 0, which only yields $y=\hat{y}$, which cannot be the result. Intuitively, something like the mean value of the observations present in the leaf should come out, right?

ghxk
  • Hint: yes, if there is only a single $y$ in the terminal leaf, then of course the optimal prediction is equal to that value: $\hat{y}=y$. However, we will usually have multiple training samples $y_1, \dots, y_n$ in the terminal leaf, which we want to summarize using a single prediction value $\hat{y}$. Which will minimize the loss? (This will depend on how the loss is summarized over multiple training observations; you can probably assume it's averaged. Also, let's hope very much that none of the $y_i=0$.) – Stephan Kolassa Dec 13 '21 at 07:38
  • Incidentally, that the answer will probably not be the mean of the training observations is related to one issue with the Mean Absolute Percentage Error (MAPE), which is quite similar to your loss function (just without the squaring). – Stephan Kolassa Dec 13 '21 at 07:52
  • Thanks @StephanKolassa. Having read the article, do you mean that the problem is that my loss function is not differentiable at $y_i = \hat{y}_i$? – ghxk Dec 13 '21 at 08:20
  • No, your loss is quite nicely differentiable, in contrast to the MAPE. But note that there is no subscript $i$ for your prediction $\hat{y}$. You have multiple training instances $y_1, \dots, y_n$ in your leaf and want to summarize them with a single prediction $\hat{y}$. Just set up the loss function, averaging over the training samples, and differentiate. – Stephan Kolassa Dec 13 '21 at 08:26
  • @Stephan Kolassa: What do you mean by "averaging over the training samples"? If I just differentiate the sum of my loss function with respect to $\hat{y}$, I get exactly $\hat{y} = \sum\limits_{i = 1}^n y_i/n$ – ghxk Dec 13 '21 at 08:32
  • Sorry to contradict you, but no, you don't. Go through your differentiation once more. – Stephan Kolassa Dec 13 '21 at 08:34
  • Sorry for not seeing your point, but what is wrong in the following derivation? $\frac {\partial \sum\limits_{i = 1}^n \frac{(y_i-\hat{y})^2}{y_i^2}}{\partial \hat{y}} = \sum\limits_{i = 1}^n \frac{-2(y_i-\hat{y})^2}{y_i^2} = 0$. Multiplying both sides by $-\sum\limits_{i = 1}^n y_{i}^2$ gives $\sum\limits_{i = 1}^n 2(y_i-\hat{y})$. Solving for $\hat{y}$ gives the above-mentioned result. – ghxk Dec 13 '21 at 08:52
  • Your differentiation under the sum is incorrect: $$\frac{d}{d\hat{y}}\frac{(y_i-\hat{y})^2}{y_i^2} = \frac{-2(y_i-\hat{y})}{y_i^2}.$$ You have a power of $2$ that goes away when you differentiate. Can you take it from there? – Stephan Kolassa Dec 13 '21 at 08:55
  • Yes, sorry, that was a typo. So the correct differentiation is $\frac {\partial \sum\limits_{i = 1}^n \frac{(y_i-\hat{y})^2}{y_i^2}}{\partial \hat{y}} = \sum\limits_{i = 1}^n \frac{-2(y_i-\hat{y})}{y_i^2} = 0$. But when I solve this, I still get the mean... as far as I can see ;( – ghxk Dec 13 '21 at 09:12

1 Answer


Let us assume that we have training observations $y_1, \dots, y_n$ in the leaf, all of which are nonzero. Let us further assume that we summarize losses using their sum, which has the same minimizer as their average:
$$ L = \sum_{i=1}^n \frac{(y_i-\hat{y})^2}{y_i^2}. $$
To minimize the loss, we take the derivative with respect to $\hat{y}$ and set it to zero:
$$ \frac{d}{d\hat{y}}L = \frac{d}{d\hat{y}}\sum_{i=1}^n \frac{(y_i-\hat{y})^2}{y_i^2} = \sum_{i=1}^n \frac{-2(y_i-\hat{y})}{y_i^2}\stackrel{!}{=}0,$$
or
$$ 0= \sum_{i=1}^n \frac{y_i-\hat{y}}{y_i^2} = \sum_{i=1}^n \frac{y_i}{y_i^2}-\sum_{i=1}^n \frac{\hat{y}}{y_i^2}= \sum_{i=1}^n \frac{1}{y_i}-\hat{y}\sum_{i=1}^n \frac{1}{y_i^2}, $$
resulting in
$$ \hat{y}=\frac{\sum_{i=1}^n \frac{1}{y_i}}{\sum_{i=1}^n \frac{1}{y_i^2}}. $$
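This ratio can equivalently be read as a weighted average of the observations, with weights $w_i = 1/y_i^2$:
$$ \hat{y} = \frac{\sum_{i=1}^n w_i\, y_i}{\sum_{i=1}^n w_i}, \qquad w_i = \frac{1}{y_i^2}. $$
For positive observations, the smallest $y_i$ receive by far the largest weights and pull the prediction down, which is why the optimum will typically sit well below the plain mean, as the example below illustrates.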

As an example, let us simulate $y_1, \dots, y_n\sim U[0,1]$ and find the optimal $\hat{y}$ both numerically and using our formula:

# simulate training observations in the leaf
nn <- 100
set.seed(1)
yy <- runif(nn)

# minimize the summed squared percentage error numerically...
yhat_numerical <- optimize(f=function(yhat)sum((yhat-yy)^2/yy^2), interval=c(0,1))$minimum
# ...and compare to the closed-form solution derived above
yhat_theory <- sum(1/yy)/sum(1/yy^2)

Happily, both agree:

> yhat_numerical 
[1] 0.04473853
> yhat_theory 
[1] 0.04473853

Also, the optimal prediction is far away from the "intuitive" prediction, which is just the mean of the training samples:

> mean(yy)
[1] 0.5178471
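
As a quick sanity check (a minimal sketch reusing the simulated yy from above), we can evaluate the summed squared percentage error at both candidate predictions; the loss at the mean should be dramatically larger, since the terms with small $y_i$ blow up:

spe_loss <- function(yhat) sum((yhat - yy)^2 / yy^2)  # summed squared percentage error
spe_loss(yhat_theory)  # loss at the optimal prediction
spe_loss(mean(yy))     # loss at the sample mean: much larger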

This illustrates that the "best" point forecast depends on the error or accuracy measure (Kolassa, 2020, IJF).
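
For comparison, a small sketch using the same numerical optimization but with plain squared error loss recovers (approximately) the sample mean as the optimal prediction, in line with the usual result for squared error:

yhat_se <- optimize(f=function(yhat)sum((yhat-yy)^2), interval=c(0,1))$minimum
yhat_se  # numerically close to mean(yy)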

Stephan Kolassa