
[Histogram of errors for each of the 15 networks]

Let's say I have the histogram above, which reports the performance of some neural networks. The y-axis is the count in each bin, while the x-axis is the error, so low error = high performance.

Out of these 15 networks, how would I go about selecting the best one? Looking at the mean or the median error comes to mind, but I can think of many cases where the mean or the median could be misleading. Is there a way to pick the consistently high-performing network?

Here it is obvious that MLP 10 has the best performance, but sometimes evaluating a network becomes much harder.

The data is not necessarily normally distributed.

ovunctuzel

2 Answers


Those histograms are not very useful for the task. Don't get me wrong: histograms of errors can be helpful for taking a more in-depth look at the errors (e.g. are they all small, or do they have long tails?), but by themselves they are usually not used for picking the best model.

One problem with histograms is that judging them is a bit subjective. For example, compare the models MLP_7 and MLP_8. Would you pick a model that has slightly fewer very small errors and a few small errors (MLP_7), or a model that has more very small errors but more errors that are larger than the previous model (MLP_8)? It is hard to say, especially if you don't know the exact readings of the bar heights on the $y$-axis and the size of the errors on the $x$-axis. To make it less subjective, you could calculate the expected value of the errors, which could be approximated by using the histogram to calculate the weighted mean

$$ E[x] = \int x \, p(x) \, dx \approx \frac{\sum_i x_i h(x_i)}{\sum_i h(x_i)} $$

where $h(x)$ is the histogram height for $x$. But you don't need a histogram for this. Instead, as noted by @boomkin in the other answer, just use some error metric, like mean squared error, that would tell you about the errors "on average".
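If you did want to approximate that expected value directly from a histogram, a minimal NumPy sketch (with made-up bin edges and bar heights) could look like this:

```python
import numpy as np

# Made-up histogram data: bar heights (counts) and bin edges.
heights = np.array([120, 30, 10, 4, 1])
edges = np.array([0, 250, 500, 1000, 2000, 10000])

# Use bin midpoints as the representative error values x_i.
midpoints = (edges[:-1] + edges[1:]) / 2

# Weighted mean: sum_i x_i h(x_i) / sum_i h(x_i)
expected_error = np.sum(midpoints * heights) / np.sum(heights)
print(expected_error)
```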

When comparing the models you want to have an unambiguous criterion for model choice. As discussed in the Full Stack Deep Learning Course, having different metrics about the model's performance is often useful for debugging it, gaining insight into it, etc., but for model choice and optimization you need to pick a single metric that you optimize. If you need to use multiple metrics, you should collapse them into a single metric (similarly to how we could use the expected value instead of the full histogram).
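As a rough illustration of collapsing several metrics into a single criterion (the metrics and weights below are arbitrary placeholders, not a recommendation):

```python
# Hypothetical per-network metrics on a held-out set.
mean_error = 180.0
p95_error = 900.0    # 95th percentile of the errors
max_error = 4200.0

# One possible single criterion: an (arbitrarily) weighted combination.
score = 0.7 * mean_error + 0.2 * p95_error + 0.1 * max_error
```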

Finally, the histograms you have shown us are not useful because they don't let us compare the distributions with enough precision. Notice that most of the histograms look nearly the same: they collect the vast majority of the errors in the first bin. To compare them, you would need to use more bins so that you can more easily differentiate between the small errors. Since the errors have long tails, it may be useful to plot the logarithm of the errors. On the other hand, doing so would "hide" the large errors, and you most likely care about them.
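If you want to try that log-scale view, a minimal matplotlib sketch (with placeholder data; `errors` stands in for the per-example errors of one network) would be:

```python
import numpy as np
import matplotlib.pyplot as plt

errors = np.random.lognormal(mean=3.0, sigma=1.5, size=1000)  # placeholder long-tailed errors

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(errors, bins=50)
axes[0].set_title("Raw errors (long tail)")
axes[1].hist(np.log10(errors), bins=50)
axes[1].set_title("log10(errors): small errors easier to compare")
plt.show()
```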

Tim

In terms of evaluating the performance of your neural network, you typically want to use some kind of estimate for a generalisation error, like hold-out validation (testing on a separate dataset) or K-fold cross-validation.

Assuming you use a separate dataset for comparison, the reasonable thing to do is to use the loss function you have used for training your neural network as an evaluation metric.
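For example, a minimal scikit-learn sketch of K-fold cross-validation with mean squared error as the evaluation metric (the model and data below are placeholders, to be replaced by your own network and dataset):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                        # placeholder features
y = X @ rng.normal(size=10) + rng.normal(size=500)    # placeholder targets

model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=1000)

# 5-fold cross-validation; scikit-learn returns negated MSE for "larger is better" scoring.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
print("Estimated MSE:", -scores.mean())
```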

The most common choices for loss functions are the mean squared error and cross-entropy.

Mean squared error is a typical choice for regression problems:

$$ MSE = \frac{1}{N} \sum_{i=1}^N (Y_i - \hat{Y}_i )^2 $$

where $Y_i$ is the true regression value and $\hat{Y}_i$ is your neural network's prediction.
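In code, this is a one-liner (placeholder arrays):

```python
import numpy as np

y_true = np.array([2.5, 0.0, 2.1, 7.8])    # placeholder true values
y_pred = np.array([3.0, -0.5, 2.0, 8.0])   # placeholder predictions

mse = np.mean((y_true - y_pred) ** 2)
```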

Cross-entropy is used for classification problems:

$$ C = - \sum_{c=1}^{M} y_{o,c} \log p_{o,c} $$

where $M$ is the number of classes, $y_{o,c}$ is a binary indicator of whether observation $o$ belongs to class $c$, and $p_{o,c}$ is the predicted probability that observation $o$ belongs to class $c$.
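A minimal NumPy sketch of that cross-entropy for a single observation $o$ (placeholder values; the true class is one-hot encoded):

```python
import numpy as np

y = np.array([0, 0, 1, 0])            # indicator y_{o,c}: the true class is index 2
p = np.array([0.1, 0.2, 0.6, 0.1])    # predicted class probabilities p_{o,c}

cross_entropy = -np.sum(y * np.log(p))  # equals -log(0.6)
```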

If you used one of the above two as your error metric, I would take the average of the errors in each histogram and choose the network with the minimum.

Otherwise, I would calculate the value of the loss function on your separate dataset and choose the neural network that gives the minimal error.
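Putting that together, a sketch of the selection step (assuming you have collected per-example errors on the held-out set for each network; the names and numbers below are made up):

```python
import numpy as np

errors_by_network = {
    "MLP_7": np.array([10.0, 35.0, 400.0]),
    "MLP_8": np.array([5.0, 20.0, 900.0]),
    "MLP_10": np.array([8.0, 15.0, 60.0]),
}

mean_errors = {name: errs.mean() for name, errs in errors_by_network.items()}
best = min(mean_errors, key=mean_errors.get)
print(best, mean_errors[best])
```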

boomkin
    Thank you. In this case, each network controls a robot performing a task, and the error is reported as above. If I sum up all the squared errors, this will give me a ranking of the networks, which is what I need. This will also inflate the large errors though. What if I think errors around 0-250 are acceptable, and an error of 2000 and an error of 10000 are both unacceptable, so there shouldn't be much difference in large errors. Does this mean the error I used for training is not quite good? Is there an error function with diminishing returns on high errors? – ovunctuzel Feb 06 '18 at 23:36
  • I think penalising large errors more heavily won't solve the problem. In my experience, what typically works is taking a look at the erroneous examples and considering some kind of extension of the training set based on them. As far as I know, there is no loss function with diminishing-returns behaviour, but every week somebody comes up with a new loss function or metric, or finds a heuristic for a previously intractable loss function. Try a few alternative loss functions and see if they change performance. – boomkin Feb 07 '18 at 15:29
  • I like model selection criteria, like AICc, to select a model. – EngrStudent Nov 30 '21 at 22:19
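As one illustration of the "try a few alternative loss functions" suggestion in the comments above, here is a sketch of the Huber loss, which grows linearly rather than quadratically for large residuals, so very large errors dominate less (the threshold `delta` is an arbitrary placeholder):

```python
import numpy as np

def huber(residuals, delta=250.0):
    """Quadratic for |r| <= delta, linear beyond it, so large errors grow more slowly."""
    r = np.abs(residuals)
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))

residuals = np.array([100.0, 2000.0, 10000.0])
print(huber(residuals))  # the two large residuals differ far less than under r**2
```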