Which error metric should I use for summed estimates?

Question

I built an ensemble regression model (Random Forest + KNN + SVM) to predict biomass based on environmental conditions (biomass is strictly positive but continuous).

I would now like to use this model to predict total global biomass. To get the total, I predict the biomass for each grid cell on a global map and then sum the predictions.

My question is: how do I attach an uncertainty to this summed estimate? I have calculated RMSE and MAE, but both are larger than the mean biomass. Simply adding error bars of ± MAE would therefore give a negative lower limit, which is not reasonable since biomass has to be strictly positive.

Are you looking for a prediction interval or quantile predictions, or for some confidence interval or similar on the error measure? On MAE vs. MSE, you may find this interesting. — Stephan Kolassa, Mar 19 '23 at 14:52
I'm looking for the prediction interval of my summed estimate (or, if more appropriate, of my mean biomass estimate). After a bit more reading, it looks like something like MAPIE (Model Agnostic Prediction Interval Estimator) could be suitable? Then I could estimate the PI for every grid point and take the mean of the lower bounds and of the upper bounds. — coccolith, Mar 20 '23 at 11:41
I am not familiar with MAPIE. I would expect biomass to have strong spatial correlations, and I certainly hope you are accounting for this in your modeling! If you have a simple predictive distribution (e.g., multivariate normal with a covariance matrix that encodes these spatial correlations), you can analytically derive the predictive distribution of total biomass. But if you have near-zeros, your case is likely not that simple. You may be able to use simulation. — Stephan Kolassa, Mar 20 '23 at 12:06
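A minimal sketch of the simulation idea from the last comment, under assumptions that are not in the thread: the per-cell errors are taken to be multiplicative and log-normal (so simulated biomass stays positive), their correlation is assumed to decay exponentially with distance, and the grid, `pred`, `sigma` and `corr_range` are all made-up illustrative values rather than anything estimated from real data. It only shows how an interval for the total could be read off simulated totals, and how much narrower that interval looks if spatial correlation is ignored.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 200 grid cells along a line with made-up point predictions.
n_cells = 200
coords = np.arange(n_cells, dtype=float)
pred = rng.gamma(shape=2.0, scale=5.0, size=n_cells)  # stand-in for model output

# Assumed error model (illustrative values only): log-normal multiplicative
# errors whose correlation decays exponentially with distance between cells.
sigma = 0.8
corr_range = 10.0
dist = np.abs(coords[:, None] - coords[None, :])
cov = sigma**2 * np.exp(-dist / corr_range)


def simulate_totals(cov, pred, n_sims, rng):
    """Draw correlated log-errors and return one simulated total per draw."""
    log_err = rng.multivariate_normal(np.zeros(len(pred)), cov, size=n_sims)
    # Subtract half the log-variance so each multiplier has mean 1,
    # i.e. pred is treated as the predictive mean of each cell.
    multipliers = np.exp(log_err - 0.5 * np.diag(cov))
    return (pred * multipliers).sum(axis=1)


totals_corr = simulate_totals(cov, pred, 5000, rng)
# Same marginal variances, but correlations set to zero for comparison.
totals_indep = simulate_totals(np.diag(np.diag(cov)), pred, 5000, rng)

lo_corr, hi_corr = np.percentile(totals_corr, [2.5, 97.5])
lo_ind, hi_ind = np.percentile(totals_indep, [2.5, 97.5])

print(f"sum of point predictions:      {pred.sum():.1f}")
print(f"95% interval, correlated:      [{lo_corr:.1f}, {hi_corr:.1f}]")
print(f"95% interval, independent:     [{lo_ind:.1f}, {hi_ind:.1f}]")
```

Because every simulated cell value is positive, the lower limit of the total cannot go negative. In this toy setup the interval under correlated errors is wider than the one under independent errors, which is why averaging or summing per-cell interval bounds is not a substitute for modeling the joint distribution.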


0 Answers