
I am trying to predict ambulance demand for the next hour, for a city area in the USA, based on previous demand, weather, large people gatherings, and similar spatio-temporal factors.

I have noticed some points where the features are 'usual' - e.g. a "sunny peaceful Sunday afternoon" - but the demand is unusually high, maybe because of a big traffic accident on which I don't have any data. Thus I decided to try outlier detection and get rid of those points where the target doesn't depend on the data I have, but rather on something else - e.g. an accidental fire that injures some people but can't be predicted.

I don't like univariate methods like the IQR and z-scores, because some high demands might actually depend on the data - bad weather, large gatherings which I have data on, etc. On the other hand, multivariate unsupervised methods might detect extremely bad weather as an outlier, since they are not aware of what the target variable is and work on the whole dataset trying to find irregular patterns. I can't use supervised methods that need an 'outlier' label, as I don't yet have an idea of what an outlier is - I need an algorithm for that, and that is my actual goal.

My question is: are there outlier detection methods that are aware of what the target variable is? For example, let's say I build a cluster of all sunny peaceful Sunday afternoons and inspect the demand. I might see $[1, 2, 0, 2, 1, 11, 2, 0]$. Then the $11$ is an outlier, because it belongs to the cluster of 'regular' Sunday afternoons, where the usual demand is around $1$.

Are there methods that do this already? If not, what would you recommend for me to do?
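For concreteness, the cluster-then-inspect idea above can be sketched roughly as follows. This is only a toy illustration using scikit-learn's KMeans plus a per-cluster IQR fence on the target; the two features, the two "regimes", and all numbers are made up for the example:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Purely synthetic stand-ins for the real features: two regimes,
# e.g. "calm sunny afternoons" vs "bad-weather / big-gathering hours".
calm = rng.normal([25.0, 100.0], [2.0, 20.0], size=(100, 2))
busy = rng.normal([5.0, 5000.0], [2.0, 500.0], size=(100, 2))
X = np.vstack([calm, busy])

# Demand: low in calm hours, high in busy hours, plus one
# unexplainable spike in a calm hour (index 0) -- the "big accident".
y = np.concatenate([rng.poisson(1.5, 100), rng.poisson(10.0, 100)]).astype(float)
y[0] = 25.0

# Cluster on the features only (the target is held out), then flag
# points whose demand is extreme relative to their own cluster.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

outliers = []
for c in np.unique(labels):
    idx = np.where(labels == c)[0]
    q1, q3 = np.percentile(y[idx], [25, 75])
    fence = q3 + 3.0 * (q3 - q1)  # a wide Tukey-style fence per cluster
    outliers.extend(idx[y[idx] > fence].tolist())

print(sorted(outliers))
```

The spike at index 0 is flagged even though its IQR fence is computed only among feature-similar hours, which is exactly the target-aware behaviour described above - high demand in the "busy" cluster is not flagged, because it is normal for that cluster.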

3 Answers


I would strongly urge you NOT to do this. It is cheating. Your results will be a lie, and your predictions will look far too good. If you use this for forecasting demand, you will go wrong. In particular, you will miss "black swans" and have far too few ambulances available.

Outliers are important. They should not be deleted in this sort of fashion.

Peter Flom
  • Thanks for the answer Peter! Might I ask some follow-up questions?
    1. I have seen outlier handling as a preprocessing step in many cases online. Are they all wrong, or are there actual cases where outlier handling is recommended?
    2. I would also like to do this as a part of EDA - proving that the use-case is hard and that some points don't depend on the data I have currently. Would you see this as a valuable step, and how would you approach it?
    – Nadir Bašić Dec 01 '23 at 13:35
  • 1) I won't say ALL are wrong, but, in general, yeah, doing this automatically is a big problem. 2) It could be valuable, but I think there might be easier ways to prove this point. Indeed, the fact that your model doesn't work perfectly is a sign of this, and the mere existence of outliers is another sign. It's easy to show that e.g. "On July 6 there was a huge demand" and then you could look at newspapers and see that it was due to some weird accident.
  • – Peter Flom Dec 01 '23 at 13:42
  • Thanks Peter for the clarification, I'll listen to your advice and drop the approach – Nadir Bašić Dec 01 '23 at 14:11
  • It is important to note outliers when they are found, but whether or not you remove them is an entirely different question - Peter and I agree that simple deletion is a mistake. – Shawn Hemelstrand Dec 02 '23 at 05:00
  • Coming from a predictive modeling perspective, removing outliers like this isn't cheating if it's only done on a training set, and any data used for validation is untouched. It's still probably not a great thing to do. Usually one would use some form of regularization to avoid over-fitting to noisy observations; such as using robust regression as suggested in the other answers. – Albert Steppi Dec 02 '23 at 20:56
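The robust-regression alternative mentioned in the last comment can be sketched as follows. This is only a minimal illustration, assuming scikit-learn's HuberRegressor and a made-up one-feature dataset; it is not the asker's actual pipeline:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(1)

# Synthetic illustration: demand grows linearly with one feature
# (say, crowd size), except for a few unexplainable spikes.
X = rng.uniform(0.0, 10.0, size=(200, 1))
y = 2.0 * X.ravel() + rng.normal(0.0, 0.5, 200)
y[:5] += 30.0  # a handful of "accident" hours

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)

# The robust fit stays close to the true slope of 2.0 without any
# points being deleted; OLS is distorted by the spikes instead.
print(ols.coef_[0], ols.intercept_)
print(huber.coef_[0], huber.intercept_)
```

The point of the design is that the outliers are downweighted by the Huber loss rather than removed, so the spikes still exist in the data (and in any validation set) but do not dominate the fit.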