
I have a statistics scenario with a fairly small number of samples and am trying to choose some of the inputs as appropriately as possible. The data can basically be thought of as having a date, a proximity to some physical center point, and a value. The decisions I am facing are:

  • the window size for a rolling geometric mean used to smooth the data, given its low volume and high variance. I was going to define the window by number of samples rather than by time period so that there is always sufficient data.
  • the maximum radius from the center point to consider (the higher this is, the fewer samples are required in the time dimension, and vice versa). This will also need optimising, as samples taken further from the center point ought to introduce higher variance, whereas taking more samples in the time dimension will introduce lag.
  • whether to weight the samples by their date (potentially using an EWMA), by their distance from the center point (I have just done this linearly at the moment, although exponential might be better suited), or by a combination of both. My concern here is that we could give too much influence to too few samples, as the dataset for a small time period and radius is very small and highly variable. I have sketched roughly what I mean by these three knobs just after this list.
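
To make those three knobs concrete, here is a minimal sketch of the kind of smoothing I have in mind, assuming the data sits in a pandas DataFrame with columns `date`, `distance` and `value`. All names, numbers and default parameters below are illustrative only, not taken from my actual dataset.

```python
import numpy as np
import pandas as pd


def weighted_geometric_mean(values, weights):
    """Geometric mean of strictly positive values, weighted, computed in log space."""
    return float(np.exp(np.average(np.log(values), weights=weights)))


def rolling_geo_mean(df, window=20, max_radius=5.0, date_halflife=10.0,
                     distance_weighting="linear"):
    """Rolling geometric mean over the last `window` samples within `max_radius`.

    - `window` is a number of samples rather than a time period
    - `date_halflife` gives an EWMA-style decay for older samples (in samples)
    - `distance_weighting` down-weights samples far from the center point,
      either "linear" or "exponential"
    """
    df = (df[df["distance"] <= max_radius]
          .sort_values("date")
          .reset_index(drop=True))
    smoothed = np.full(len(df), np.nan)
    age = np.arange(window - 1, -1, -1)        # 0 = newest sample in the window
    w_date = 0.5 ** (age / date_halflife)      # EWMA-style date weights
    for i in range(window - 1, len(df)):
        chunk = df.iloc[i - window + 1 : i + 1]
        dist = chunk["distance"].to_numpy()
        if distance_weighting == "linear":
            w_dist = np.clip(1.0 - dist / max_radius, 1e-6, None)
        else:                                  # exponential decay with distance
            w_dist = np.exp(-dist / max_radius)
        smoothed[i] = weighted_geometric_mean(chunk["value"].to_numpy(),
                                              w_date * w_dist)
    out = df.copy()
    out["smoothed"] = smoothed
    return out


# Toy usage with made-up numbers, just to show the shape of the data.
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-03", "2024-01-05", "2024-01-06", "2024-01-09"]),
    "distance": [0.4, 2.1, 0.9, 3.5],  # proximity to the physical center point
    "value": [12.0, 30.5, 9.8, 18.2],  # strictly positive, for the geometric mean
})
print(rolling_geo_mean(df, window=3, max_radius=3.0))
```

The linear distance weights fall to (almost) zero at `max_radius`, whereas the exponential option keeps a longer tail; that difference is part of what I want to compare.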

My plan was to grid search some of these combinations and evaluate the results. For example, a small window size will produce a noisy trend that closely follows the original data, whereas a large window will produce a smoother trend that does not. I was going to use something like the coefficient of determination (R²) between the two to evaluate how closely the rolling signal tracks the original. Alongside this I was going to measure how noisy the smoothed signal still is, through something like the coefficient of variation between its adjacent pairs.

If I plot these two measures against each other for different window sizes, I would expect to see something resembling a ROC curve, where one axis improves as the other degrades. I could then choose a sweet spot using a metric such as the distance to the ideal corner (although this weights the two criteria equally). My thinking is that these metrics would reward both low variance in the original data and low lag in the moving average, allowing me to discover whether it is more advantageous to include samples that are further away in space or further back in time.

My question: since I am obviously largely inventing a strategy here, is there something more standard, suitable or appropriate for evaluating the quality of my input data and the resulting moving average?
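
For concreteness, here is a rough sketch of that grid search and the two evaluation measures, reusing the `rolling_geo_mean` helper from the snippet above. The noise measure below (standard deviation of adjacent differences relative to the mean level) is just one possible stand-in for the adjacent-pair statistic I describe, and the parameter grids are made up.

```python
import numpy as np


def evaluate(smoothed_df):
    """Return (r2, noise) for one parameter combination."""
    ok = smoothed_df["smoothed"].notna()
    y = smoothed_df.loc[ok, "value"].to_numpy()
    s = smoothed_df.loc[ok, "smoothed"].to_numpy()
    # Coefficient of determination: how closely the smoothed series tracks the original.
    r2 = 1.0 - np.sum((y - s) ** 2) / np.sum((y - y.mean()) ** 2)
    # Remaining noise: spread of adjacent-step changes relative to the mean level.
    noise = np.std(np.diff(s)) / np.mean(np.abs(s))
    return r2, noise


def grid_search(df, windows=(5, 10, 20, 40), radii=(1.0, 2.0, 5.0)):
    """Evaluate every (window, radius) combination; weighting options could be added too."""
    results = []
    for w in windows:
        for r in radii:
            r2, noise = evaluate(rolling_geo_mean(df, window=w, max_radius=r))
            results.append({"window": w, "radius": r, "r2": r2, "noise": noise})
    return results


def sweet_spot(results):
    """Combination closest to the ideal corner (r2 = 1, noise = 0), equal weighting."""
    return min(results, key=lambda p: np.hypot(1.0 - p["r2"], p["noise"]))
```

One thing I am aware of: the two axes are not naturally on the same scale, so I would probably rescale them before taking the corner distance, otherwise the "equal weighting" is only nominal.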
