
I have a set of data that states the volume required in each of the next 12 months for 750 raw materials.

I would like to determine the variability in the demand for each material, and categorise the results into a low, medium, high format.

I have found some potential candidate statistical methods to measure this:

  1. Coefficient of Variation
  2. Interquartile Range / Median
  3. Median Absolute Deviation from the Median (/median)
  4. Mean Absolute Deviation (/mean)

What would be the best method to determine the variability?

The data for each material may not be normally distributed; however, the same method needs to be applied to them all. In many cases the data could have quite a lot of skew.

For the advised method, please state the prerequisites for using it and the advantages it has over the other methods, and provide guidance on how I would interpret the results in the format stated above. If you feel there is a better measure of variability I haven't listed, please feel free to suggest it.

A few additional items of information that will be useful:

  • I am using Excel to do the calculations

  • The chosen method and the methodology for interpretation need to be relatively simple, as they will be adopted by people who are not statistically minded

  • The calculation for variability will be based on a population size of 12

  • The data is on a ratio scale (0 is really 0)

Shaun
  • These are measures of relative variability. It's not obvious from your instructions that you wouldn't be better off with a measure of variability. – Nick Cox Jul 19 '19 at 11:24
  • I would like to use a relative-variability method because I want to be able to compare the results for each material, to see if Material X has more variation than Material Y. The volumes can range greatly, from 50kg to 100,000kg, so straight measures of variability like range, interquartile range, variance and standard deviation will show huge differences – Shaun Jul 19 '19 at 12:15
  • Indeed; that's not contradicting my point. For some purposes (e.g. warehouse storage) variability of small amounts could be trivial. But, to take this forward, note that your title and your question don't make this clear -- which is especially important if this thread is to be of use to others. As variability of volumes (by which apparently you mean weights or masses) is not what you want, the question might be better revised. – Nick Cox Jul 19 '19 at 12:23
  • I'm using the high/med/low status to drive a safety stock setting recommendation, so for my purposes, high variability of even low volumes is still very important as shortage of any material will cause production delays. I'm unable to edit the title but what do you think it should be to get maximum benefit? – Shaun Jul 19 '19 at 12:46
  • You really should be able to edit the title. I'd add a question mark, cut "potentially" and insert "relatively". – Nick Cox Jul 19 '19 at 12:57
  • If you measure relative variability (how best to do it for your purpose being an open question) then degrading a measure to low, medium, high is a backwards step. That's like measuring height and then dividing people into short, medium and tall. More positively, you already know several methods, so you can compare them to see if they produce the same ranking. – Nick Cox Jul 19 '19 at 13:00
  • Shaun - if you were looking at change in central tendencies of things with very large differences in magnitudes, and a 10% change in a small thing was as important as a 10% change in a big thing, then one useful measure would be the geometric mean. There is a related measure called the geometric standard deviation that might suit your needs very cleanly. If you could provide some "dummy data", an example could be made to demonstrate this. It is easy to explain to bosses. – EngrStudent Jul 19 '19 at 13:36
  • Shaun, sorry for the double-comment, but I just saw that you were new to CV. Welcome to CV! Some folks here get down to math first, and then notice humans after they have posted a comment or two. It is good for high quality technical results. – EngrStudent Jul 19 '19 at 13:42
  • Thanks Nick and EngrStudent for your help so far. I will research the geometric mean. Nick, I would disagree with your height comparison: to categorise people into short, medium and tall is a valid reason to measure people in the first place. Having a relative-variability figure on its own doesn't help me much unless I can say that this figure sits within a category range and, together with the lead time, it equates to X safety stock setting. I plan to take all the figures of variability and plot them on a distribution graph to determine what is low (<25% of values), medium (25–75%) and high (>75%). – Shaun Jul 19 '19 at 14:29
  • I have been reading forums and articles all week but I am not clear on which method is best for my data. Some say that the CV shouldn't be used for data which isn't normal. I'm under the assumption that the four methods above are equally able to determine variability, but the prerequisites to use them differ. There is no way to determine which method is best by comparing the results of running them all, as the results are subjective; therefore the best I can do is to ensure I have used the most statistically accurate method for my data set, and I'm just not clear on what that is. – Shaun Jul 19 '19 at 14:41
  • Where you did read that about CV? It is nonsense. Indeed, the mean of a normal could be zero or negative which fact alone would make the CV useless. There are several threads here giving better advice. More generally, I can't gauge this easily as you seem to be advising non-statistical people yet in doubt on fundamentals. There is no criterion of "most statistically accurate" here independent of your data (which as @EngrStudent urged we would benefit from seeing) and your goals, which are fairly clear now but understandably aren't precise enough to indicate a best solution. – Nick Cox Jul 19 '19 at 18:30
  • Hi Nick, I read it here: https://statistics.laerd.com/statistical-guides/measures-of-spread-standard-deviation.php — "the standard deviation, like the mean, is normally only appropriate when the continuous data is not significantly skewed or has outliers". The means for all materials are positive and all data is on a ratio scale, so the CV could be used, but I was apprehensive because the data for each material may not be normal and could be skewed. – Shaun Jul 21 '19 at 18:14
  • I meant I wanted the most statistically accurate method for my data, not the best one independent of it; I understand that is not a question that has an answer. The goal is to categorise the variability for each SKU, be able to compare the results from material to material and, together with the lead time, assign a safety stock setting. I could have more categories, but I wanted to see what the data indicated before making that decision; my initial thinking was to have three. – Shaun Jul 21 '19 at 18:17
  • That’s advice about the SD not the CV. Even in that case the warning is exaggerated. Threads here explain that the use of CV makes most sense whenever a logarithmic scale is appropriate. – Nick Cox Jul 21 '19 at 21:30
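The geometric standard deviation that EngrStudent mentions above is straightforward to compute. A minimal sketch in R, using made-up dummy volumes rather than Shaun's actual data:

```r
# Geometric SD: exponentiate the standard deviation of the logged data.
# Requires strictly positive values, which demand volumes on a ratio scale
# satisfy as long as no month is zero.
gsd <- function(x) exp(sd(log(x)))

demand <- c(50, 60, 55, 70, 65, 80, 75, 90, 85, 100, 95, 110)  # dummy monthly volumes
gsd(demand)
```

The result is a unitless multiplicative factor: 1 means no variability, and it is unchanged if every volume is multiplied by a constant, so a 50kg material and a 100,000kg material are directly comparable.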

3 Answers


Gini's mean difference is always interpretable, and it is efficient and robust. It is defined as the mean, over all possible pairs of different observations, of the absolute difference between them. I haven't seen it studied under skewed distributions, but I'll bet it works as well as anything. If you happen to have a Gaussian distribution it is 0.98 times as efficient as the standard deviation, but it is far more interpretable than the SD. Gini's mean difference is standard output of the describe function in the R Hmisc package, which uses a quick computational formula in the GiniMd function. See this for more information.
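As a sketch of the definition above (my illustration, not code from the answer), Gini's mean difference can be computed directly from the pairwise absolute differences; dividing the result by the mean would give a relative version, if that suits the comparison across materials:

```r
# Gini's mean difference: the mean of |x_i - x_j| over all pairs of distinct
# observations. This naive O(n^2) version is fine for n = 12 months.
gini_md <- function(x) {
  n <- length(x)
  sum(abs(outer(x, x, "-"))) / (n * (n - 1))
}

x <- c(1, 2, 3)
gini_md(x)  # (|1-2| + |1-3| + |2-3|) / 3 = 4/3
```

This should agree with Hmisc::GiniMd, which uses a faster computational formula.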

Frank Harrell

First, the only correct answer here is "none of them are best, it depends on what you mean by 'variability'". Each method will have advantages and disadvantages. Are you primarily interested in large deviations and outliers? Or not? (See below)

Second, you can't have data on what's going to happen for the next 12 months. You might have estimates of that, but you can't have data.

Third, as Nick pointed out in a comment, if you are going to take the results and then categorize them, you are taking a step backwards and, if it's going to be as few as three categories, you are probably eliminating any differences across at least most of the methods you mention.

Fourth, I wouldn't use Excel for data analysis, myself. It's very limited.

Let's compare some of the methods using R on a few sets of data. I suggest you take these as illustrations and try your own data, to see which measures suit your purposes:

x1 <- c(1, 2, 3, 4, 5, 6, 7, 8,  9, 10, 11, 100)  # Single outlier
x2 <- c(1, 1, 1, 2, 2, 2, 3, 4, 10, 11, 20, 20)   # Repeats, with two high values
x3 <- c(rep(1, 11), 100)  # 11 identical values, one outlier

#Coefficient of Variation

sd(x1)/mean(x1) #1.98
sd(x2)/mean(x2) #1.12
sd(x3)/mean(x3) #3.09

#Interquartile Range / Median
IQR(x1)/median(x1) #0.85
IQR(x2)/median(x2) #3.4
IQR(x3)/median(x3) #0

#Median Absolute Deviation from the Median (/median)
mad(x1)/median(x1) #0.68
mad(x2)/median(x2) #0.89
mad(x3)/median(x3) #0

#mean absolute deviation (/mean)
install.packages("lsr")  # run once if lsr is not already installed
library(lsr)

aad(x1)/mean(x1) #1.04
aad(x2)/mean(x2) #0.92
aad(x3)/mean(x3) #1.65

Note that IQR/median and MAD/median both score x3 as zero, even though one of its twelve values is 100 times the others; the robust measures deliberately discount that outlier, while the CV and the mean absolute deviation do not. Is that what you want? Suppose you are ordering supplies for the next 12 months and you don't want to run out. Then outliers are exactly what matters, and a measure that reports zero variability for x3 is not very good for your purpose.
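If, after these caveats, you still want low/medium/high labels, one possible scheme (my sketch, using the 25%/75% breakpoints Shaun proposed in the comments, applied to simulated scores rather than real data) is a quantile-based cut:

```r
# Hypothetical: 750 relative-variability scores, one per material.
set.seed(1)
scores <- runif(750, 0, 3)  # stand-in for CVs or similar

breaks <- quantile(scores, probs = c(0, 0.25, 0.75, 1))
category <- cut(scores, breaks = breaks,
                labels = c("low", "medium", "high"),
                include.lowest = TRUE)
table(category)  # about 25% low, 50% medium, 25% high
```

Bear in mind that this only ranks materials against each other: a quarter of them will be labelled "high" no matter how variable the demand actually is.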

Nick Cox
Peter Flom

It's not possible to define a metric that will perform best in all possible cases.

Consider this excellent example by @Silverfish, in which the minimum, first quartile, median, third quartile, and maximum are all identical across distributions that are nevertheless very different.

Or Anscombe's quartet, which has essentially identical variances across 4 very different patterns:

[Figure: Anscombe's quartet — four scatterplots with essentially identical summary statistics but very different patterns]
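The point is easy to check with R's built-in anscombe data frame:

```r
# The variances of the four x-columns are exactly equal, and the four
# y-columns agree to about two decimal places, despite the wildly
# different scatterplots.
data(anscombe)
sapply(anscombe[, c("x1", "x2", "x3", "x4")], var)  # all exactly 11
sapply(anscombe[, c("y1", "y2", "y3", "y4")], var)  # all about 4.13
```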

mkt