A straightforward question: when is it "favorable" to interpret visualizations using, say, boxplots or histograms/density estimates, versus empirical cumulative distribution functions (ECDFs)?

I've always relied on histograms and density estimates for getting a sense of data distributions, and on boxplots for comparing distributions. However, these methods summarize the data and so discard information; ECDFs do not. ECDFs have several advantages: they are non-parametric, make it easy to identify quantile values and outliers, reveal stochastic orderings at a glance, do not depend on tuning parameters, exist where densities do not (e.g. variables that mix discrete and continuous parts), and so on.
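To illustrate the mixed discrete/continuous point, here is a minimal sketch with a made-up zero-inflated sample. The 30% share of exact zeros, the exponential part, and the names `z` and `Fz` are assumptions for illustration only. A density estimate has trouble with the point mass at zero, whereas the ECDF simply shows it as a jump:

```r
set.seed(1)                                # arbitrary seed, for reproducibility only
z  <- c(rep(0, 300), rexp(700, rate = 1))  # hypothetical data: 300 exact zeros, 700 continuous values
Fz <- ecdf(z)                              # step function, no tuning parameter needed

Fz(0)                                      # height of the jump at zero = proportion of exact zeros
plot(Fz, main = "ECDF of a zero-inflated sample")
```

The height of the step at zero can be read directly as the share of zeros in the sample.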

So, again, when should I prefer other techniques over ECDFs and why aren't ECDFs more common/standard in data visualization? Maybe my latter claim is false, but from experience, ECDFs aren't all that common in data visualization.

Relevant discussions and article:

  1. If the sample size is > 100, which graphical summarization is the best?

  2. What does a Barplot, a Boxplot and eCDF represent?

  3. Are CDFs more fundamental than PDFs?

  4. Why we love the CDF

Nick Cox
fool
  • Much depends on (1) the field in which you work (2) what you regard as the outcome or response to describe and interpret. For many, plotting the survival function (converse or complementary distribution function) is utterly standard, e.g. in medical or biostatistics (patients, animals, ...) or in engineering statistics (lifetimes of products or machines or parts). If ECDF plot means plotting cumulative probability on the y axis, note that plotting it on the x axis in principle is exactly equivalent, but in practice can be more congenial or convenient (as quantile plots or under other names). – Nick Cox Apr 13 '22 at 08:35
  • It's lame but still accurate to say: Whatever works best for your data and problem. For example, boxplots with median and quartiles and extra stuff are often oversold -- even used with ANOVA without even adding means -- but if the job is to compare say 10 or 30 distributions it is hard to beat 10 or 30 box plots side by side and 10 or 30 distribution functions could usually be just spaghetti. But see the device discussed in https://stats.stackexchange.com/questions/190152/visualising-many-variables-in-one-plot in which each subset is plotted in turn with the others as backdrop. – Nick Cox Apr 13 '22 at 08:45
  • @NickCox sorry I didn't see your comment until many months later (now). I still found your comment helpful, and I've learned/experienced the essence of "whatever works best for your data and problem" over the months as well. Thanks :) – fool Nov 17 '22 at 01:02

1 Answer

Consider the following sample of a thousand observations from a gamma distribution with mean $\mu = 15$ and median $\eta \approx 13.37,$ with numerical summaries below.

set.seed(2022)
x = rgamma(1000, 3, 0.2)
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.236   8.673  13.495  15.013  19.939  53.914

Important advantages of the ECDF are that it does not depend on arbitrary choices and does not lose information. Often the ECDF of a sample of moderate or large size seems to approximate the population CDF more closely than a histogram approximates the population density. In the figure below, the choice to use ten bins for the histogram is arbitrary. Someone familiar with ECDFs will have no trouble finding the approximate median and quartiles.

[Figure: histogram of x with the gamma density overlaid (left); ECDF of x with the gamma CDF overlaid (right)]

R code for figure above:

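set.seed(2022)
x = rgamma(1000, 3, 0.2)   # shape 3, rate 0.2
par(mfrow=c(1,2))          # two panels side by side
 # left: histogram on the density scale, with the gamma density overlaid
 hist(x, prob=TRUE, col="skyblue2", ylim=c(0,.06))
  curve(dgamma(x,3,.2), add=TRUE, lwd=3,
        lty="dotted", col="brown")
 # right: ECDF, with the gamma CDF overlaid
 plot(ecdf(x), col="blue")
  curve(pgamma(x,3,.2), add=TRUE, lwd=3,
        lty="dotted", col="brown")
par(mfrow=c(1,1))          # restore the single-panel layout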
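As a rough sketch of how the median and quartiles can be read off an ECDF, the code below draws horizontal guides at 0.25, 0.50 and 0.75 and drops vertical lines where they meet the curve. The names `p`, `Fn` and `q` are mine, and using `type = 1` in `quantile()` (the inverse of the ECDF) is my choice here; other quantile types would give slightly different values.

```r
set.seed(2022)
x  <- rgamma(1000, 3, 0.2)        # same sample as above
p  <- c(0.25, 0.50, 0.75)         # cumulative probabilities of interest
Fn <- ecdf(x)                     # empirical CDF as an R function
q  <- quantile(x, p, type = 1)    # type 1: smallest observation with Fn(obs) >= p

plot(Fn, col = "blue")
abline(h = p, lty = "dashed")     # guides on the probability axis
abline(v = q, lty = "dashed")     # corresponding quartiles on the x axis
q                                 # lower quartile, median, upper quartile
Fn(q)                             # each value is at least the matching p
```

Going the other way, `Fn(20)` (for example) gives the fraction of observations at or below 20.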

However, histograms are familiar to many non-statisticians. They clearly show the approximate mode of the sample and show whether the sample is skewed. Also, the balance point of a histogram gives a good intuitive idea of the mean of a sample.
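To make the balance-point idea concrete, here is a small sketch that marks the sample mean and median on the histogram. The colours and line styles are arbitrary choices of mine. For this right-skewed sample the mean sits to the right of the median, consistent with the numerical summary above.

```r
set.seed(2022)
x <- rgamma(1000, 3, 0.2)                      # same sample as above
hist(x, prob = TRUE, col = "skyblue2")
abline(v = mean(x), lwd = 2, col = "brown")    # sample mean, roughly the balance point
abline(v = median(x), lwd = 2, lty = "dashed",
       col = "darkgreen")                      # sample median
c(mean = mean(x), median = median(x))
```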

Boxplots show skewness clearly and explicitly show the median and quartiles of the sample. They also call attention to outliers. But they do not give an intuitive idea of the location of the mean. Ordinarily, boxplots should not be used for samples with fewer than about ten observations.

[Figure: horizontal boxplot of x]

boxplot(x, horizontal=TRUE, col="skyblue2")
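The numbers behind the boxplot can be inspected with `boxplot.stats()`, which computes the statistics used for drawing. The sketch below assumes the default whisker rule (`coef = 1.5`); the name `bs` is mine.

```r
set.seed(2022)
x  <- rgamma(1000, 3, 0.2)   # same sample as above
bs <- boxplot.stats(x)       # default coef = 1.5 for the whisker rule

bs$stats        # lower whisker end, lower hinge, median, upper hinge, upper whisker end
length(bs$out)  # number of points that would be drawn individually as outliers
mean(x)         # the mean is not displayed by a default boxplot
```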

Dotplots and stripcharts work well for small samples but can get very cluttered for large samples, such as the one illustrated above. By contrast, histograms and ECDFs can be almost useless for very small datasets.

[Figure: stripcharts of x, drawn with tick marks (top) and with jittered points (bottom)]

par(mfrow=c(2,1))
 stripchart(x, pch="|")                   # one tick mark per observation
 stripchart(x, pch=20, method="jitter")   # jittered points to reduce overplotting
par(mfrow=c(1,1))
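For contrast, here is a sketch with a deliberately tiny sample. The size of eight and the names `y` and `Fy` are assumptions for illustration. The stripchart shows every observation, while the ECDF is a coarse staircase with steps of height 1/8.

```r
set.seed(2022)
y  <- rgamma(8, 3, 0.2)      # hypothetical small sample from the same gamma distribution
Fy <- ecdf(y)

par(mfrow = c(1, 2))
 stripchart(y, pch = 20, method = "stack")   # every observation is visible
 plot(Fy, main = "ECDF, n = 8")              # only eight steps, each of height 1/8
par(mfrow = c(1, 1))

Fy(min(y))                   # one step up from zero
```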

This is hardly a complete discussion of graphical descriptions of data. However, just from these few plots, I think it is fair to conclude that each of these graphical summaries of a dataset has advantages and disadvantages. Which one(s) to use depends on the sample size, on the properties of the sample that need to be emphasized in a given situation, and on the audience.

BruceET