Histogram - what constitutes grouped data?

Question

One of the questions on my course asked us to identify what type of data a histogram is used for. Two of the options were continuous data and grouped data. The correct answer was continuous, as this is the form the original data are in. Everybody else got this right, so I’m struggling to see where I went wrong. Suppose you had a simple histogram in which each bin represents a single integer rather than a range. If you put all the single integers into their respective bins, aren’t you grouping that data together to create a frequency count? Or is it only classed as grouped data if a bin represents a range of the available data? So if your histogram showed the number of chocolate bars eaten per day, with bins for 1, 2, 3, 4 and 5, and you put the count for each number in its bin, that isn’t classed as grouping? It only becomes grouped data if the bins represent more than a single integer in this case?

Thanks very much

Welcome to CV. Note that your username, identicon, & a link to your user page are automatically added to every post you make, so there is no need to sign your posts. In fact, we prefer you don't. — kjetil b halvorsen, Dec 13 '20 at 21:20

David M. · Answer 1 · 2020-12-13T23:05:44.433

The differences between bar charts and histograms are as follows:

Bar charts represent categorical data (e.g. hair colour, countries, etc.) or discrete data (e.g. number of siblings in a family). The height (for a vertical chart) or width (for a horizontal chart) of each bar represents the frequency/count/sum/average of values in the corresponding category. There is usually a space between adjacent bars. When the data are categorical, there is no natural ordering of the bars.
Histograms represent continuous data (e.g. time, distance, etc.); you first have to bin the data (which requires a somewhat arbitrary choice of bin width; different widths will produce different histograms for the same data) and the height/width of each bar represents the count of values in the corresponding bin. Adjacent bars usually touch each other. As the data are numerical, they are naturally ordered (usually from lowest to highest).

It follows that for bar charts you need two types of data: the categories (e.g. countries) and the data points (e.g. population in each country), whereas histograms only require the data points (e.g. racing times), which you bin and then count.
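As a rough illustration of that difference in inputs, here is a minimal R sketch. The country labels, populations and racing times are all made up for the example, and the one-second breaks are an arbitrary choice.

```r
# Hypothetical data, invented purely for illustration.

# Bar chart: you supply the categories AND one value per category.
population <- c(A = 5, B = 12, C = 8)   # made-up populations (millions)
barplot(population, xlab = "Country", ylab = "Population (millions)")

# Histogram: you supply only the data points; hist() bins and counts them.
times <- c(9.8, 10.1, 10.4, 10.6, 10.9, 11.2, 11.3, 11.8, 12.4, 12.9)
h <- hist(times, breaks = seq(9, 13, by = 1), xlab = "Racing time (s)")

# Counts per bin (9,10], (10,11], (11,12], (12,13]
h$counts
```

With these invented times the bin counts are 1, 4, 3 and 2; a different choice of breaks would give a different picture of the same data.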

The difference between discrete and continuous data is not always clear-cut, and therefore which of these two charts should be used can be a matter of judgement. Imagine for instance that you would like to represent the distribution of children from the age of 1 to 4 in a given country: you can draw a bar chart where each bar represents an age (1, 2, 3 and 4) and the height of the bar represents the count of children of each age. Now imagine that you wish to create a chart for all ages, from 1 to 120. Then drawing one bar per year could become impractical and you may want to bin the years into age groups (e.g. 1 to 10, 11 to 20, etc.) and count the occurrences of people in each group; in this case you would use a histogram.
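The age-group binning described above could be sketched in R as follows. The ages are invented, and the breaks at 0, 10, ..., 120 are just one way to obtain the groups 1 to 10, 11 to 20, and so on.

```r
# Hypothetical ages in whole years, invented for illustration.
ages <- c(1, 3, 7, 10, 11, 15, 20, 21, 34, 35, 47, 58, 62, 79, 83, 101, 120)

# hist() uses right-closed intervals by default, so breaks at 0, 10, ..., 120
# give the groups (0,10], (10,20], ..., (110,120], i.e. ages 1-10, 11-20, ...
h_age <- hist(ages, breaks = seq(0, 120, by = 10), plot = FALSE)

# One count per ten-year age group instead of one bar per year
h_age$counts
```

Twelve bins replace what would otherwise be up to 120 separate bars, at the cost of no longer seeing individual years.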

To use your example, if each bar represents the number of chocolate bars eaten in a day (one bar for each of the values 1, 2, 3, 4 and 5), you create a bar chart. If, on the other hand, you bin the data (e.g. you create a group for 1 to 2 chocolate bars, and another one for 3 to 5 chocolate bars), you then create a histogram.
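In R terms, the two treatments of the chocolate-bar data might look like this. The observations are invented, and the labels "1-2" and "3-5" are simply names chosen for the two groups.

```r
# Hypothetical numbers of chocolate bars eaten per day (invented data).
bars <- c(1, 1, 2, 2, 2, 3, 3, 4, 5, 5)

# Ungrouped: one count per distinct integer value (what a bar chart shows).
per_value <- table(bars)
per_value

# Grouped: values pooled into the ranges 1-2 and 3-5 before counting.
grouped <- table(cut(bars, breaks = c(0, 2, 5), labels = c("1-2", "3-5")))
grouped
```

The first table keeps every value distinct; the second can no longer tell 3 from 5, which is the sense in which the data have been grouped.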

It only becomes grouped data if the bins represent more than a single integer in this case?

Yes.

Slight disagreement in semantics hinges on the issue that distinctions between continuous and discrete data are not 'clear-cut'. So-called continuous data must be rounded in practice, and thus technically might then be called discrete. (+1) Anyway. — BruceET, Dec 13 '20 at 22:12
Usually the areas of histogram bars represent counts: not the height or width. This distinction may be crucial to appreciating the point of the original question. — whuber, Dec 13 '20 at 22:31
@BruceET Fair point; the dual nature of data can arise from either (1) discrete data being seen as continuous because of large number of values (e.g. cell counts) or (2) continuous data being considered discrete (because of rounding). — David M., Dec 13 '20 at 22:57
@whuber Good point; I assumed equal ranges, which is not necessarily the case. — David M., Dec 13 '20 at 23:00
Thanks very much for the detailed answers. I think maybe my problem arises from the difference between continuous and discrete data as pointed out. I think of continuous data as capable of taking on any value within a range, with finer and finer increments if wanted. At some point this data must be rounded in order to go into a bin, even if this happens at the point of measurement. Histograms are jagged as they are an approximation of the curve that is often overlaid. If you have good continuous data and you don’t want to group, why not plot frequency against the actual value in a scatterplot? — Geoff, Dec 14 '20 at 00:11
If the data are continuous, you are likely to have very few occurrences of each value so your scatterplot will be pretty noisy and it will be difficult to spot patterns - hence the benefit of aggregation using histograms for instance. If you want a non-jagged plot, you can compute the Kernel Density Estimator from your data (though the exact shape will depend on your choice of bandwidth). — David M., Dec 14 '20 at 10:08

BruceET · Answer 2 · 2020-12-13T22:39:15.757

IMHO: This kind of vaguely stated multiple-choice question depends heavily on reading the definitions, examples, and guidelines in the material presented just prior to the question. You may find specificity there that is lacking in the question itself.

Continuous data. In defense of the purported 'correct' answer: In practice, it is generally true that histograms are used for continuous data. The histogram bins are chosen in order to make an attractive and useful histogram. Often the goal is for the shape of the histogram to suggest the shape of the population density.

Suppose you have a sample of size $n=200$ from the continuous distribution $\mathsf{Norm}(\mu = 50, \sigma = 7).$ Then here are some possible histograms:

set.seed(1213)
x = rnorm(200, 50, 7)
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  31.61   45.46   50.83   50.28   55.45   72.58

The histogram at upper-left uses the 'default' binning, according to an algorithm in R; for the second, I used the br parameter (for 'breaks') to request fewer bins; for the two at the bottom, I specified precisely the breaks I wanted to use. Tick marks along the horizontal axis (made by rug), show exact data values--to the extent allowed by the graphics available. All four histograms show the population normal density function as a red curve.

Usually, $n=200$ observations is not quite enough to get a really good imitation of the density function. However, in this example, R's default seems to work best.

R code for figure above:

par(mfrow=c(2,2))
hist(x, prob=T, col="skyblue2");  rug(x)
 curve(dnorm(x, 50,7), add=T, col="red")
hist(x, prob=T, br=5, ylim=c(0,.06), col="skyblue2"); rug(x)
 curve(dnorm(x, 50,7), add=T, col="red")
hist(x, prob=T, br=seq(30,75,by=4), col="skyblue2"); rug(x)
 curve(dnorm(x, 50,7), add=T, col="red")
hist(x, prob=T, br=seq(30,75,by=2), col="skyblue2"); rug(x)
 curve(dnorm(x, 50,7), add=T, col="red")
par(mfrow=c(1,1))

More general use of histograms. By contrast, many statisticians will be surprised to be told that discrete or categorical data should never be shown in histograms. [Often 'bar charts' are used for categorical data, but histograms are often used for discrete data (especially, when there are many possible values), and for ordinal categorical data.]

Suppose that the sample y consists of $n = 100$ Likert-7 scores, and sample z consists of $n = 500$ realizations of $\mathsf{Binom}(n=100, p = 0.4).$

set.seed(2020)
y = sample(1:7, 100, rep=T, p=c(1,2,3,4,3,2,1)/16)
table(y)
y
 1  2  3  4  5  6  7 
 7 10 24 26 18 10  5
z = rbinom(500, 100, .4)
summary(z)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  25.00   36.00   40.00   39.73   43.00   55.00

The histogram in the left-hand panel uses a 'Frequency' vertical scale. In the right-hand panel red dots show exact binomial probabilities (two per bin). In a 'density' histogram (density on the vertical axis, using parameter prob=T) the total area of all histogram bars is $1.$ This is not necessarily true for a 'frequency' histogram (frequency on the vertical scale, and potentially frequency counts labelling each bar).
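The statement about areas can be checked numerically. The sketch below draws a fresh binomial sample of the same kind as z (not necessarily the identical values, since the seed is used differently here) and sums bar areas on both scales; the variable names are my own.

```r
set.seed(2020)
z <- rbinom(500, 100, 0.4)

# Same breaks as for the right-hand panel, but without drawing anything
h <- hist(z, breaks = seq(-0.5, 100.5, by = 2), plot = FALSE)
widths <- diff(h$breaks)

area_density   <- sum(h$density * widths)  # total bar area, density scale
area_frequency <- sum(h$counts  * widths)  # total bar area, frequency scale
c(density = area_density, frequency = area_frequency)
```

On the density scale the areas sum to 1, whereas on the frequency scale they sum to the sample size times the bin width (here 500 × 2 = 1000).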

R code for second figure:

par(mfrow=c(1,2))
 hist(y, label=T, br=(0:7)+.5, ylim=c(0,35), col="skyblue2")
 k = 0:100;  pdf=dbinom(k,100,.4)
 hist(z, prob=T, br = seq(-.5,100.5,by=2), col="skyblue2")
  points(k, pdf, pch=20, col="red")
par(mfrow=c(1,1))
