
While browsing for information on how I might plot a fitted normal curve over a histogram, I found the following:

http://www.statmethods.net/graphs/density.html

There is a line I don't fully understand, though I recognize that it really does work:

yfit <- yfit*diff(h$mids[1:2])*length(x)

Here, yfit is initially a list of values drawn from the pdf of an inferred normal distribution at regular intervals along the x-axis, length(x) is the number of observations in a list x from which a histogram was prepared, and diff(h$mids[1:2]) is the difference between the midpoints of the second and first bars of said histogram on the x-axis. After this statement is run, yfit becomes itself multiplied by those other two terms.
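For concreteness, the setup described above can be sketched as follows. The simulated data, the seed, and the name xfit are assumptions for illustration only; they are not taken from the linked page.

```r
# Illustrative stand-in data (assumption: any numeric vector would do)
set.seed(42)
x <- rnorm(200, mean = 20, sd = 6)

# Frequency histogram; hist() returns the bin information invisibly
h <- hist(x, breaks = 10, main = "Histogram with fitted normal curve")

# Evaluate the fitted normal pdf at regular intervals along the x-axis
xfit <- seq(min(x), max(x), length.out = 40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))

# The line in question: rescale density heights to the count scale
yfit <- yfit * diff(h$mids[1:2]) * length(x)

# Overlay the rescaled curve on the histogram
lines(xfit, yfit, lwd = 2)
```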

I understand that multiplying by length(x) makes sense, since it turns values of a probability density function into numbers of observations around each respective value, bearing in mind that the pdf is continuous and the probability of an observation at any single point is zero.

I don't understand why it is necessary to multiply by diff(h$mids[1:2]) to get the right outcome in the graph, although I can confirm that it does get the right outcome.

Does anyone have an explanation?

  • Perhaps you will find this question answered at http://stats.stackexchange.com/questions/4220 or even http://stats.stackexchange.com/questions/133369. If not, then please edit it to explain what the terms in this code mean: it's important that your question be understandable on its own without requiring readers to visit another site. – whuber May 26 '16 at 20:30
  • I believe it is precisely about not understanding that yfit is a density and that the "histogram" you mention is not a histogram at all, but rather is a bar chart (showing frequencies rather than frequency densities). I see nothing R-specific about this procedure, which is a standard one. – whuber May 26 '16 at 20:54
  • So do you know why diff(h$mids[1:2]) is needed? – readyready15728 May 26 '16 at 20:55
  • Think of it this way: yfit gives the heights of rectangles. diff(h$mids[1:2]) gives their bases. The product gives their areas. The so-called "histogram" is plotting areas (that is, frequencies) by means of bars whose heights represent the areas. So it all comes down to the formula for the area of any rectangle, area = base * height. This is explained in the links I first provided. – whuber May 26 '16 at 20:59
  • "diff(h$mids[1:2])*length(x)" - is the same as doing "h$counts/h$density". It is a multiplier which takes yfit (which is a density distribution) and scales it to frequencies exhibited in your data. – Mina May 26 '16 at 21:04
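The identity mentioned in the last comment can be checked numerically. This is a minimal sketch on simulated data (an assumption), and it relies on hist() having produced equal-width bins; empty bins are skipped because they would give 0/0.

```r
# Illustrative stand-in data (assumption)
set.seed(42)
x <- rnorm(200, mean = 20, sd = 6)

# Compute the histogram without drawing it
h <- hist(x, plot = FALSE)

# Scaling factor used in the question: bin width times sample size
scale_factor <- diff(h$mids[1:2]) * length(x)

# counts / density for every non-empty bin
nonempty <- h$counts > 0
ratio <- h$counts[nonempty] / h$density[nonempty]

# TRUE when every non-empty bin gives the same factor
same_factor <- isTRUE(all.equal(ratio, rep(scale_factor, sum(nonempty))))
print(same_factor)
```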

1 Answer


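The histogram here is a bar chart of counts. Multiplying the density height (yfit from dnorm) by the base diff(h$mids[1:2]), that is, the bin width, gives the area of a rectangle, which approximates the probability that an observation falls in that bin. Multiplying this probability by the number of observations gives the expected frequency of occurrence, so the final yfit is on the same scale as the bar heights. The classical formula is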

$\text{prob} = \frac{\text{freq occurrence}}{\text{total possible occurrence}}$

Here is the mapping to the classical formula, rearranged as freq occurrence = prob * total occurrences:

yfit             =     yfit * diff(h$mids[1:2]) *   length(x)
freq occurrence  =     probability area         *   total occurrences
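The mapping can be illustrated numerically by comparing the rectangle approximation with the exact expected count per bin under the fitted normal. This is a sketch under stated assumptions: the data are simulated, the bin width of 5 is chosen arbitrarily, and the names expected_exact and expected_approx are introduced here for illustration.

```r
# Illustrative stand-in data (assumption)
set.seed(1)
x <- rnorm(1000, mean = 50, sd = 10)

# Equal-width bins of width 5 that cover the whole sample
breaks <- seq(5 * floor(min(x) / 5), 5 * ceiling(max(x) / 5), by = 5)
h <- hist(x, breaks = breaks, plot = FALSE)

m <- mean(x)
s <- sd(x)
n <- length(x)
width <- diff(h$mids[1:2])

# Exact expected count per bin: n times the normal probability of the bin
expected_exact <- n * diff(pnorm(h$breaks, mean = m, sd = s))

# Rectangle approximation: height (density at midpoint) * base * n
expected_approx <- dnorm(h$mids, mean = m, sd = s) * width * n

# Side-by-side comparison with the observed counts
print(round(cbind(mid = h$mids, observed = h$counts,
                  exact = expected_exact, approx = expected_approx), 1))
```

If the histogram is drawn with hist(x, freq = FALSE) instead, the bars are already on the density scale and the dnorm values can be overlaid without any rescaling.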
Ferdi