While browsing for information on how I might plot a fitted normal curve over a histogram, I found the following:
http://www.statmethods.net/graphs/density.html
There is a line I don't fully understand, though I recognize that it really does work:
yfit <- yfit*diff(h$mids[1:2])*length(x)
Here, yfit is initially a vector of values of the pdf of an inferred normal distribution, evaluated at regular intervals along the x-axis; length(x) is the number of observations in the vector x from which the histogram was prepared; and diff(h$mids[1:2]) is the distance along the x-axis between the midpoints of the first and second bars of that histogram. After this statement is run, yfit holds its original values multiplied by those other two terms.
I understand that multiplying by length(x) makes sense, since it turns values of a probability density function into the number of observations around each respective value, bearing in mind that the pdf is continuous, so the number of observations at any single point is zero.
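For concreteness, here is a minimal sketch of the setup described above. It is not a copy of the linked page's code: the data set (the built-in mtcars$mpg), the breaks value, the grid length of 40 and the names binwidth, n and yfit_scaled are all assumptions made for illustration.

```r
# Hypothetical data: the built-in mtcars mpg column stands in for x
x <- mtcars$mpg

# Build the histogram object without drawing it; h$mids holds the bar midpoints
h <- hist(x, breaks = 10, plot = FALSE)

# Normal pdf evaluated on a regular grid, using the sample mean and sd
xfit <- seq(min(x), max(x), length.out = 40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))

binwidth <- diff(h$mids[1:2])  # distance between the first two midpoints
n <- length(x)                 # number of observations

# Same arithmetic as the line in question, with the two factors named
yfit_scaled <- yfit * binwidth * n

# To draw the result: plot(h); lines(xfit, yfit_scaled)
```

For equal-width bins, the distance between adjacent midpoints is the same as the bin width, which is why the factor is named binwidth here.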
I don't understand why it is necessary to multiply by diff(h$mids[1:2]) to get the right outcome in the graph, although I can confirm that it does get the right outcome.
Does anyone have an explanation?
[...] yfit is a density and that the "histogram" you mention is not a histogram at all, but rather is a bar chart (showing frequencies rather than frequency densities). I see nothing R-specific about this procedure, which is a standard one. – whuber May 26 '16 at 20:54
yfit gives the heights of rectangles. diff(h$mids[1:2]) gives their bases. The product gives their areas. The so-called "histogram" is plotting areas (that is, frequencies) by means of bars whose heights represent the areas. So it all comes down to the formula for the area of any rectangle, area = base * height. This is explained in the links I first provided. – whuber May 26 '16 at 20:59
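The area = base * height argument can be checked numerically. The following is a small sketch, assuming simulated data and equal-width bins (the default for hist); the names binwidth, n and reconstructed are made up for illustration. It rebuilds the bar frequencies from the density heights that hist stores.

```r
set.seed(1)
x <- rnorm(200)             # hypothetical sample
h <- hist(x, plot = FALSE)  # equal-width bins by default; nothing is drawn

binwidth <- diff(h$mids[1:2])  # base of each rectangle
n <- length(x)

# height (density) * base (bin width) = area (proportion of observations);
# multiplying by n converts that proportion into a frequency
reconstructed <- h$density * binwidth * n

# Compare with the frequencies that hist() computed directly
matches <- isTRUE(all.equal(reconstructed, as.numeric(h$counts)))
```

If matches is TRUE, the same two factors that convert the density heights into the plotted frequencies are the ones applied to yfit, which is why the scaled curve lines up with the bars.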