I am not necessarily interested in testing normality, but at least ensure that:
The mean is near 0
There is a single mode
Tails are thinning out as you go further from the mean
Any ideas?
Distributions which satisfy those criteria can include distributions whose behavior is very different from the normal. For example, a $t$ distribution with two degrees of freedom satisfies those criteria, as does an asymmetric Laplace distribution, but each has very different properties from a normal.
I'll assume (though you don't state it) that you're primarily concerned with continuous variates. If you have discrete or even categorical data the advice would differ somewhat.
A common choice for assessing those sorts of things on data is the histogram, but caution is required; even small changes in the bin width and/or bin origin can in some cases lead to a big difference in appearance. Here are two histograms of the same data set:
It's often advisable to use narrower bins than the common defaults packages offer; we're trying to get a visual idea of shape and can smooth by eye. [The smoothing of the default settings is often quite strong.]
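As a rough sketch of how you might check this sensitivity yourself in R, the following draws two histograms with the same bin width but different bin origins. The data are simulated and purely hypothetical (a gamma sample recorded to the nearest 0.25 so that it clumps); they are not the data set discussed in this answer, and the bin width of 0.5 and the shift of 0.125 are arbitrary choices.

```r
# Hypothetical data: mildly right-skewed values recorded to the nearest 0.25,
# so they clump on a grid (an assumption, not the data from the answer).
set.seed(1)
x <- round(rgamma(150, shape = 8, rate = 2) * 4) / 4

width <- 0.5
lo <- floor(min(x)) - width
hi <- ceiling(max(x)) + width

# Same bin width, two different bin origins; both sets of breaks span the data.
breaks_a <- seq(lo, hi, by = width)
breaks_b <- seq(lo + 0.125, hi + 0.125, by = width)

h_a <- hist(x, breaks = breaks_a, plot = FALSE)
h_b <- hist(x, breaks = breaks_b, plot = FALSE)

# Draw the two histograms side by side for comparison.
op <- par(mfrow = c(1, 2))
plot(h_a, main = "Original bin origin")
plot(h_b, main = "Origin shifted by 0.125")
par(op)
```

To try narrower bins, reduce `width`, or pass a larger suggested bin count such as `hist(x, breaks = 30)`.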
Another alternative which can work well, particularly in large samples, is the kernel density estimate; again, you may want to use a bit less smoothing than the default by choosing a narrower bandwidth. Here's an example using the same data as above, but with half the default bandwidth, because the default (dashed curve) obscures the clumpiness in the original data that produces the inconsistency in the histogram:
This can sometimes be useful for spotting multiple modes, though small modes can be hard to spot (or tell apart from noise) with any display.
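A minimal sketch of that comparison in R, assuming simulated (hypothetical) data rather than the data set used here: `density()` takes an `adjust` argument that multiplies the default bandwidth, so `adjust = 0.5` halves it.

```r
# Hypothetical clumpy data (an assumption, not the data from the answer).
set.seed(1)
x <- round(rgamma(150, shape = 8, rate = 2) * 4) / 4

d_default <- density(x)                # default bandwidth rule (bw.nrd0)
d_narrow  <- density(x, adjust = 0.5)  # half the default bandwidth

# Solid curve: narrower bandwidth; dashed curve: default bandwidth.
plot(d_narrow, main = "Kernel density estimates")
lines(d_default, lty = 2)
rug(x)  # tick marks at the observed values
```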
Another option is the quantile-quantile plot, or Q-Q plot; the normal Q-Q plot is very common. It can convey a lot of information about shape, tail behavior (particularly if you want to see whether the tails are heavier or lighter than those of a normal, say), and symmetry, but it takes some practice to learn to read them.
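For instance, a normal Q-Q plot can be produced in base R along these lines. The sample is simulated from a $t$ distribution with 5 degrees of freedom purely for illustration (a hypothetical choice, not the data discussed here).

```r
# Hypothetical heavy-tailed sample (an assumption for illustration only).
set.seed(1)
y <- rt(500, df = 5)

# Sample quantiles (vertical axis) against standard normal quantiles.
qq <- qqnorm(y, main = "Normal Q-Q plot")
qqline(y, col = "dimgrey")  # reference line through the quartiles
```

With the sample on the vertical axis, tails heavier than the normal tend to show up as points falling below the line at the left end and above it at the right end; right skewness tends to show up as a curve that bends upward.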
In this case both the suggestion of mild right skewness and the odd clumpiness can be seen. Judging symmetry can be aided by also displaying a density estimate for $2M-x$ (where $M$ is some measure of center, perhaps the median for example). You don't have to calculate a new density estimate for that; the original just needs to be plotted against $2M-x$ instead of $x$.
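The reflection idea might be sketched like this in R, taking $M$ to be the sample median and again using simulated, hypothetical data. The reflected curve reuses the heights of the original estimate and only changes the horizontal coordinate.

```r
# Hypothetical mildly right-skewed data (an assumption for illustration only).
set.seed(1)
x <- rgamma(300, shape = 8, rate = 2)

M <- median(x)                 # chosen measure of center
d <- density(x, adjust = 0.5)  # one density estimate, computed once

# Widen the x-axis so both the curve and its mirror image fit.
xr <- range(c(d$x, 2 * M - d$x))

plot(d, xlim = xr, main = "Density and its reflection about the median")
lines(2 * M - d$x, d$y, lty = 2)  # same heights plotted against 2M - x
abline(v = M, col = "dimgrey")    # axis of reflection
```

If the distribution were symmetric about $M$, the solid and dashed curves would roughly coincide; systematic gaps between them suggest skewness.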
rt(n,df=2)? I'm used to the nice hist(rnorm(1e4)), and trying to reproduce your images, I get really messed up histograms: set.seed(99); n <- 1e4; y <- rt(n, df = 2); summary(y); hist(y)
– Antoni Parellada
Nov 26 '15 at 01:23
dt). e.g. try looking at say dt2=function(x) dt(x,2);curve(dt2,-10,10);abline(h=0,col="dimgrey") The $t_2$ is pretty heavy tailed (its variance is not finite), so random values from it can be very large. Since a very large outlier can distort the appearance of most displays, you may want to plot say the middle 99% (for a standard $t_2$, try hist(y,xlim=c(-10,10),n=2000) which should typically contain just over 99% of the data values and for $10^4$ points looks reasonably like the curve above).
– Glen_b
Nov 26 '15 at 01:31
Error in hist(x, xlim = c(-10, 10), n = 1000) : object 'x' not found, but going back for a second to the initial code in my comment, if you don't mind it, I'm surprised that I can't get it to behave even changing it to: set.seed(99); n <- 1e4; y <- rt(n, df = 2); summary(y); hist(y, xlim=c(-4,4), breaks=6)
– Antoni Parellada
Nov 26 '15 at 01:37
y (mine were in x); sorry about that. I have now edited my comment so you can run it straight on your data. You need a lot of breaks (my n), since it specifies the number of breaks over the entire range, not the plotted range. Try breaks=2000 (equiv. n=2000) and use a wider range of values.
– Glen_b
Nov 26 '15 at 01:38
set.seed(99); n <- 1e4; y <- rt(n, df = 2); summary(y); hist(y, xlim=c(-5,5), breaks=2000). I can't believe how much is in the tails...
– Antoni Parellada
Nov 26 '15 at 01:44
library(MASS);truehist(y,xlim=c(-10,10)). That function doesn't base its binwidth on the range. In fact I think its builtin function uses nclass.FD which returns just under 2000 bins on your data set. (Edit: no, the Scott rule is default in truehist)
– Glen_b
Nov 26 '15 at 01:45