12

I'm not a statistician, so please bear with my blunders, if any.

Would you please explain in a simple manner how simulation is done? I know that it picks some random sample from a normal distribution and uses it for simulating, but I don't clearly understand how.

Curious
  • 129

4 Answers

30

In statistics, simulation is used to assess the performance of a method, typically when there is a lack of theoretical background. With simulations, the statistician knows and controls the truth.

Simulation is used to advantage in a number of situations: empirically estimating sampling distributions, studying the effect of misspecified assumptions on statistical procedures, determining the power of hypothesis tests, and so on.
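For instance, here is a minimal sketch (in R) of how the power of a test can be estimated by simulation; the two-sample t-test, sample size, effect size, and significance level below are arbitrary illustrative choices, not part of the example further down:

    # estimate the power of a two-sample t-test by simulation
    set.seed(123)
    nsim  <- 10000          # number of simulated data sets
    n     <- 30             # size of each group
    delta <- 0.5            # true difference in means
    alpha <- 0.05           # significance level
    reject <- logical(nsim)
    for (i in 1:nsim) {
      g1 <- rnorm(n, mean = 0,     sd = 1)
      g2 <- rnorm(n, mean = delta, sd = 1)
      reject[i] <- t.test(g1, g2)$p.value < alpha
    }
    mean(reject)            # proportion of rejections = estimated power

The proportion of simulated data sets in which the null hypothesis is rejected estimates the power under those settings.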

Simulation studies should be designed with great rigour. Burton et al. (2006) gave a very nice overview in their paper 'The design of simulation studies in medical statistics'. Simulation studies conducted in a wide variety of situations may be found in the references.

Simple illustrative example

Consider the linear model

$$ y = \mu + \beta x + \epsilon $$

where $x$ is a binary covariate ($x=0$ or $x=1$), and $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Using simulations in R, let us check that

$$ E(\hat{\beta}) = \beta. $$

> #------settings------
> n <- 100            #sample size                          
> mu <- 5             #this is unknown in practice                         
> beta <- 2.7         #this is unknown in practice
> sigma <- 0.15       #this is unknown in practice
> #--------------------
> 
> #------set the seed so that this example can be replicated------
> set.seed(937)
> #---------------------------------------------------------------
>
> #------generate 1000 data sets and store betaHat------
> betaHat <- numeric(1000)
> for(i in 1:1000)
+ {
+     #generate the binary covariate --> n Bernoulli trials
+   x <- sample(x=c(0, 1), size=n, replace=TRUE, prob=c(0.5, 0.5))
+     #generate the errors
+   epsilon <- rnorm(n=n, mean=0, sd=sigma)
+     #form the response variable      
+   y <- mu + beta * x + epsilon 
+     #the ith generated data set
+   data_i <- data.frame(y=y, x=x)
+     #fit the model
+   mod <- lm(y~x, data=data_i)
+     #store the estimate of beta
+   betaHat[i] <- as.numeric(coef(mod)[2])     
+ }    
> #-----------------------------------------------------
> 
> #------E(betaHat) = beta?------
> mean(betaHat)
[1] 2.698609
> #------------------------------

Note: There is a letter to the editor for the paper referenced above.

ocram
  • 21,851
10

First of all, there are many, many different types of simulation in statistics, and even more in the surrounding fields. Just saying "simulation" is about as useful as saying "model" - that is to say, not much at all.

Based on the rest of your question, I'm going to guess you mean Monte Carlo simulation, but even that's a little vague. Basically, what happens is you repeatedly draw samples from a distribution (it need not be normal) in order to do some statistical analysis on an artificial population with known, but random, properties.

The purpose of this tends to fall into two categories:

Can My Method Handle X?: Essentially, you're simulating a series of many random populations with a known "right" answer to see whether your new technique gives you back that right answer. As a basic example, let's say you've developed what you think is a new way of measuring the correlation between two variables, X and Y. You'd simulate two variables where the value of Y depends on the value of X plus some random noise, for example Y = 0.25x + noise. You'd then generate such a population (random values of X, and values of Y equal to 0.25x plus a random number), likely many, many thousands of times, and show that, on average, your new technique spits out a number that properly recovers the relationship Y = 0.25x.
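A minimal sketch of that idea in R, using lm() simply as a stand-in for the hypothetical "new technique"; the sample size, noise level, and number of replications are arbitrary choices:

    # does the method recover the true slope of 0.25 on average?
    set.seed(42)
    nsim <- 5000
    slopeHat <- numeric(nsim)
    for (i in 1:nsim) {
      x <- runif(100)                       # random values of X
      y <- 0.25 * x + rnorm(100, sd = 0.1)  # Y = 0.25x + noise
      slopeHat[i] <- coef(lm(y ~ x))[2]     # what the method estimates
    }
    mean(slopeHat)                          # should be close to 0.25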

What Happens If? Simulation can also be done as a sensitivity analysis for an existing study. Let's say, for example, I've run a cohort study, but I know my exposure measurement isn't very good: it incorrectly classifies 30% of my subjects as exposed when they shouldn't be, and classifies 10% of my subjects as unexposed when they shouldn't be. The problem is, I don't have a better test, so I don't know which subjects are which.

I'd take my population and give each exposed subject a 30% chance of switching to unexposed, and each unexposed subject a 10% chance of switching to exposed. I'd then make thousands of new populations, randomly determining which subjects switch, and re-run my analysis. The range of those results gives me a good estimate of how much my study result might change if I could have classified everyone correctly.
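A minimal sketch of that reclassification scheme in R. The cohort below is simulated only so the code runs (in practice d0 would be the observed data), the variable names are hypothetical, and logistic regression is just one possible choice of analysis:

    # hypothetical observed cohort (made up purely for illustration)
    set.seed(7)
    n  <- 1000
    d0 <- data.frame(exposed = rbinom(n, 1, 0.4))
    d0$outcome <- rbinom(n, 1, plogis(-1 + 0.8 * d0$exposed))

    # repeatedly reclassify exposure and re-run the analysis
    nsim   <- 2000
    effect <- numeric(nsim)
    for (i in 1:nsim) {
      d <- d0
      to_unexposed <- d$exposed == 1 & runif(n) < 0.30  # 30% chance of switching
      to_exposed   <- d$exposed == 0 & runif(n) < 0.10  # 10% chance of switching
      d$exposed[to_unexposed] <- 0
      d$exposed[to_exposed]   <- 1
      effect[i] <- coef(glm(outcome ~ exposed, data = d, family = binomial))[2]
    }
    quantile(effect, c(0.025, 0.975))   # spread of re-analysed log odds ratios

The spread of the re-estimated effects shows how sensitive the original result is to the assumed misclassification rates.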

There is, of course, as always, greater complexity, nuance, and utility to simulation, depending on how deep you want to dig.

Fomite
  • 23,134
  • 1. So what you explained in your answer is Monte-Carlo simulation? 2. Are there other kinds of simulations (other than Monte-Carlo) that are used in statistics?
  • – vasili111 Sep 22 '19 at 02:23