12

I'm not a statistician, so please bear with my blunders, if any.

Would you please explain in a simple manner how simulation is done? I know that it picks some random sample from a normal distribution and uses it for simulating, but I don't clearly understand how.

Curious
  • 129

4 Answers

30

In statistics, simulation is used to assess the performance of a method, typically when there is a lack of theoretical background. With simulations, the statistician knows and controls the truth.

Simulation is used to advantage in a number of situations: empirically estimating sampling distributions, studying the effect of misspecified assumptions on statistical procedures, determining the power of hypothesis tests, and so on.
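For instance, here is a minimal sketch (in R) of how the power of a test can be estimated by simulation; the two-sample t-test, sample size, effect size, and significance level below are arbitrary illustrative choices, not part of the example further down:

    # estimate the power of a two-sample t-test by simulation
    set.seed(123)
    nsim  <- 10000          # number of simulated data sets
    n     <- 30             # size of each group
    delta <- 0.5            # true difference in means
    alpha <- 0.05           # significance level
    reject <- logical(nsim)
    for (i in 1:nsim) {
      g1 <- rnorm(n, mean = 0,     sd = 1)
      g2 <- rnorm(n, mean = delta, sd = 1)
      reject[i] <- t.test(g1, g2)$p.value < alpha
    }
    mean(reject)            # proportion of rejections = estimated power

The proportion of simulated data sets in which the null hypothesis is rejected estimates the power under those settings.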

Simulation studies should be designed with great rigour. Burton et al. (2006) gave a very nice overview in their paper 'The design of simulation studies in medical statistics'. Simulation studies conducted in a wide variety of situations may be found in the references.

Simple illustrative example

Consider the linear model

$$ y = \mu + \beta x + \epsilon $$

where $x$ is a binary covariate ($x=0$ or $x=1$), and $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Using simulations in R, let us check that

$$ E(\hat{\beta}) = \beta. $$

> #------settings------
> n <- 100            #sample size                          
> mu <- 5             #this is unknown in practice                         
> beta <- 2.7         #this is unknown in practice
> sigma <- 0.15       #this is unknown in practice
> #--------------------
> 
> #------set the seed so that this example can be replicated------
> set.seed(937)
> #---------------------------------------------------------------
>
> #------generate 1000 data sets and store betaHat------
> betaHat <- numeric(1000)
> for(i in 1:1000)
+ {
+     #generate the binary covariate --> n Bernoulli trials
+   x <- sample(x=c(0, 1), size=n, replace=TRUE, prob=c(0.5, 0.5))
+     #generate the errors
+   epsilon <- rnorm(n=n, mean=0, sd=sigma)
+     #form the response variable      
+   y <- mu + beta * x + epsilon 
+     #the ith generated data set
+   data_i <- data.frame(y=y, x=x)
+     #fit the model
+   mod <- lm(y~x, data=data_i)
+     #store the estimate of beta
+   betaHat[i] <- as.numeric(coef(mod)[2])     
+ }    
> #-----------------------------------------------------
> 
> #------E(betaHat) = beta?------
> mean(betaHat)
[1] 2.698609
> #------------------------------

Note: There is a letter to the editor for the paper referenced above.

ocram
  • 21,851
10

First of all, there are many, many different types of simulation in statistics, and even more in the surrounding fields. Just saying "simulation" is about as useful as saying "model" - that is to say, not much at all.

Based on the rest of your question, I'm going to guess you mean Monte Carlo simulation, but even that's a little vague. Basically, what happens is you repeatedly draw samples from a distribution (it need not be normal) in order to do some statistical analysis on an artificial population with known, but random, properties.

The purpose of this tends to fall into two categories:

Can My Method Handle X?: Essentially, you're simulating a series of many random populations with a known "right" answer to see whether your new technique gives you back that right answer. As a basic example, let's say you've developed what you think is a new way of measuring the correlation between two variables, X and Y. You'd simulate two variables where the value of Y depends on the value of X plus some random noise, for example Y = 0.25x + noise. You'd then generate such a population (random values of X, and values of Y equal to 0.25x plus a random number), likely many, many thousands of times, and show that, on average, your new technique spits out a number that properly recovers the relationship Y = 0.25x.
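A minimal sketch of that idea in R, using lm() simply as a stand-in for the hypothetical "new technique"; the sample size, noise level, and number of replications are arbitrary choices:

    # does the method recover the true slope of 0.25 on average?
    set.seed(42)
    nsim <- 5000
    slopeHat <- numeric(nsim)
    for (i in 1:nsim) {
      x <- runif(100)                       # random values of X
      y <- 0.25 * x + rnorm(100, sd = 0.1)  # Y = 0.25x + noise
      slopeHat[i] <- coef(lm(y ~ x))[2]     # what the method estimates
    }
    mean(slopeHat)                          # should be close to 0.25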

What Happens If? Simulation can also be done as a sensitivity analysis for an existing study. Let's say, for example, I've run a cohort study, but I know my exposure measurement isn't very good: it incorrectly classifies 30% of my subjects as exposed when they shouldn't be, and classifies 10% of my subjects as unexposed when they shouldn't be. The problem is, I don't have a better test, so I don't know which subjects are which.

I'd take my population and give each exposed subject a 30% chance of switching to unexposed, and each unexposed subject a 10% chance of switching to exposed. I'd then make thousands of new populations, randomly determining which subjects switch, and re-run my analysis. The range of those results gives me a good estimate of how much my study result might change if I could have classified everyone correctly.
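A minimal sketch of that reclassification scheme in R. The cohort below is simulated only so the code runs (in practice d0 would be the observed data), the variable names are hypothetical, and logistic regression is just one possible choice of analysis:

    # hypothetical observed cohort (made up purely for illustration)
    set.seed(7)
    n  <- 1000
    d0 <- data.frame(exposed = rbinom(n, 1, 0.4))
    d0$outcome <- rbinom(n, 1, plogis(-1 + 0.8 * d0$exposed))

    # repeatedly reclassify exposure and re-run the analysis
    nsim   <- 2000
    effect <- numeric(nsim)
    for (i in 1:nsim) {
      d <- d0
      to_unexposed <- d$exposed == 1 & runif(n) < 0.30  # 30% chance of switching
      to_exposed   <- d$exposed == 0 & runif(n) < 0.10  # 10% chance of switching
      d$exposed[to_unexposed] <- 0
      d$exposed[to_exposed]   <- 1
      effect[i] <- coef(glm(outcome ~ exposed, data = d, family = binomial))[2]
    }
    quantile(effect, c(0.025, 0.975))   # spread of re-analysed log odds ratios

The spread of the re-estimated effects shows how sensitive the original result is to the assumed misclassification rates.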

There is, of course, as always, greater complexity, nuance, and utility to simulation, depending on how deep you want to dig.

Fomite
  • 23,134
  • 1. So what you explained in your answer is Monte-Carlo simulation? 2. Are there other kinds of simulations (other than Monte-Carlo) that are used in statistics?
  • – vasili111 Sep 22 '19 at 02:23