
I am working with data from a manufacturing line. The parts are processed in batches of, say, 20. In order to control the process, we take measurements on the manufactured parts. We know that the variation is caused by (1) some systematic sources that are equal for all parts in a batch but vary randomly between batches, and (2) some random sources that vary randomly from part to part within each batch.

I am assuming that both systematic and random variations are normally distributed and uncorrelated. This will produce a total probability density of $\mathcal{N}(\mu, \sqrt{\sigma_s^2+\sigma_r^2})$

My goal is to estimate $\sigma_s$ and $\sigma_r$ given a data set that contains the batch association for each part. The table below shows an example of the dataset.

Part number   Batch number   Measurement value
part 1        batch 1        value 1
part 2        batch 1        value 2
part 3        batch 2        value 3
part 4        batch 2        value 4
...           ...            ...

Normally, I would use MLE to estimate the parameters of a distribution, but using it here would ignore the information about the batch association and only give an estimate of the total standard deviation $\sigma_t = \sqrt{\sigma_s^2+\sigma_r^2}$.

Is there a formal way to estimate these parameters using hierarchical information like the above?


1 Answer


What you describe is a simple linear mixed model, also known as a variance components model or random effects model.$\newcommand{\E}{\mathbb{E}}\newcommand{\Var}{\mathrm{Var}}$

In a formal description, let $X_{ij}$ be the measurement of the $j$-th part in batch $i$, where $i=1,\dots, m$, $j=1,\dots,n_i$. The total number of observations is $N=\sum_{i=1}^m n_i$. Your model can be written as $$ X_{ij}=\mu + B_{i} + \epsilon_{ij}, $$ with independent batch effects $\E B_i=0,\Var B_i=\sigma^2_s$ and remaining errors $\E \epsilon_{ij}=0,\Var \epsilon_{ij}=\sigma^2_r$; you furthermore assumed normality.
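
As a side note (a direct consequence of this model, just to make the link to the total variance from your question explicit): each measurement has $\Var(X_{ij})=\sigma^2_s+\sigma^2_r=\sigma^2_t$, while measurements from the same batch are correlated, $$ \mathrm{Cov}(X_{ij},X_{ik})=\Var(B_i)=\sigma^2_s \quad (j\neq k), \qquad \mathrm{Cov}(X_{ij},X_{i'k})=0 \quad (i\neq i'). $$ It is this within-batch correlation that allows the two components to be separated; a pooled MLE of a single normal distribution discards exactly this information.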

There exist a number of methods to estimate the variances $\sigma^2_s$ and $\sigma^2_r$; whole books have been published on the topic (e.g. Searle, Casella, McCulloch: Variance Components, Wiley, 1992).

For an overview see e.g. the introduction of Teunissen and Amiri-Simkooei (2008), J Geodesy 82, 65-82.

Balanced design

If your design is balanced, i.e. the number of parts in each batch is the same, $n_1=n_2=\dots=n$, estimation of $\sigma^2_s$ and $\sigma^2_r$ becomes particularly easy.

You get unbiased estimates

  • for $\hat\sigma^2_r$ calculate the variances within each batch and take the average: $$ \begin{aligned} \hat{\sigma}^2_r &=\frac{1}{m}\sum_{i=1}^m\left(\frac{1}{n-1}\sum_{j=1}^n (x_{ij}-\bar{x}_{i .})^2\right),\quad \bar{x}_{i .}=\frac{1}{n}\sum_{j=1}^n x_{ij} \\&=\frac{1}{N-m}\sum_{i=1}^m\sum_{j=1}^n (x_{ij}-\bar{x}_{i .})^2 \end{aligned} $$
  • estimation of $\sigma_s^2$ is based on the group averages $\bar{x}_{i.}$. Since $\bar{X}_{i.}=\mu+B_i+\bar{\epsilon}_{i.}$ has variance $\Var(\bar X_{i.})=\sigma^2_s +\sigma^2_r/n$, we need to subtract the estimate $\hat{\sigma}^2_r/n$ to get the unbiased estimate $$ \hat\sigma^2_s = \frac{1}{m-1}\sum_{i=1}^m (\bar{x}_{i.}-\bar{x}_{. .})^2 - \frac{\hat{\sigma}^2_r}{n},\quad \bar{x}_{. .}=\frac{1}{N}\sum_{i=1}^m\sum_{j=1}^n x_{ij}. $$ You can get these estimates "manually" from group means and group variances. Here is an example in R with simulated data. Measurements are stored in the variable x and batch numbers in the factor variable batch.
# simulation of balanced design
m <- 16
n <- 20 
N <- n*m

mu <- 30
sd_batch <- sqrt(50)
sd_r <- 1

set.seed(42)

# batch effect
B <- rep(rnorm(m, 0, sd_batch), each = n)

# remaining error
e <- rnorm(N, 0, sd_r)

data <- data.frame(batch = factor(rep(1:m, each = n)), x = mu + B + e)

# manual estimation
s2r  <- mean(tapply(data$x, data$batch, var))   # average within-batch variance
s2sr <- var(tapply(data$x, data$batch, mean))   # variance of the batch means
s2s  <- s2sr - s2r/n                            # unbiased estimate of sigma_s^2

c(sigma2_s = s2s, sigma2_r = s2r)
#>   sigma2_s   sigma2_r 
#> 48.8063311  0.9352819

Created on 2023-08-23 with reprex v2.0.2

General case

When the experimental design is not balanced, that is, when batch sizes differ, it is easiest to use standard software for linear mixed models. The most common method for estimating the variances is restricted maximum likelihood (REML); you can find many threads on CV about REML, see e.g. What is "restricted maximum likelihood" and when should it be used?.

In R, I would recommend using the package lme4. Using the function lmer on the data from the manual example above, you get

library(lme4)
#> Loading required package: Matrix
summary(lmer(x ~ (1|batch), data = data, REML = TRUE))
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: x ~ (1 | batch)
#>    Data: data
#> 
#> REML criterion at convergence: 994
#> 
#> Scaled residuals: 
#>     Min      1Q  Median      3Q     Max 
#> -3.2005 -0.6911  0.0048  0.6538  3.0709 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev.
#>  batch    (Intercept) 48.8063  6.9862  
#>  Residual              0.9353  0.9671  
#> Number of obs: 320, groups:  batch, 16
#> 
#> Fixed effects:
#>             Estimate Std. Error t value
#> (Intercept)   33.446      1.747   19.14
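
If you prefer to extract these numbers programmatically rather than reading them off the summary, you can store the fitted model and query it with VarCorr(); here is a minimal sketch (fit is just a name for the lmer object from the call above):

fit <- lmer(x ~ (1 | batch), data = data, REML = TRUE)
# one row per variance component (grp = "batch" and "Residual"):
# vcov holds the estimated variance, sdcor its square root
as.data.frame(VarCorr(fit))[, c("grp", "vcov", "sdcor")]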

The model formula, x ~ (1|batch), specifies that there is a random intercept for each batch, corresponding to the random effect $B_i$ in the model. For more complicated models, CV users have collected tips on model specification in R's lmer cheat sheet.
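
Finally, to illustrate the unbalanced case mentioned above: the model call stays exactly the same, only the data change. Here is a minimal sketch with made-up batch sizes (it reuses m, mu, sd_batch and sd_r from the simulation above):

# hypothetical unbalanced design: batch sizes between 15 and 25 parts
set.seed(7)
n_i <- sample(15:25, m, replace = TRUE)
data_unbal <- data.frame(
  batch = factor(rep(1:m, times = n_i)),
  x = mu + rep(rnorm(m, 0, sd_batch), times = n_i) + rnorm(sum(n_i), 0, sd_r)
)
summary(lmer(x ~ (1 | batch), data = data_unbal, REML = TRUE))

The manual balanced-design formulas no longer apply here, but lmer with REML returns the analogous variance component estimates.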
