1

Within each run, the experiment is set up as below:

  • Genotypes refer to: WT (Wild type as blue), PKO (Partial Knockout in green), FKO (Full Knockout in red)
  • Biological triplicates means the same genotype grown in 3 different flasks, for example WT grown in one flasks, WT grown in second flask, WT grown in third flask.
  • Technical replicate means a single flask measured for 3 times.
  • 3 runs were perform to ensure reproducibility

Run 1: 3 genotypes, biological triplicates each sampled once i.e. no technical replicates. In total, 9 datapoints (black dots).

Run 2: 3 genotypes, biological triplicates each sampled thrice i.e. technical triplicates for each biological replicate. In total, 27 datapoints (black dots).

Run 3: 3 genotypes, biological triplicates each sampled thrice i.e. technical triplicates for each biological replicate. In total, 27 datapoints (black dots).

Preprocessing of the dataset part 1: Finding the concentration of each compound in each sample via spiked deuterated standards (picograms/ mg of protein) Concentration = (Area of compund/Area of deuterated standards) x amount of deuterated standards / total amount of protein.

Preprocessing of the dataset part 2: was done in Metaboanalyst 5.0 link - Autoscaling was used, and you see that across the 3 runs - in the right panel titled “Normalized Conc” have similar median and mean (yellow diamond).

Problem Because deuterated standards used to normalize are not not authentic standards the original concentration vary greatly between run. But it is clear at the normalized concentration that there is a Biological difference between the WT and the PKO and FKO. The PKO and FKO are expected to be same because PKO has 80% of the gene knockout while FKO 100%.

Question 1 Is this batch effect? If yes, what can I do to remove the batch effects

Question 2 How can I simplify the presentation and combine the 3 runs only to show the biological difference? My plan to redo the visualization is as follows, but I am not sure if it is correct:

#1: The whole is to compare biological difference so I plan to combine the technical replicates as they belong to the sample.

#2: I am thinking to work with the “Normalized Conc.” because it seems obvious (easiest) to me but I am not sure what kind of statistics I can do to combine the normalized concentrations across the 3 runs.

I feel like I am missing some fundamentals in statistics to be able to process this hence any advice will be appreciated.

  • Welcome to cv, Judy Mandy :-) Could you for the biological non experts explain what technical triplicates are? Does it mean that you measure the same thing for the same specimen 3 times and take the average? In general it would be useful to have information how the experiment was set up (how many individuals per genotype, what means "biological replicates", have the same individuals been used in each run), at least for a non-domain-expert. – Ute Jun 29 '23 at 15:43
  • 1
    Dear @Ute thank you! I have amended the description. Hope it is better now. – Jude Mandy Jun 30 '23 at 05:39
  • Are the displayed data for just a single compound? If not, what do the dots represent? If so, are you performing similar analyses on multiple compounds? How does a "normalized concentration" turn out to be less than 0, which seems to be the case for many PKO and FKO samples? Please provide that information by editing the question, as comments are easy to overlook and can be deleted. – EdM Jun 30 '23 at 12:39
  • I think your edits are fine, Jude Mandy (upvoted)! Somehow you lost the first sentence: "I ran metabolomics to detect a list of compounds, shown are 2 compounds (across) from 3 runs (columns).". You could pop that in again, and perhaps add how many compounds you looked at, say "I ran metabolomics to detect a list of 135 compounds; shown are 2 compounds (across) from 3 runs (columns)." – Ute Jun 30 '23 at 15:21
  • To address the comments by @EdM, you might add that you normalized data within each run and compound (it looks to me that you subtracted the mean of all measurements (across genotypes) and divided by the sd). It might also be interesting to know why there are quite big differences between runs - can we assume that all genotypes get the same form of "error" within a run? You said that the black dots correspond to data points in the description of the runs :-). – Ute Jun 30 '23 at 15:21

1 Answers1

1

Here are some recommendations to try:

Analysis of log(concentration)

These data are concentrations, and there is a clear batch effect, visible as a change of scale. It is very common and natural to register concentration data on a log scale, and in the present example, this has at least two advantages:

  • batch effects become additive
  • the variance within groups will become more comparable

Compare the plots from similar, simulated data below:

original scale: original simulated concentrations

logarithm of concentration:

logarithm of simulated concentration

Estimate and correct for batch effect using an ANOVA model

The following recommendation assumes that you have log-transformed concentrations:

Set up a linear model that reflects your experimental design. In particular, address these points:

  • does it make sense to assume that the batch effect affects all genotypes and all components in the same way within? From the data in your picture, this seems to be a reasonable assumption.
  • take into account that you have replicates on the same biological specimens in runs 2 and 3.

Then model runs as fixed effects, which allows you to adjust for batch effects by subtracting these effects from your data.

Now, after subtraction of the batch effect, you can visualize your data, either as log(concentration), or transform back to original concentration. Thus you avoid having negative concentrations as pointed out by @EdM.

The advantage of using an overall linear model is that you get somewhat more precise estimates for your batch effects, given the model assumptions are correct. If this seems too complicated with your, or if you have reasons to assume that batch effects vary between components within a run, you can still center data by subtracting the mean within run and component, but don't scale with the standard deviation when working with log(concentration)s.

After correction for a batch effect, you can treat the data as if they came from the same batch, and visualize them accordingly. Here is are box plots of batch adjusted log(concentration)s for the simulated example: enter image description here

This reflects the model I used for the simulation, where genotypes FKO and PKO had the same effect, and WT was larger by 1 on the log scale. Also, I assumed that acid9 had an effect of 1, which means that the original concentrations would be roughly three times as large as acid1 concentrations, in the same way for all genotypes. This makes the model very simple - it should be changed if that does not make sense biologically.

Transforming data back to the original scale gives enter image description here

Additions / Edits
# ---- setting up design matrix ----
library(dplyr)
drun1 <- data.frame(
  compo = gl(2, 9,  labels = c("acid1", "acid9")),
  geno  = rep(gl(3, 3, labels = c("FKO","PKO","WT"))),
  indiv = gl(9, 1, 18),
  run = 1)  
drun2 <- do.call("rbind",replicate(3,  drun1 %>% mutate(run = 2), 
                 simplify = FALSE)) |> 
  arrange(compo, geno, indiv)
drun3 <- mutate(drun2, run = 3)

design <- rbind(drun1, drun2, drun3) |> mutate(subject = factor(paste(run,indiv,sep="-")), run = factor(paste("run",run)))

---- simulation effects and noise ----

intercept <- 2 runef <- c(0, 1, -1) compef <- c(-1, 0) + 0.5 genef <- c(-1, -1, 0) +2/3 subjef <- rnorm(nlevels(design$subject), 0, 0.2) # a random effect noise <- rnorm(dim(design)[1], 0, 0.1)

data <- design |>
mutate (fixed = intercept + runef[run]+ compef[compo]+ genef[geno], subj = subjef[subject], lconc = fixed + subj + noise, conc = exp(lconc))

---- original plot ----

library(ggplot2) ggplot(data=data, aes(x=geno, y = conc, group = geno)) + geom_boxplot(aes(fill = geno)) + facet_wrap(compo~run, scales = "free")

---- estimating effects and correcting for batch effect ----

library(lme4) fitted_model <- lmer(lconc ~ run + compo + geno + (1|subject) - 1, data = data)

estimated_effect <- fixef(fitted_model) data <- data |> mutate(lconc_adj = lconc - estimated_effect[paste0("run", run)], conc_adj = exp(lconc_adj))

ggplot(data = data, aes(x=geno, y = conc_adj, group = geno)) + geom_boxplot(aes(fill = geno)) + facet_grid(. ~ compo) + ylab ("adjusted concentration")

Ute
  • 2,580
  • 1
  • 8
  • 22