I have a situation in which I am going to perform a study on live blue mussels. The research question is how the prevalence of mussels affects the prevalence of other animals living on the sea floor.

The setup for the field work is to examine a number of smaller squares (1x1 meters) located within larger areas (sites) (100x100 meters). So I will have squares within sites. The sites are separated by approximately one kilometer.

The question is how to choose the number of sites and the number of squares per site. For budget and practical reasons there is a trade-off: fewer sites leave room for more squares per site, and vice versa.

The practical limits are roughly 5 sites (few) to 15 sites (many), and 10 squares per site (few) to 20 squares per site (many).

Do you have any guidance on how to make this trade-off between the number of sites and the number of squares?

A note: the literature provides rough estimates of the between-site and within-site variances, and the two seem to be of similar magnitude. Does that point towards a "medium" setup with, let's say, 8 sites and 15 squares in each?

Literature tips on nested designs in ecology are also welcome!

Edit: The study is purely observational. No sites or squares will be manipulated by moving mussels etc.

  • Is this a purely observational study, or are you going to do some experimental manipulations (like seeding different numbers of mussels into squares)? Please add that information by editing the question, as comments are easy to overlook and can be deleted. – EdM Sep 07 '22 at 20:36
  • Good point. Just edited the question. – Yung Gud Sep 08 '22 at 06:42

2 Answers

This is an interesting Optimal Experiment Design (OED) problem. Finding the best experimental parameters (i.e. the numbers of sites and squares) requires specifying a cost function. For instance, your research question could be formulated as a Bayesian inference problem:

Let $x$ denote the mussel density and $y$ the density of other animals. Then the goal is to find a model that links both quantities, e.g. $$ y = f_{\theta}(x) $$ where $f$ is a function depending on a set of parameters $\theta$. Assuming that you have $T$ observations $x_{1:T}$ and $y_{1:T}$, your goal is to find the numbers of sites and squares that allow you to infer $\theta$ as precisely as possible, i.e. to minimize the variance or the entropy of the posterior distribution $p(\theta|x_{1:T},y_{1:T})$. This approach is called Bayesian experimental design.

Such a problem can be solved using brute-force simulations:

  1. Specify a set of ground-truth parameters $\theta^*$
  2. Generate surrogate data $x_{1:T}$ and $y_{1:T}$
  3. For different numbers of sites and squares, compute the posterior distribution $p(\theta|x_{1:T},y_{1:T})$ and study which experiment design has the best utility
  4. Repeat for different possible and realistic values of $\theta^*$
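
The steps above could be sketched roughly as follows. This is a minimal illustration, not a full Bayesian analysis: instead of a posterior, it uses a simpler frequentist stand-in, namely the spread of a pooled least-squares slope across repeated simulated surveys. The linear form of $f_\theta$, the effect size `beta`, the standard deviations, and all function names are assumptions made for the example, not values taken from the question or the literature.

```python
import random
import statistics


def simulate_slope(n_sites, n_squares, beta=0.5, sd_site=1.0, sd_square=1.0, rng=None):
    """Simulate one survey and return the pooled least-squares slope of y on x."""
    rng = rng or random.Random()
    xs, ys = [], []
    for _ in range(n_sites):
        # Assumed: each site has its own mean mussel density and its own
        # baseline shift in the abundance of other animals.
        site_x = rng.gauss(0.0, 1.0)
        site_effect = rng.gauss(0.0, sd_site)
        for _ in range(n_squares):
            x = site_x + rng.gauss(0.0, 1.0)  # mussel density in one square
            y = beta * x + site_effect + rng.gauss(0.0, sd_square)
            xs.append(x)
            ys.append(y)
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def slope_spread(n_sites, n_squares, n_rep=500, seed=1, **kwargs):
    """Standard deviation of the slope estimate over n_rep simulated surveys."""
    rng = random.Random(seed)
    estimates = [
        simulate_slope(n_sites, n_squares, rng=rng, **kwargs) for _ in range(n_rep)
    ]
    return statistics.stdev(estimates)


# Candidate designs taken from the limits stated in the question.
designs = [(5, 20), (8, 15), (15, 10)]
results = {design: slope_spread(*design) for design in designs}

for (n_sites, n_squares), sd in results.items():
    print(f"{n_sites:2d} sites x {n_squares:2d} squares: SD of slope estimate = {sd:.3f}")
```

A smaller spread means a more precise estimate of the association. Note that the three designs do not use the same total number of squares, so in practice you would compare designs of equal cost, and rerun with other plausible values of `beta`, `sd_site` and `sd_square` (step 4).
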
Camille Gontier

Whether or not you consider this from a Bayesian perspective, the suggestion by Camille Gontier (+1) to use simulations is probably your most useful approach.

The risk at one extreme is having unrealistically low variance due to making multiple measurements over a geographically restricted area with just a few sites, leading to potential pseudoreplication and poor applicability of your model to new sites. The risk at the other extreme is sampling over many sites with too few replicates in each, so that the associations of interest might be dwarfed by variability among sites (like baseline abundances of species) that is poorly controlled for in your data.

Simulating data based on the literature and your understanding of the subject matter forces you to consider the various sources of variability within and between sites and to decide on a range of realistic estimates. Trying to model those simulated data forces you to think carefully about your modeling strategy and to evaluate the implications of those types of variability for the best combination of site numbers and squares within sites, consistent with your budget.

My guess is that you should maximize the number of sites provided that you can still have a few observations (squares) within each site. For example, if you treat the sites as random effects (clusters) in a mixed model, you typically want many clusters. As Robert Long put it: "In general, the number of clusters is more important than the number of observations per cluster." But the specifics of your situation might indicate otherwise. Informed simulations should point the way.
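
As a rough illustration of why the number of sites tends to dominate, take the simpler problem of estimating an overall mean (an assumption made here for tractability; your actual target is an association) in a balanced design. The symbols are introduced just for this example: $J$ sites, $n$ squares per site, between-site variance $\sigma_b^2$ and within-site variance $\sigma_w^2$. Then $$ \operatorname{Var}(\bar{y}) = \frac{\sigma_b^2}{J} + \frac{\sigma_w^2}{Jn} $$ If the two variances are equal, $\sigma_b^2 = \sigma_w^2 = \sigma^2$, this gives roughly $0.21\,\sigma^2$ for 5 sites with 20 squares, $0.13\,\sigma^2$ for 8 sites with 15 squares, and $0.07\,\sigma^2$ for 15 sites with 10 squares. Extra squares shrink only the second term, while extra sites shrink both.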

EdM