
I have a series of experiments where a certain set of parameters (let's call them P1, P2, ...) have been quantified in single cells.

For technical reasons it is not possible to quantify all of these parameters at the same time, so I have different experiments where I measured a subset of them.

For example let's say that I measured:

Experiment 1: P1, P2, P3, P4

Experiment 2: P1, P2, P5, P6

Experiment 3: P1, P3, P7, P8

...

I have thousands of measurements per experiment (and replicates for each set of parameters). Note that, although in each experiment I am measuring cells of the same type, they are different cells every time, so each experiment is independent from the others.

What I would like to do now is predict the relation between all of the parameters, from knowing the relationships between some of them.

For example, I would like to be able to predict the (likely) values of the missing parameters P5 to P8 for the cells in experiment 1, based on what I learn about their relation to the other parameters in experiments 2 and 3.

What would be the best way to approach this problem?

nico
    Could you clarify what you mean by 'predict the relation between all of the parameters'? Are you interested in understanding the functional form of the relationships? Or predicting the missing values (which multiple imputation deals with)? – mkt Jul 09 '18 at 11:34
    @mkt I have updated the question. Essentially I would like to predict the missing values. In other words, if I now have a set of data with P1, P2, P3 and P4, I would like to predict P5, P6, P7, and P8. Hope this makes more sense now. – nico Jul 09 '18 at 14:04

1 Answer


The general approach you are after is termed 'imputation' (or 'multiple imputation', though that typically includes a subsequent analysis step as well). Broadly speaking, this involves filling in (i.e. imputing) missing values in datasets.

There are a variety of techniques that can be used for this, which fall into two broad classes.

  1. Fill in the missing values with guesses based on the non-missing values of the same variable. Typically, the mean (or median, mode, etc.) of the non-missing values is used as the best guess for the missing values. Note that this approach does not use the information on the relationships between variables.

  2. A better approach is to construct a model based on the non-missing values, and use this to predict the missing values. In your case, this model could be a multiple linear regression, random forest, gradient boosting machine, or other such technique.
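To make the contrast concrete, here is a minimal R sketch on made-up data (the data frame `d` and its columns are purely illustrative, not your measurements):

```r
# Toy data: one missing value in P3 (column names only mirror the question)
d <- data.frame(P1 = c(1.0, 2.1, 2.9, 4.2, 5.1, 6.0),
                P2 = c(0.5, 1.4, 1.6, 2.2, 2.4, 3.1),
                P3 = c(2.0, 4.1, NA,  8.3, 9.8, 12.1))

# Class 1: replace the NA with the mean of the observed P3 values
# (ignores the relationship between P3 and the other parameters)
d$P3_mean <- ifelse(is.na(d$P3), mean(d$P3, na.rm = TRUE), d$P3)

# Class 2: model P3 from the other parameters on the complete rows,
# then predict the missing value from that cell's P1 and P2
fit <- lm(P3 ~ P1 + P2, data = d[!is.na(d$P3), ])
d$P3_model <- ifelse(is.na(d$P3), predict(fit, newdata = d), d$P3)
```

The second version is the one that exploits the relationships between parameters, which is exactly the information your multi-experiment design provides.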

The second approach has an additional advantage: it can incorporate uncertainty in imputed values. Basically, you can generate a number of different imputed datasets, each of which differs in its imputed values to a degree based on the uncertainty in the underlying model.

For example, if a multiple regression identifies at best a weak relationship between parameter A and the remaining parameters, then each imputed dataset will have large differences in the imputed values for parameter A. Conversely, if parameter B is strongly explained by the remaining parameters, the imputed datasets will vary only slightly in the imputed values of parameter B.
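As a rough illustration of this point, here is a small simulated example (not a recommendation of this hand-rolled procedure over a proper imputation package): draw several stochastic-regression imputations and compare how much they scatter for a weakly versus a strongly explained parameter.

```r
set.seed(1)
n  <- 200
P1 <- rnorm(n)
B  <- 2.0 * P1 + rnorm(n, sd = 0.2)   # strongly explained by P1
A  <- 0.2 * P1 + rnorm(n, sd = 2.0)   # only weakly explained by P1
miss <- sample(n, 50)                  # pretend these 50 cells lack A and B

# One stochastic-regression imputation: predict from the observed rows and
# add residual noise (a package like mice also propagates parameter
# uncertainty, which is skipped here for brevity)
impute_once <- function(y, x) {
  fit <- lm(y[-miss] ~ x[-miss])
  as.numeric(coef(fit)[1] + coef(fit)[2] * x[miss] +
             rnorm(length(miss), sd = summary(fit)$sigma))
}

imps_A <- replicate(20, impute_once(A, P1))   # 50 x 20 matrix of imputations
imps_B <- replicate(20, impute_once(B, P1))

# Average spread of a cell's imputed value across the 20 datasets:
# large for the weakly explained A, small for the strongly explained B
mean(apply(imps_A, 1, sd))
mean(apply(imps_B, 1, sd))
```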

The utility of incorporating this uncertainty is that subsequent analyses can be run for each imputed dataset, and then the results across all imputed datasets pooled to generate a final answer that accounts for imputation uncertainty.


So I would approach your problem by:

  1. Pooling your data from all experiments.
  2. Generating a large number of imputed datasets (using the second approach).
  3. Running identical analyses on each imputed dataset.
  4. Finally, pooling the results from all imputed datasets.

If you use R, these imputation approaches (with several variants) and the subsequent pooling of analyses can be done with the mice package. There are tutorials explaining how to do this with the mice package, and the package's author has written a book describing the approach:

Van Buuren, S. (2012). Flexible imputation of missing data. Chapman and Hall/CRC.
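A minimal sketch of that workflow might look like the following. The data frame `cells` (all experiments stacked row-wise, with unmeasured parameters left as NA) and the particular regression run on the imputed data are hypothetical placeholders; whether the default imputation method suits your block-missing design is something you will need to check.

```r
library(mice)

# `cells`: one row per cell, pooled across experiments; parameters that were
# not measured in a given experiment are NA
imp <- mice(cells, m = 20, seed = 1)        # 20 imputed datasets

# Run the same analysis on every imputed dataset, e.g. a regression of P8
# on some of the other parameters (chosen purely for illustration)
fits <- with(imp, lm(P8 ~ P1 + P2 + P3))

# Pool the results across imputed datasets (Rubin's rules)
summary(pool(fits))

# Or extract a single completed dataset if you only need filled-in values
completed <- complete(imp, 1)
```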

Also take a look at the multiple-imputation tag for related questions: https://stats.stackexchange.com/questions/tagged/multiple-imputation?sort=votes&pageSize=50

mkt
    Thank you very much mkt! Much of my issue was that I did not know it was called "multiple imputation", hence I did not know what to look for! Your answer and links definitely help, thanks! – nico Jul 12 '18 at 07:51