Integrative analysis of omics studies using machine learning

Question

I would like to use public omics datasets (ChIP-seq, RNA-seq, and ATAC-seq) from different studies to do an integrative analysis as follows:

1. Normalise samples, within each omics type, across the different public datasets.
2. Convert the normalised values to a uniform scale so that ChIP-seq, RNA-seq and ATAC-seq can be compared.
3. Feed the normalised, uniformly scaled values into a machine learning method to infer one feature (e.g. RNA expression) from other features (e.g. TF or histone-mark ChIP-seq).

How could I integrate these datasets to predict one group of variables from another?

This question is very general and opinion-based. Could you please make the question more specific? — Bioathlete, Nov 05 '18 at 15:09
What is your goal? Do you have some data at hand? Would the data for the different omics come from the same sample or from different subjects? Would the data be fairly homogeneous or not? How many samples do you have for each dataset? — llrs, Nov 05 '18 at 16:43
These are mouse samples with the same background, so I am assuming homogeneity. I have at least 2 replicates per sample, where A is a precursor to B, and B to C (A -> B -> C).
My goals are: a) to describe the dynamics of different histone marks during differentiation (A -> B -> C); b) to predict gene expression from these histone marks. — Firas, Nov 06 '18 at 05:46
Could you elaborate further, @Firas? How many samples do you have (6)? Do you have two replicates each for A, B and C, or more? Are they technical or biological replicates? With a bit more information I could possibly answer it. — llrs, Nov 06 '18 at 14:28
If you are looking for transcription regulation, then the following article will give you a basic idea. https://www.cell.com/developmental-cell/fulltext/S1534-5807(16)30002-8 — arup, Nov 06 '18 at 19:55
@Llopis Many thanks, I have 5 histone ChIP-seq, ATAC-seq and RNA-seq in 2 technical replicates for each of A, B and C. — Firas, Nov 07 '18 at 01:56

score 2 · Accepted Answer · answered Nov 08 '18 at 20:02

The steps you describe are correct. For step 2, each feature is usually standardized to mean 0 and variance 1. However, the "machine learning" part is the important one.
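As a minimal sketch of the standardization in step 2, assuming each omics layer has already been normalised within its own type and is stored as a samples-by-features matrix: the function name `standardize_features` and the toy matrices below are hypothetical and only illustrate the scaling.

```python
import numpy as np


def standardize_features(matrix):
    """Scale each column (feature) to mean 0 and variance 1."""
    matrix = np.asarray(matrix, dtype=float)
    mean = matrix.mean(axis=0)
    std = matrix.std(axis=0)
    # Constant features have zero variance; keep them at 0 instead of dividing by 0.
    std[std == 0] = 1.0
    return (matrix - mean) / std


# Hypothetical toy data: 6 samples (rows) x 3 features (columns) per omics layer.
rng = np.random.default_rng(0)
rna = rng.normal(loc=8.0, scale=2.0, size=(6, 3))     # stand-in for expression values
chip = rng.normal(loc=50.0, scale=15.0, size=(6, 3))  # stand-in for ChIP-seq signal

rna_z = standardize_features(rna)
chip_z = standardize_features(chip)
```

After this step both layers are on the same scale, so no layer dominates a downstream model only because its raw values are larger.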

Having several samples that are technical replicates will make the integration task easier. However, you have too few samples to make any good prediction; at most I would describe it as an exploratory analysis.
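To illustrate why so few samples limit prediction, here is a small simulation under stated assumptions: 6 samples, 50 features, and a response that is pure noise. All names and numbers are hypothetical. With more features than samples, an ordinary least-squares fit reproduces the training data essentially exactly even though there is nothing to learn, and it does not carry over to new data.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_features = 6, 50

# Hypothetical "histone mark" features and an unrelated "expression" response.
histone = rng.normal(size=(n_samples, n_features))
expression = rng.normal(size=n_samples)

# Least-squares fit (minimum-norm solution, since features outnumber samples).
coef, *_ = np.linalg.lstsq(histone, expression, rcond=None)
train_error = np.mean((histone @ coef - expression) ** 2)

# Fresh data drawn the same way: the fitted model has no real signal to reuse.
new_histone = rng.normal(size=(n_samples, n_features))
new_expression = rng.normal(size=n_samples)
test_error = np.mean((new_histone @ coef - new_expression) ** 2)

print(f"training error: {train_error:.2e}, error on new data: {test_error:.2e}")
```

This is one reason to prefer regularized methods and to treat the results as exploratory.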

I recommend using regularized canonical correlation analysis for this kind of multi-omics analysis. The method finds the features of one dataset that correlate (or covary, depending on a parameter) with the features of the other datasets. You could then use those features to predict the position of the other.
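The idea can be sketched for two blocks with plain NumPy. This is a simplified ridge-regularized CCA written for illustration, not the algorithm used by any particular package; the function name `regularized_cca`, the penalties `reg_x` and `reg_y`, and the toy data are all assumptions.

```python
import numpy as np


def regularized_cca(x, y, reg_x=0.1, reg_y=0.1):
    """First pair of canonical weights for two blocks, with ridge penalties."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    n = x.shape[0]

    # Regularized within-block covariances and the cross-covariance.
    cxx = x.T @ x / (n - 1) + reg_x * np.eye(x.shape[1])
    cyy = y.T @ y / (n - 1) + reg_y * np.eye(y.shape[1])
    cxy = x.T @ y / (n - 1)

    # Whiten each block via Cholesky factors, then take the SVD.
    lx = np.linalg.cholesky(cxx)
    ly = np.linalg.cholesky(cyy)
    k = np.linalg.solve(lx, cxy)       # lx^{-1} cxy
    k = np.linalg.solve(ly, k.T).T     # lx^{-1} cxy ly^{-T}
    u, s, vt = np.linalg.svd(k, full_matrices=False)

    weights_x = np.linalg.solve(lx.T, u[:, 0])
    weights_y = np.linalg.solve(ly.T, vt[0])
    return weights_x, weights_y, s[0]


# Hypothetical toy data: 6 samples sharing one latent signal across both blocks.
rng = np.random.default_rng(2)
latent = rng.normal(size=(6, 1))
histone = latent @ rng.normal(size=(1, 20)) + 0.5 * rng.normal(size=(6, 20))
expression = latent @ rng.normal(size=(1, 10)) + 0.5 * rng.normal(size=(6, 10))

w_histone, w_expression, rho = regularized_cca(histone, expression)
# Features with the largest absolute weights contribute most to the shared component.
top_histone_features = np.argsort(-np.abs(w_histone))[:5]
```

The penalties make the covariance matrices invertible when features outnumber samples, which is the situation here; in practice they would be tuned, for example by cross-validation.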

There are several implementations of this method: the original is in the RGCCA package, mixOmics offers other procedures with cross-validation, and STATegRa has other related methods.
