
I'm looking for imputation options for a high-dimensional DNA methylation (bisulfite sequencing) dataset, with dimensions on the order of 50-100 samples × ~500,000 CpG loci/features.

I've used K-nearest neighbors, but it seems that this method is not very accurate. As far as I can tell, it limits the minimum number of "genes/features" imputed at one time to something like 1,500, which is kind of a lot.
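For reference, a minimal sketch of this kind of KNN imputation, assuming the Bioconductor `impute` package's `impute.knn()` (whose `maxp` argument appears to be the ~1,500-feature block limit mentioned above) and a hypothetical loci × samples matrix `meth` with NAs for missing values:

```r
# Minimal sketch (assumes the Bioconductor 'impute' package and a hypothetical
# matrix 'meth' with CpG loci in rows, samples in columns, NAs for missing values)
library(impute)

res <- impute.knn(as.matrix(meth),
                  k      = 10,    # number of nearest neighbours
                  rowmax = 0.5,   # max fraction of missing values allowed per locus
                  colmax = 0.8,   # max fraction of missing values allowed per sample
                  maxp   = 1500)  # largest block of loci imputed directly by KNN

meth_imputed <- res$data          # complete matrix for downstream modelling
```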

I've also used missForest to impute smaller datasets with greater accuracy, but it seems computationally infeasible to run it on the full 50 × 500,000 dataset.
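A minimal sketch of that kind of missForest run on a manageable subset of loci, assuming a hypothetical data frame `meth_subset` (samples in rows, a few thousand CpG columns); the `parallelize = "forests"` option needs a registered `doParallel` backend:

```r
# Sketch: random-forest imputation on a subset of loci
# (assumes a hypothetical data frame 'meth_subset': samples in rows, CpG columns)
library(missForest)
library(doParallel)

registerDoParallel(cores = 4)             # backend required for parallelize = "forests"

rf_res <- missForest(meth_subset,
                     maxiter = 10,        # max number of imputation iterations
                     ntree   = 100,       # trees per forest
                     parallelize = "forests")

meth_subset_imputed <- rf_res$ximp        # imputed data
rf_res$OOBerror                           # out-of-bag imputation error estimate
```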

I need to impute values because I'm using the data for statistical modeling that requires complete cases.

Does anyone know of better alternatives to K-nearest neighbors for large-scale imputation?

Reilstein

1 Answer


I think this is still an active field of research. I have heard of Phenix, which might be appropriate.

winni2k
  • I agree it is still an active area of research. I will check out Phenix, looks like it could be promising. Thanks. – Reilstein Sep 20 '17 at 03:37
  • @Reilstein: can you say anything about your experience with Phenix? – winni2k Dec 21 '17 at 10:17
  • I never ended up using Phenix because it appears to be tailored for samples/individuals with a degree of relatedness. My particular study doesn't have samples from related individuals. I opted for K-nearest neighbor imputation while keeping the total missing values allowed below 5% prior to imputation (a sketch of that filtering step appears after these comments). – Reilstein Dec 21 '17 at 19:10
  • The description says "arbitrary level of relatedness". From my discussions with the author, I would guess that the level of appropriate relatedness includes the level of relatedness of "unrelated" human individuals. – winni2k Jan 13 '18 at 15:41
  • ah okay, well I didn't look into it in enough depth apparently. Thanks for following through with the author. If I decide KNN Imputation isn't accurate enough for my purposes I will revisit this. Thanks. – Reilstein Jan 15 '18 at 04:20
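A minimal sketch of the filter-then-KNN step described in the comments above, again assuming a hypothetical `meth` matrix with CpG loci in rows and the Bioconductor `impute` package:

```r
# Sketch of the pre-filtering step described above (hypothetical 'meth' matrix,
# CpG loci in rows, samples in columns, NAs for missing values)
miss_frac <- rowMeans(is.na(meth))        # fraction of missing values per locus
meth_kept <- meth[miss_frac < 0.05, ]     # keep loci with < 5% missing values

library(impute)
meth_complete <- impute.knn(as.matrix(meth_kept), k = 10)$data
```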