Preparing binary matrix data for Scikit classification algorithms

Question

I made this post in regular stack overflow but I was told about this awesome feature by @nbryans.

I am a researcher (my programming knowledge is small) conducting analysis on a set of antibiotic (methicillin) resistant and a set of antibiotic (methicillin) susceptible. The sole difference between these two sets is assumed to be the resistance (R) vs susceptibility (S). I want to find if a genetic element in the resistant or susceptible genomes is correlated to resistance (positively or negatively).

I have done a complete fisher exact test already and will detail how I did so below. However I would like to figure out how to prepare my data for a more complex learning algorithm such as those offered by sci-kit.

(many of my techniques were learned from this article)

First I used KMC to split each fasta file for both R and S genomes into k-mers (I don't have enough "reputation" so I can't use >2 links but k-mers are essentially genetic sequences of k length). The output for KMC was a dump of k-mers present in a genome.

Example: kmc dump

AAAAAAAAAACGAATGTACACAATCGAA    1
AAAAAAAAACGAATGTACACAATCGAAC    1
AAAAAAAAATCAAATCCTGACTATTTAG    1
AAAAAAAAATCGGTCAATTCATTAAAAG    1
AAAAAAAACAAACATGAAGACCTTGTTA    1
AAAAAAAACGAATGTACACAATCGAACA    1
AAAAAAAACGTGTTAAAGTGAATCACAC    1

I then merged the data into one binary matrix which had columns as genomes and rows as k-mers.

Example: 1 resistant genome and 1 susceptible genome

kmer    R_1    S_1
AAAAAAAAAACGAATGTACACAATCGAA    1   0
AAAAAAAAACGAATGTACACAATCGAAC    1   0
AAAAAAAAATCAAATCCTGACTATTTAG    1   0
AAAAAAAAATCGGTCAATTCATTAAAAG    1   0
AAAAAAAAATTCCCTTCTAATCTTGAAT    0   1
AAAAAAAACAAAAATTATATAAAGCGAA    0   1
AAAAAAAACAAACATGAAGACCTTGTTA    1   1
AAAAAAAACAACCACCCATACATTGAGT    0   1
AAAAAAAACCCTTACAACAAATATGTAA    0   1

This is the data I have been working with and would like to analyze further with sci kit or other classification algorithms. Essentially each k-mer is a feature whose correlation with resistance is based on its presence in R vs S genomes. This is where I would like help transitioning this data into a dataset format suitable for sci kit.

I conducted a Fisher Exact test on each k-mer using R. For each k-mer a 2x2 matrix was created with the first column as resistant and the second column as susceptible. The two rows were present or not. So four numbers (# times k-mer was present in R, # in susceptible, # not in R, # not in S).

      R    S
yes   #    #
no    #    #

I used the following code in R:

phenotype = as.numeric(grepl('^R_', colnames(raw_kmer_table)[2:ncol(raw_kmer_table)]))
fe_results = apply(kmer_pres_abs_matrix, 1, 
               FUN = function(row) {
                 fe_mat = matrix(0, ncol = 2, nrow = 2)
                 fe_mat[1,1] = sum(row == 1 & phenotype == 1)
                 fe_mat[1,2] = sum(row == 1 & phenotype == 0)
                 fe_mat[2,1] = sum(row == 0 & phenotype == 1)
                 fe_mat[2,2] = sum(row == 0 & phenotype == 0)

                 fe = fisher.test(fe_mat)
                 return(fe$p.value)
               }
)

Now that you have read how my data is currently formatted I would like to hear suggestions on how I can fit this data into a more complex test using sci kit or another resource. Algorithms I am interested in using are any you see fit and previously field-tested algorithms such as adaboost and forest learning algorithms. Also please ask me any clarifications or things I may have left out of this post!

score 2 · Answer 1 · answered Jun 14 '17 at 15:33

Since you are using R, you probably don't want to use scikit-learn, which is for Python. However, there is a similar R library mlr ("R package to make machine learning in R easy") that provides a unified interface to all popular machine learning methods. Check their tutorial on how to get started.

Preparing binary matrix data for Scikit classification algorithms

1 Answers1