I have a binary gene presence/absence matrix (Roary output) that looks like this:

Gene    sample1 sample2 sample3 sample4
fliI        1       1       1        1
patB_1      1       1       1        1
pgpA        1       1       1        1
osmB        1       1       1        1
cspA        1       0       1        1

How can I convert it to FASTA so that it looks like the following and is technically aligned?

>sample1
11111
>sample2
11110

Is it possible to do this in R?

AudileF

3 Answers

It's not a fasta file, but:

> m
       sample1 sample2 sample3 sample4
fliI         1       1       1       1
patB_1       1       1       1       1
pgpA         1       1       1       1
osmB         1       1       1       1
cspA         1       0       1       1
> # Collapse to labeled strings
> blah = apply(m, 2, function(x) paste(x, collapse=''))
> blah
sample1 sample2 sample3 sample4 
"11111" "11110" "11111" "11111" 
> # Write that to a file with the appropriate format
> cat(paste(mapply(function(x, y) sprintf(">%s\n%s\n", x, y), names(blah), blah), collapse=""), file="some file.fa")

Make sure to change "some file.fa" to the output file name you want (blah.txt in the example below). I have no idea why you would want to do all of this, but this will give you the output you want:

$ cat blah.txt
>sample1
11111
>sample2
11110
>sample3
11111
>sample4
11111
Devon Ryan
  • Thanks Devon. This looks great. I want it for phylogenetic analysis, so it's important that the layout comes out as in the matrix so that it's technically aligned. – AudileF Jul 21 '17 at 13:34

A mixture of apply, paste and cat will do this:

> data.mat
       sample1 sample2 sample3 sample4
fliI         1       1       1       1
patB_1       1       1       1       1
pgpA         1       1       1       1
osmB         1       1       1       1
cspA         1       0       1       1
> data.conv <- apply(data.mat,2,paste,collapse="") # aggregate along 2nd dim
> cat(paste0(">", names(data.conv), "\n", data.conv), sep="\n")
>sample1
11111
>sample2
11110
>sample3
11111
>sample4
11111

If writing to a file is required, just add a file parameter to the cat command:

cat(paste0(">", names(data.conv), "\n", data.conv), 
    sep="\n", file="out_fastaLike.txt")
gringer

One could also use cut and a pair of Python scripts:

transpose.py

#!/usr/bin/env python

import sys

for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip())):
    sys.stdout.write("%s\n" % ('\t'.join(c)))

mock_fasta.py

#!/usr/bin/env python

import sys

for e in (l.split() for l in sys.stdin.readlines() if l.strip()):
    sys.stdout.write(">%s\n%s\n" % (e[0], ''.join(e[1:])))

Then:

$ cut -f2- in.mtx | ./transpose.py | ./mock_fasta.py > in.fa

Which gives:

$ more in.fa
>sample1
11111
>sample2
11110
>sample3
11111
>sample4
11111
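
If a single script is more convenient than the pipeline, the transpose and formatting steps can be folded together. The following is only a sketch: the function name matrix_to_fasta is made up for illustration, and it assumes whitespace-delimited input whose first row is a header starting with a column label such as "Gene", as in the question.

```python
def matrix_to_fasta(lines):
    """Return FASTA-like text with one record per sample column.

    Assumes (hypothetically) a header row whose first field labels the
    gene column, followed by one row per gene.
    """
    rows = [line.split() for line in lines if line.strip()]
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    samples = header[1:]  # drop the gene-column label
    records = []
    for i, sample in enumerate(samples, start=1):
        # Join column i of every gene row in row order, so position k of
        # every sequence refers to the same gene (i.e. stays "aligned").
        seq = "".join(row[i] for row in body)
        records.append(">%s\n%s\n" % (sample, seq))
    return "".join(records)


# Small demonstration using the matrix from the question.
example = """\
Gene    sample1 sample2 sample3 sample4
fliI        1       1       1        1
patB_1      1       1       1        1
pgpA        1       1       1        1
osmB        1       1       1        1
cspA        1       0       1        1
"""
print(matrix_to_fasta(example.splitlines()), end="")
```

To run it on a file instead, pass an open file handle, for example matrix_to_fasta(open("in.mtx")), and write the returned string wherever it is needed.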

Why anyone would use R for text parsing is beyond me, but kids these days etc.

Alex Reynolds