
I have a problem with R. This is a snapshot of my real dataset:

My Dataset Snapshot

I want to create a variable that indicates whether at least one gene from my list of genes is present in column D of my dataset (1 if present, 0 if not).

- An example of a list of genes that interest me: gene<-c("gene1|gene2|gene3|gene4")

Column D in my dataset holds the genes present in each individual (one set of genes per individual per line, separated by commas).

Which function can I use?

Manel

1 Answer

You really shouldn't store multiple words in the same element. Make vectors like this:

genes <- c("gene1","gene2","gene3","gene4","gene5")
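
If the genes of interest already exist as one pipe-separated string, as in the question, they can be split into such a vector instead of being retyped. A minimal sketch, assuming the string is called gene as in the question; fixed = TRUE is used because | is a regular-expression metacharacter. The last line shows why matching the pipe string directly with grepl can mislead: "gene1" also matches inside "gene10".

```r
# The pipe-separated string from the question
gene <- c("gene1|gene2|gene3|gene4")

# Split on the literal "|" to get one gene per element
genes <- strsplit(gene, "|", fixed = TRUE)[[1]]
genes
# [1] "gene1" "gene2" "gene3" "gene4"

# Pitfall of using the string as a regex: partial matches count
grepl(gene, "gene9,gene10")
# [1] TRUE
```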

Anyway, assuming that you work with a data frame called df and assuming that your fourth column entries are indeed one single string where genes are separated by commas:

lis <- strsplit(df[,4], ",")

This gives us a list instead of a data frame, in which every element holds that row's genes as separate strings. Next, make a vector of the genes you are interested in (like above). Finally, do:

tab <- sapply(lis,function(x) any(genes %in% x))

Basically, for each row, %in% checks each of the genes of interest against that row's genes. Then any() returns TRUE if at least one of those comparisons is TRUE. So if any of the genes is found in x, the result for that row is TRUE.
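
Two things can trip this up in real data: strsplit only accepts character input, so a column stored as a factor fails with a "non-character argument" error, and spaces after the commas would keep "gene1" from matching " gene1". A hedged sketch of one way to guard against both, using a made-up data frame df2 with a column named D (both names are assumptions for illustration):

```r
# Hypothetical data: a factor column with spaces after some commas
df2 <- data.frame(id = 1:3,
                  D = c("gene1, gene7", "gene10,gene12", "gene8, gene4"),
                  stringsAsFactors = TRUE)
genes <- c("gene1", "gene2", "gene3", "gene4")

# as.character() converts the factor; ",\\s*" splits on a comma
# plus any whitespace that follows it
lis2 <- strsplit(as.character(df2$D), ",\\s*")
tab2 <- sapply(lis2, function(x) any(genes %in% x))
tab2
# [1]  TRUE FALSE  TRUE
```

Because %in% compares whole strings, "gene10" in the second row does not count as a match for "gene1".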

For example:

df <- structure(list(col1 = 1:10, col2 = 1:10, col3 = 1:10, col4 = c("gene1,gene2,gene3", 
"gene2,gene3", "gene6,gene8", "gene9,gene10", "gene1,gene2,gene10", 
"gene5", "gene3,gene6", "gene1,gene2,gene8", "gene6,gene7", "gene1,gene4"
)), .Names = c("col1", "col2", "col3", "col4"), row.names = c(NA, 
-10L), class = "data.frame")

genes <- c("gene1","gene2","gene3","gene4","gene5")

lis <- strsplit(df[,4], ",")
tab <- sapply(lis,function(x) any(genes %in% x))
tab
# [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE

df
#    col1 col2 col3               col4
# 1     1    1    1  gene1,gene2,gene3
# 2     2    2    2        gene2,gene3
# 3     3    3    3        gene6,gene8
# 4     4    4    4       gene9,gene10
# 5     5    5    5 gene1,gene2,gene10
# 6     6    6    6              gene5
# 7     7    7    7        gene3,gene6
# 8     8    8    8  gene1,gene2,gene8
# 9     9    9    9        gene6,gene7
# 10   10   10   10        gene1,gene4
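
The question asks for a 1/0 variable rather than TRUE/FALSE. One way to get it, sketched here on a shortened version of the example data, is to convert the logical vector with as.integer and store it as a new column; the column name has_gene is an assumption for illustration.

```r
# Shortened example data with the same layout as col4 above
df <- data.frame(col4 = c("gene1,gene2,gene3", "gene6,gene8", "gene5"),
                 stringsAsFactors = FALSE)
genes <- c("gene1", "gene2", "gene3", "gene4", "gene5")

# TRUE/FALSE per row, as in the answer
tab <- sapply(strsplit(df$col4, ","), function(x) any(genes %in% x))

# as.integer() maps TRUE to 1 and FALSE to 0
df$has_gene <- as.integer(tab)
df$has_gene
# [1] 1 0 1
```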

Edit: Adjusted script according to clearer description.

slamballais
  • First thank you for your quick answer. I tried your Script but when i use : lis – Manel Jan 29 '16 at 04:51
  • I think i resolved the problem of string error by : data1$RefSeqGenes – Manel Jan 29 '16 at 05:22
  • I updated the script so that if any of the genes is present it will be set to `TRUE`, otherwise `FALSE`. If you want `1` and `0`, just write `tab – slamballais Jan 29 '16 at 07:53
  • I did a check up manually for some genes to be sure that the script is working and it is working!!! Thanks a lot for the help. – Manel Feb 01 '16 at 02:01
  • I would have another question please. What if in place of having only one list of gene that interest me ( list1 – Manel Feb 02 '16 at 03:00
  • Thanks a lot for the quick reply – Manel Feb 02 '16 at 19:26
  • A last question, please. Do you think it would be possible in R to select not only by genes but also by coordinates? As you can see in the snapshot above, column 4 holds the start coordinate of the gene and column 5 the stop coordinate. I want to select, for example, rows that contain gene X and whose coordinates cover at least 80% of the region from 100000 to 200000. That is, if a row has gene X and its coordinates are 100000-200000, 80000-180000, 120000-220000, ..., it is selected, because at least 80% of the region (100000-200000) is present for gene X. – Manel Feb 02 '16 at 20:01
  • That's a completely different kind of question. You may want to Google around for that. For example, this sounds very similar: http://stackoverflow.com/questions/24766104/checking-if-value-in-vector-is-in-range-of-values-in-different-length-vector Also, check out `?findInterval` – slamballais Feb 02 '16 at 20:09
  • Thanks i will post a new topic because i could not find exactly how to do it – Manel Feb 02 '16 at 20:44
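
For the coordinate question raised in the comments, the same strsplit approach can be combined with an overlap fraction computed from the start and stop columns. This is only a sketch under stated assumptions, not something confirmed in the thread: the data frame df3, its column names, and the gene name geneX are all made up, and "80% overlap" is read as the share of the target region covered by the row's interval.

```r
# Hypothetical rows: gene lists plus start/stop coordinates
df3 <- data.frame(
  genes = c("geneX,gene2", "geneX", "gene3,geneX", "gene5", "geneX"),
  start = c(100000, 80000, 120000, 100000, 150000),
  stop  = c(200000, 180000, 220000, 200000, 300000),
  stringsAsFactors = FALSE
)
region_start <- 100000
region_stop  <- 200000

# Length of the intersection with the target region (0 if disjoint)
overlap <- pmax(0, pmin(df3$stop, region_stop) - pmax(df3$start, region_start))

# Fraction of the target region that each row covers
frac <- overlap / (region_stop - region_start)

# Rows that list geneX and cover at least 80% of the region
has_gene <- sapply(strsplit(df3$genes, ","), function(x) "geneX" %in% x)
keep <- has_gene & frac >= 0.8
keep
# [1]  TRUE  TRUE  TRUE FALSE FALSE
```

The fourth row covers the whole region but lacks geneX, and the fifth has geneX but covers only half of the region, so both are dropped.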