1

I have a list here that looks like this:

head(h)
[[1]]
[1] "gene=dnaA"             "locus_tag=CD630_00010" "location=1..1320"     

[[2]]
character(0)

[[3]]
[1] "locus_tag=CD630_05950"   "location=719777..720313"

[[4]]
[1] "gene=dnrA"             "locus_tag=CD630_00010" "location=50..1320" 

I'm having trouble turning this list into a data.frame with three columns. Rows with missing gene info should get "gene=unnamed", and the empty entries should be dropped entirely, giving a matrix like this:

     [,1]           [,2]                    [,3]                     
[1,] "gene=dnaA"    "locus_tag=CD630_00010" "location=1..1320"       
[2,] "gene=unnamed" "locus_tag=CD630_05950" "location=719777..720313"
[3,] "gene=dnrA"    "locus_tag=CD630_00010" "location=50..1320"      

This is what I have right now, but I get an error about missing values in the gene column. Any suggestions?

  h <- data.frame(h[lapply(h,length)>0])
  h <- t(h)
  rownames(h) <- NULL
alki

2 Answers

1

There are several ways to bind list elements of unequal length: see `bind_rows` from dplyr, `rbind.fill` from plyr, or `rbindlist` from data.table. Here is an approach using base R:

## Sample data
h <- list(letters[1:3],
          character(0),
          letters[4:5])

out <- do.call(rbind, lapply(h, `length<-`, 3))  # fix lengths and make matrix
out <- out[rowSums(!is.na(out))>0, ]             # remove empty rows
out[is.na(out)] <- "gene=unnamed"                # rename NA

data.frame(out)
#   X1 X2           X3
# 1  a  b            c
# 2  d  e gene=unnamed
Rorschach
  • In your answer, everything seems to be pushed to the left when you are fixing the number of columns. How would you push everything to the right if you want the NA values to be in X1? – alki Jul 23 '15 at 06:11
  • @Chani yes, that is a problem because the lists aren't named, so it is ambiguous which column they belong to when there are missing values. To always push right try `do.call(rbind, lapply(h, function(x) rev(\`length – Rorschach Jul 23 '15 at 06:14
  • I tried looking into rbindlist, as it is much faster on large lists. I'm trying `rbindlist(lapply(h, function(x) rev(length – alki Jul 23 '15 at 06:19
  • yea it should be fast. try `rbindlist(lapply(h, function(x) as.list(rev(\`length – Rorschach Jul 23 '15 at 06:29
  • Haha, now it completely reverses the order of the columns. It now becomes `location | locus | gene` instead of `gene | locus | location` while correctly pushing everything to the right – alki Jul 23 '15 at 06:33
  • @Chani ;p my bad, try this `do.call(rbind, lapply(h, function(x) if (any(is.na((res – Rorschach Jul 23 '15 at 06:39
  • I meant that your `rbindlist` way is reversed. Your `do.call` function was fine, but I just wanted to use a faster method – alki Jul 23 '15 at 06:43
  • Yeah I figured it out from here, thanks for the help – alki Jul 23 '15 at 06:49
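The "push right" idea from this comment thread can be sketched in base R without reversing the column order. This is a minimal sketch, not the code from the comments (which is truncated above): `pad_left` is a hypothetical helper, and it assumes that any missing field is always the leading one (the gene), as in the question's data.

```r
# Sample data shaped like the list in the question
h <- list(c("gene=dnaA", "locus_tag=CD630_00010", "location=1..1320"),
          character(0),
          c("locus_tag=CD630_05950", "location=719777..720313"),
          c("gene=dnrA", "locus_tag=CD630_00010", "location=50..1320"))

# Hypothetical helper: prepend NAs so every vector has length n,
# keeping the existing elements in their original order
pad_left <- function(x, n) c(rep(NA_character_, n - length(x)), x)

h <- h[lengths(h) > 0]                             # drop empty entries first
out <- do.call(rbind, lapply(h, pad_left, n = 3))  # one row per element
out[is.na(out)] <- "gene=unnamed"                  # only leading slots can be NA here
```

Because the padding is added in front rather than by reversing, the columns stay in `gene | locus | location` order. The same padded vectors could presumably be wrapped in `as.list()` and passed to `rbindlist`, though that variant is not tested here.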
1
# Data

l <- list(c("gene=dnaA", "locus_tag=CD630_00010", "location=1..1320"),
          character(0),
          c("locus_tag=CD630_05950", "location=719777..720313"),
          c("gene=dnrA", "locus_tag=CD630_00010", "location=50..1320"))

# Manipulation

n <- sapply(l, length)                  # length of each element
seq.max <- seq_len(max(n))              # index 1..3, the widest element
df <- t(sapply(l, "[", i = seq.max))    # pad short elements with NA on the right
df <- t(apply(df, 1, function(x) {
  c(x[is.na(x)], x[!is.na(x)])          # move the NAs to the front of each row
}))
df <- df[rowSums(!is.na(df)) > 0, ]     # drop rows that are entirely NA
df[is.na(df)] <- "gene=unnamed"         # label the missing gene

Output:

     [,1]           [,2]                    [,3]                     
[1,] "gene=dnaA"    "locus_tag=CD630_00010" "location=1..1320"       
[2,] "gene=unnamed" "locus_tag=CD630_05950" "location=719777..720313"
[3,] "gene=dnrA"    "locus_tag=CD630_00010" "location=50..1320"      
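
Since the question ultimately asks for a data.frame with three columns, one alternative is to place each string by the key before `=` rather than by position, which sidesteps the left/right padding question. This is only a sketch under assumptions not stated in the original: the keys are exactly `gene`, `locus_tag`, and `location`, each appears at most once per element, and `to_row` is a hypothetical helper.

```r
l <- list(c("gene=dnaA", "locus_tag=CD630_00010", "location=1..1320"),
          character(0),
          c("locus_tag=CD630_05950", "location=719777..720313"),
          c("gene=dnrA", "locus_tag=CD630_00010", "location=50..1320"))

keys <- c("gene", "locus_tag", "location")   # assumed set of field names

# Hypothetical helper: put each "key=value" string in the slot for its key
to_row <- function(x) {
  found <- sub("=.*$", "", x)                # text before the first "="
  x[match(keys, found)]                      # NA where a key is absent
}

l <- l[lengths(l) > 0]                       # drop empty entries
m <- do.call(rbind, lapply(l, to_row))       # one row per element
m[is.na(m[, 1]), 1] <- "gene=unnamed"        # fill only the gene column

res <- data.frame(m, stringsAsFactors = FALSE)
names(res) <- keys
```

Matching by key means a missing `locus_tag` or `location` would stay `NA` in its own column instead of shifting the other values sideways.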
mpalanco