37

The goal is to convert a nested list that sometimes contains missing records into a data frame. An example of the structure when there are missing records:

str(mylist)

List of 3
 $ :List of 7
  ..$ Hit    : chr "True"
  ..$ Project: chr "Blue"
  ..$ Year   : chr "2011"
  ..$ Rating : chr "4"
  ..$ Launch : chr "26 Jan 2012"
  ..$ ID     : chr "19"
  ..$ Dept   : chr "1, 2, 4"
 $ :List of 2
  ..$ Hit  : chr "False"
  ..$ Error: chr "Record not found"
 $ :List of 7
  ..$ Hit    : chr "True"
  ..$ Project: chr "Green"
  ..$ Year   : chr "2004"
  ..$ Rating : chr "8"
  ..$ Launch : chr "29 Feb 2004"
  ..$ ID     : chr "183"
  ..$ Dept   : chr "6, 8"

When no records are missing, the list can be converted into a data frame using data.frame(do.call(rbind.data.frame, mylist)). When records are missing, however, this results in a column mismatch. I know there are functions for merging data frames with non-matching columns, but I have yet to find one that can be applied to lists. The ideal outcome would keep record 2, with NA for all variables. Any help would be appreciated.

Edit to add dput(mylist):

list(structure(list(Hit = "True", Project = "Blue", Year = "2011", 
Rating = "4", Launch = "26 Jan 2012", ID = "19", Dept = "1, 2, 4"), .Names = c("Hit", 
"Project", "Year", "Rating", "Launch", "ID", "Dept")), structure(list(
Hit = "False", Error = "Record not found"), .Names = c("Hit", 
"Error")), structure(list(Hit = "True", Project = "Green", Year = "2004", 
Rating = "8", Launch = "29 Feb 2004", ID = "183", Dept = "6, 8"), .Names = c("Hit", 
"Project", "Year", "Rating", "Launch", "ID", "Dept")))
Ritchie Sacramento

4 Answers

40

You can also use rbindlist() from the data.table package (v1.9.3 or later) with fill=TRUE:

library(data.table)

rbindlist(mylist, fill=TRUE)

##      Hit Project Year Rating      Launch  ID    Dept            Error
## 1:  True    Blue 2011      4 26 Jan 2012  19 1, 2, 4               NA
## 2: False      NA   NA     NA          NA  NA      NA Record not found
## 3:  True   Green 2004      8 29 Feb 2004 183    6, 8               NA
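To make the effect of fill=TRUE concrete, here is a rough base R sketch of the same idea: take the union of all field names, pad each record with NA where a field is absent, then stack the rows. This is only an illustration, not how data.table implements it; pad_record() is a hypothetical helper, and the sketch assumes every field is a length-1 character value, as in the question's data.

```r
# mylist as given in the question
mylist <- list(
  list(Hit = "True", Project = "Blue", Year = "2011", Rating = "4",
       Launch = "26 Jan 2012", ID = "19", Dept = "1, 2, 4"),
  list(Hit = "False", Error = "Record not found"),
  list(Hit = "True", Project = "Green", Year = "2004", Rating = "8",
       Launch = "29 Feb 2004", ID = "183", Dept = "6, 8")
)

# Union of all field names, in order of first appearance
all_cols <- unique(unlist(lapply(mylist, names)))

# pad_record() (hypothetical helper): look up every column in one record,
# substituting NA where the field is absent, and return a one-row data frame
pad_record <- function(rec) {
  vals <- lapply(all_cols, function(col) {
    if (is.null(rec[[col]])) NA_character_ else rec[[col]]
  })
  as.data.frame(setNames(vals, all_cols), stringsAsFactors = FALSE)
}

# Stack the padded one-row data frames
filled <- do.call(rbind, lapply(mylist, pad_record))
```

The result has one row per record and one column per field seen anywhere in the list, so record 2 is kept with NA in every column it did not supply.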
Arun
hrbrmstr
  • 1
    [1.9.4 is now available on CRAN](http://cran.r-project.org/web/packages/data.table/index.html) (although it may take a day more for remaining binaries to be available). – Arun Oct 03 '14 at 11:27
  • 5
    @hrbrmstr are you aware of a workaround that permits a non-uniform list structure? I'm running into `rbind/rbindlist doesn't recycle as it already expects each item to be a uniform list, data.frame or data.table`. – msoderstrom Jul 29 '18 at 11:05
  • 1
    I get this error: Error in data.table::rbindlist(mylist, fill = TRUE) : Column 3 of item 1 is length 2 inconsistent with column 5 which is length 3. Only length-1 columns are recycled. – PM0087 Nov 20 '20 at 20:00
  • How about nested lists? – PM0087 Feb 17 '21 at 17:19
21

You could create a list of data.frames:

dfs <- lapply(mylist, data.frame, stringsAsFactors = FALSE)

Then use one of these:

library(plyr)
rbind.fill(dfs)

or the faster

library(dplyr)
bind_rows(dfs) # in earlier versions: rbind_all(dfs)

In the case of dplyr::bind_rows, I am surprised that it uses "" instead of NA for missing data. If you remove stringsAsFactors = FALSE, you get NA, but at the cost of a warning, so suppressWarnings(bind_rows(lapply(mylist, data.frame))) would be an ugly but fast solution.
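If a binding step does leave "" where a field was absent, the result can be normalised afterwards in base R. The sketch below uses a small hypothetical data frame standing in for the bound result; the optional type.convert() step is an addition, not part of the answer above.

```r
# Hypothetical bound result in which absent fields came back as "" rather than NA
df <- data.frame(
  Hit     = c("True", "False", "True"),
  Project = c("Blue", "", "Green"),
  Year    = c("2011", "", "2004"),
  stringsAsFactors = FALSE
)

# Turn empty strings into NA, column by column
df[] <- lapply(df, function(col) {
  col[!is.na(col) & col == ""] <- NA
  col
})

# Optionally let R guess better column types
# (as.is = TRUE keeps text columns as character instead of factor)
df[] <- lapply(df, type.convert, as.is = TRUE)
```

Assigning into df[] keeps the object a data frame while replacing every column.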

Dima Lituiev
flodel
  • 16
    `rbind_all()` is deprecated. Please use `bind_rows()` instead. – psychonomics Jan 30 '17 at 16:25
  • What if in some of the rows, there is missing data for some of the columns? Just empty in the database (no NA or NULL) – PM0087 Nov 20 '20 at 19:57
  • I get this error: Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 1, 0 – PM0087 Nov 20 '20 at 19:58
12

I just developed a solution for a related question that also applies here, so I'll post it here as well:

tl <- function(e) { if (is.null(e)) return(NULL); ret <- typeof(e); if (ret == 'list' && !is.null(names(e))) ret <- list(type='namedlist') else ret <- list(type=ret,len=length(e)); ret; };
mkcsv <- function(v) paste0(collapse=',',v);
keyListToStr <- function(keyList) paste0(collapse='','/',sapply(keyList,function(key) if (is.null(key)) '*' else paste0(collapse=',',key)));

extractLevelColumns <- function(
    nodes, ## current level node selection
    ..., ## additional arguments to data.frame()
    keyList=list(), ## current key path under main list
    sep=NULL, ## optional string separator on which to join multi-element vectors; if NULL, will leave as separate columns
    mkname=function(keyList,maxLen) paste0(collapse='.',if (is.null(sep) && maxLen == 1L) keyList[-length(keyList)] else keyList) ## name builder from current keyList and character vector max length across node level; default to dot-separated keys, and remove last index component for scalars
) {
    cat(sprintf('extractLevelColumns(): %s\n',keyListToStr(keyList)));
    if (length(nodes) == 0L) return(list()); ## handle corner case of empty main list
    tlList <- lapply(nodes,tl);
    typeList <- do.call(c,lapply(tlList,`[[`,'type'));
    if (length(unique(typeList)) != 1L) stop(sprintf('error: inconsistent types (%s) at %s.',mkcsv(typeList),keyListToStr(keyList)));
    type <- typeList[1L];
    if (type == 'namedlist') { ## hash; recurse
        allKeys <- unique(do.call(c,lapply(nodes,names)));
        ret <- do.call(c,lapply(allKeys,function(key) extractLevelColumns(lapply(nodes,`[[`,key),...,keyList=c(keyList,key),sep=sep,mkname=mkname)));
    } else if (type == 'list') { ## array; recurse
        lenList <- do.call(c,lapply(tlList,`[[`,'len'));
        maxLen <- max(lenList,na.rm=TRUE);
        allIndexes <- seq_len(maxLen);
        ret <- do.call(c,lapply(allIndexes,function(index) extractLevelColumns(lapply(nodes,function(node) if (length(node) < index) NULL else node[[index]]),...,keyList=c(keyList,index),sep=sep,mkname=mkname))); ## must be careful to translate out-of-bounds to NULL; happens automatically with string keys, but not with integer indexes
    } else if (type%in%c('raw','logical','integer','double','complex','character')) { ## atomic leaf node; build column
        lenList <- do.call(c,lapply(tlList,`[[`,'len'));
        maxLen <- max(lenList,na.rm=TRUE);
        if (is.null(sep)) {
            ret <- lapply(seq_len(maxLen),function(i) setNames(data.frame(sapply(nodes,function(node) if (length(node) < i) NA else node[[i]]),...),mkname(c(keyList,i),maxLen)));
        } else {
            ## keep original type if maxLen is 1, IOW don't stringify
            ret <- list(setNames(data.frame(sapply(nodes,function(node) if (length(node) == 0L) NA else if (maxLen == 1L) node else paste(collapse=sep,node)),...),mkname(keyList,maxLen)));
        }; ## end if
    } else stop(sprintf('error: unsupported type %s at %s.',type,keyListToStr(keyList)));
    if (is.null(ret)) ret <- list(); ## handle corner case of exclusively empty sublists
    ret;
}; ## end extractLevelColumns()
## simple interface function
flattenList <- function(mainList,...) do.call(cbind,extractLevelColumns(mainList,...));

Execution:

## define data
mylist <- list(structure(list(Hit='True',Project='Blue',Year='2011',Rating='4',Launch='26 Jan 2012',ID='19',Dept='1, 2, 4'),.Names=c('Hit','Project','Year','Rating','Launch','ID','Dept')),structure(list(Hit='False',Error='Record not found'),.Names=c('Hit','Error')),structure(list(Hit='True',Project='Green',Year='2004',Rating='8',Launch='29 Feb 2004',ID='183',Dept='6, 8'),.Names=c('Hit','Project','Year','Rating','Launch','ID','Dept')));

## run it
df <- flattenList(mylist);
## extractLevelColumns():
## extractLevelColumns(): Hit
## extractLevelColumns(): Project
## extractLevelColumns(): Year
## extractLevelColumns(): Rating
## extractLevelColumns(): Launch
## extractLevelColumns(): ID
## extractLevelColumns(): Dept
## extractLevelColumns(): Error

df;
##     Hit Project Year Rating      Launch   ID    Dept            Error
## 1  True    Blue 2011      4 26 Jan 2012   19 1, 2, 4             <NA>
## 2 False    <NA> <NA>   <NA>        <NA> <NA>    <NA> Record not found
## 3  True   Green 2004      8 29 Feb 2004  183    6, 8             <NA>

My function is more powerful than data.table::rbindlist() as of 1.9.6, in that it can handle any number of nesting levels and different vector lengths across branches. In the linked question, my function correctly flattens the OP's list to a data.frame, but data.table::rbindlist() fails with "Error in rbindlist(jsonRList, fill = T) : Column 4 of item 16 is length 2, inconsistent with first column of that item which is length 1. rbind/rbindlist doesn't recycle as it already expects each item to be a uniform list, data.frame or data.table".
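For the specific failure quoted above (a field that holds a vector of length 2 in one record and length 1 in another), a lighter workaround is to collapse every field to a single string before binding. The following is a base R sketch on made-up data, not part of the function above; collapse_fields() is a hypothetical helper and it assumes all fields are atomic vectors (no deeper nesting).

```r
# Hypothetical records: 'tags' has a different length in each record,
# and is missing entirely from the last one
recs <- list(
  list(id = "1", tags = c("a", "b")),
  list(id = "2", tags = "c"),
  list(id = "3")
)

# collapse_fields() (hypothetical helper): join each atomic field into one string
collapse_fields <- function(rec, sep = ", ") {
  lapply(rec, function(v) paste(v, collapse = sep))
}

flat <- lapply(recs, collapse_fields)
all_cols <- unique(unlist(lapply(flat, names)))

# Index each flattened record by the full column set; absent fields become NA
mat <- do.call(rbind, lapply(flat, function(rec) unlist(rec)[all_cols]))
colnames(mat) <- all_cols
out <- as.data.frame(mat, stringsAsFactors = FALSE)
```

This trades the separate columns that flattenList() can produce for a single delimited column per field, which matches how Dept is already stored in the question's data.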

Community
bgoldst
  • 1
    Wow, finally I found a solution to flatten the type of list I'm facing. Thank you. – jcarlos Sep 07 '16 at 18:13
  • 2
    tried this on a complicated list and got: `Error in extractLevelColumns(lapply(nodes, function(node) if (length(node) < : error: inconsistent types () at /V1/2.` – dca Mar 28 '18 at 23:16
  • 1
    @GabrielFair (and @dca) If you post a link to your list (e.g. on GitHub) I might be able to debug and improve my code to handle your list, or at least improve the error message to make it more descriptive/clearer. – bgoldst Apr 16 '18 at 05:40
  • 1
    Thank you, sorry I should have been more clear. I'm getting the same error you are getting at the bottom of your post where you say you can't flatten OP's list. I'll create a new SO question, If I can't get this working on my own. Thanks again – Gabriel Fair Apr 16 '18 at 17:45
  • 1
    I also get an error: `Error in extractLevelColumns(lapply(nodes, `[[`, key), ..., keyList = c(keyList, : error: inconsistent types ` – JLC Feb 20 '19 at 20:23
  • Will you please write a whole package for flattening things? This is fantastic, thank you. – gladys_c_hugh Feb 10 '21 at 19:39
2

Here's a solution that converts a nested or uneven list to a data frame. rbindlist fails in many cases, especially for lists of lists, so I wrote a replacement:

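library(foreach)  # foreach(), %do%
library(dplyr)    # bind_rows()

rbindlist.v2 <- function(l)
{
   # keep only the elements that are plain lists
   l <- l[vapply(l, function(x) identical(class(x), "list"), logical(1))]
   # flatten each element into a one-row data frame, then stack the rows;
   # elements that raise an error are dropped
   foreach(element = l, .combine = bind_rows, .errorhandling = 'remove') %do%
      as.data.frame(t(unlist(element)), stringsAsFactors = FALSE)
}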

For large lists you can speed this up by replacing %do% with %dopar%, after registering a parallel backend (for example with the doParallel package). I needed that for my case as well.
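The core of this approach is unlist() followed by t(), which turns one nested record into a one-row data frame whose column names encode the nesting path. A minimal base R sketch on a made-up record:

```r
# Hypothetical nested record: 'meta' is a sub-list, 'dept' an unnamed vector
nested <- list(id = "1", meta = list(year = "2011", dept = c("1", "2")))

# unlist() flattens recursively and builds dotted names from the path;
# elements of an unnamed vector get a numeric suffix
flat <- unlist(nested)

# t() turns the named vector into a 1 x n matrix, which becomes a one-row data frame
row <- as.data.frame(t(flat), stringsAsFactors = FALSE)
names(row)
# "id" "meta.year" "meta.dept1" "meta.dept2"
```

Because unlist() coerces everything to a common type, all columns come back as character when any field is a string, and records with vectors of different lengths end up with different sets of columns, which is why a filling bind such as bind_rows is needed to stack them.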

ishonest