Summarise huge data using ffdfdply()

Question

I have several very large datasets (.csv files, 4 to 9 GB each). I used the ff and ffbase packages to load them into R and calculate the daily mean, sum and max of energy expenditure values. The script worked for 15 of the 19 files, but it now fails on the remaining four. I am still fairly new to R and am only just learning how to work with files this large.

Here is the script (adapted from the question "aggregation using ffdfdply function in R"):

library(tidyverse)
library(ff) # to work with files 2 - 10 GB
library(ffbase)
# read the CSV into an on-disk ffdf object
tab.ff <- read.csv.ffdf(file = "file.csv")
# check that an ffdf object was created
class(tab.ff)
str(tab.ff)
# split by date -> assuming that all data of 1 date can fit into RAM
splitby <- as.character(tab.ff$Date, by = 250000)
grp_qty <- ffdfdply(x=tab.ff[c("Date","ODBA.Sm","VeDBA.smoothed")], 
                    split=splitby, 
                    FUN = function(tab.ff){
                      ## This runs in RAM on a chunk that contains several split elements, so data.table can be used here for the in-memory computation
                      require(data.table)
                      tab.ff <- as.data.table(tab.ff)
                      result <- tab.ff[, list(ODBA_sum = sum(ODBA.Sm, na.rm = TRUE), VeDBA_sum = sum(VeDBA.smoothed, na.rm = TRUE),
                                            ODBA_mean = mean(ODBA.Sm, na.rm = TRUE), VeDBA_mean = mean(VeDBA.smoothed, na.rm = TRUE),
                                            ODBA_max = max(ODBA.Sm, na.rm = TRUE), VeDBA_max = max(VeDBA.smoothed, na.rm = TRUE)), by = list(Date)]
                      as.data.frame(result)
                    })
dim(grp_qty)
grp_qty # look at it

# export as csv file (use a name different from the input so the source file is not overwritten)
write.csv.ffdf(grp_qty, file = "file_daily_summary.csv")

As mentioned, it worked for 15 files, but for the other four ffdfdply() stops with the following error:

2021-11-02 17:53:05, calculating split sizes
Error in grouprunningcumsum(x = as.integer(splitgroups$tab), max = MAXSIZE) : 
  NAs in foreign function call (arg 3)
In addition: Warning message:
In grouprunningcumsum(x = as.integer(splitgroups$tab), max = MAXSIZE) :
  NAs introduced by coercion to integer range

I would really appreciate any ideas on how to fix this, or on another way to aggregate the mean, sum and max by Date. Thanks in advance!
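
On the request for another way to aggregate by Date: one option that needs neither ff nor ffbase is to stream the CSV in fixed-size chunks and keep only per-date partial results (count, sum, max), which combine exactly into the daily sum, mean and max. Below is a minimal base R sketch of that idea. The function name summarise_by_date_chunked and its arguments are made up for illustration, and it assumes a plain comma-separated file with a header row and a Date column. It has not been run on the files in question and it does not explain the ffdfdply error itself.

```r
# Hypothetical helper: per-date sum, mean and max computed chunk by chunk.
summarise_by_date_chunked <- function(csv_path, value_cols, chunk_rows = 250000L) {
  # Column names as read.csv would sanitise them (e.g. "ODBA.Sm")
  header <- names(read.csv(csv_path, nrows = 1))
  con <- file(csv_path, open = "r")
  on.exit(close(con))
  invisible(readLines(con, n = 1))  # skip the header line

  partials <- list()
  repeat {
    chunk <- tryCatch(
      read.csv(con, header = FALSE, col.names = header, nrows = chunk_rows),
      error = function(e) NULL      # read.csv errors once no lines remain
    )
    if (is.null(chunk) || nrow(chunk) == 0) break

    for (v in value_cols) {
      ok <- !is.na(chunk[[v]])      # same effect as na.rm = TRUE
      if (!any(ok)) next
      x <- as.numeric(chunk[[v]][ok])
      f <- factor(as.character(chunk$Date[ok]))
      partials[[length(partials) + 1L]] <- data.frame(
        Date     = levels(f),
        variable = v,
        n        = as.numeric(tapply(x, f, length)),
        sum      = as.numeric(tapply(x, f, sum)),
        max      = as.numeric(tapply(x, f, max))
      )
    }
  }
  if (length(partials) == 0L) return(NULL)

  # Combine partial results: counts and sums add up, maxima take the max
  all_parts <- do.call(rbind, partials)
  key <- list(Date = all_parts$Date, variable = all_parts$variable)
  out <- aggregate(all_parts[c("n", "sum")], by = key, FUN = sum)
  out$max  <- aggregate(all_parts["max"], by = key, FUN = max)$max
  out$mean <- out$sum / out$n
  out
}
```

A call such as summarise_by_date_chunked("file.csv", c("ODBA.Sm", "VeDBA.smoothed")) would return one row per Date and variable. Only one chunk plus the small table of partial results is held in RAM at any time, so a single date does not need to fit into memory.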

Okay, found a work-around using SQLite databases. – Justine Güldenpfennig, Nov 05 '21 at 10:57
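
The comment does not say how SQLite was used. One possible shape of such a work-around, assuming the DBI and RSQLite packages are installed, is to append the CSV to an on-disk SQLite table chunk by chunk and let SQL do the grouping. The function name csv_to_sqlite_daily, the table name acc and the shortened column names ODBA and VeDBA are hypothetical; the input column names are taken from the script in the question.

```r
library(DBI)

# Hypothetical helper: load the CSV into SQLite in chunks, then aggregate by Date.
csv_to_sqlite_daily <- function(csv_path, db_path, chunk_rows = 250000L) {
  db <- dbConnect(RSQLite::SQLite(), db_path)
  on.exit(dbDisconnect(db), add = TRUE)

  header <- names(read.csv(csv_path, nrows = 1))
  con <- file(csv_path, open = "r")
  on.exit(close(con), add = TRUE)
  invisible(readLines(con, n = 1))  # skip the header line

  repeat {
    chunk <- tryCatch(
      read.csv(con, header = FALSE, col.names = header, nrows = chunk_rows),
      error = function(e) NULL      # read.csv errors once no lines remain
    )
    if (is.null(chunk) || nrow(chunk) == 0) break
    keep <- chunk[c("Date", "ODBA.Sm", "VeDBA.smoothed")]
    names(keep) <- c("Date", "ODBA", "VeDBA")  # avoid dots in SQL identifiers
    # Creates the table on the first chunk, appends on later ones
    dbWriteTable(db, "acc", keep, append = TRUE)
  }

  # SQL aggregates skip NULLs, which mirrors na.rm = TRUE
  dbGetQuery(db, "
    SELECT Date,
           SUM(ODBA)  AS ODBA_sum,  SUM(VeDBA) AS VeDBA_sum,
           AVG(ODBA)  AS ODBA_mean, AVG(VeDBA) AS VeDBA_mean,
           MAX(ODBA)  AS ODBA_max,  MAX(VeDBA) AS VeDBA_max
    FROM acc
    GROUP BY Date
    ORDER BY Date")
}
```

The database file should be new for each CSV, since rows are appended to the acc table if it already exists. The result is an ordinary data frame that can be written out with write.csv().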
