
I have a large dataset I am reading into R. I want to apply the `unique()` function to it so I can work with it more easily, but when I try to do so, I get this error:

clients <- unique(clients)
Error: cannot allocate vector of size 27.9 Mb

So I am trying to apply this function part by part by doing this:

clientsmd <- data.frame()
n <- 7316738   # number of observations in the dataset
t <- 0
for (i in 1:200) {
  clientsm <- clients[1 + (t * round(n / 200)):(t + 1) * round(n / 200), ]
  clientsm <- unique(clientsm)
  clientsmd <- rbind(clientsm)
  t <- t + 1
}

But I get this:

 Error in `[.default`(xj, i) : subscript too large for 32-bit R

I have been told that I could do this more easily with packages such as "ff" or "bigmemory" (or any other), but I don't know how to use them for this purpose.

I'd appreciate any kind of guidance, whether it is to tell me why my code won't work or to show me how I could take advantage of these packages.

Gotey
  • If `clients` is your whole dataframe, I suppose it has a column with a unique identifier. Say this column is called `id`. It might be worthwhile trying to see if `unique(clients$id)` or preferably `duplicated(clients$id)` works. This also enables you to subset `clients` to get all duplicates, which you can then check further including other columns. – coffeinjunky Mar 14 '16 at 11:56
  • How much RAM do you have, and what's the size of your `data.frame`? It also matters whether you have a 32- or 64-bit operating system. Your problem looks like a simple memory issue; sometimes calling the `gc()` function can help, or closing R and starting it again, and you may try to free more RAM in your system by closing other running applications. And don't be scared to get familiar with the `ff` and `ffbase` packages; you can convert your `data.frame` to an `ffdf` like this: `clients_ffdf <- as.ffdf(clients)` (see the sketch below). – inscaven Mar 15 '16 at 06:09
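Two notes on the approaches above. First, the chunked loop in the question fails for a reason worth spelling out: in R, `:` binds more tightly than `*` and `+`, so the subscript expression `1+(t*round((n/200))):(t+1)*round((n/200))` builds a sequence first and only then multiplies it, producing enormous indices (hence "subscript too large for 32-bit R"). The loop also calls `rbind(clientsm)` without `clientsmd`, so nothing accumulates. A corrected sketch of the same idea:

chunks <- 200
n <- 7316738                  # number of observations
size <- ceiling(n / chunks)   # rows per chunk
clientsmd <- data.frame()
for (t in 0:(chunks - 1)) {
  from <- t * size + 1
  to <- min((t + 1) * size, n)
  clientsm <- unique(clients[from:to, ])    # dedupe within the chunk
  clientsmd <- rbind(clientsmd, clientsm)   # append, don't overwrite
}
clientsmd <- unique(clientsmd)  # duplicates can still span chunks

Note the final `unique()` can still run out of memory if few rows are actually duplicated. Second, a minimal sketch of the `ff`/`ffbase` route from the comment above, assuming `clients` holds only column types `ff` supports (numeric, integer, factor) and that your `ffbase` version provides a `unique` method for `ffdf` objects (check the package documentation):

library(ff)
library(ffbase)
clients_ffdf <- as.ffdf(clients)      # disk-backed copy of the data.frame
clients_uniq <- unique(clients_ffdf)  # assumption: ffbase's unique() for ffdf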

3 Answers


Is `clients` a data.frame or a data.table? data.table can handle quite large amounts of data compared to data.frame:

library(data.table)
clients <- data.table(clients)
clientsUnique <- unique(clients)

or

duplicateIndex <- duplicated(clients)

will give a logical vector marking the rows that are duplicates, so `clients[!duplicateIndex]` keeps only the unique rows.
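Note that `data.table(clients)` makes a full copy, which may be a problem on a machine that is already short of memory. A minimal copy-free sketch, assuming a data.table version recent enough to provide `setDT()`:

library(data.table)
setDT(clients)              # convert to data.table in place, no copy
clients <- unique(clients)  # data.table's unique() over all columns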

iboboboru

Increase your memory limit as below and then try executing again:

memory.limit(size = 4000)   ## Windows-specific command; size is in MB
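For reference, two related Windows-only calls for checking where you stand (both have since been retired in recent versions of R):

memory.limit()   # report the current limit, in MB
memory.size()    # memory currently in use, in MB

Note that on a 32-bit build of R the limit cannot be raised much beyond 4 GB; a 64-bit R on a 64-bit OS is the more durable fix.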
Sowmya S. Manian

You could use the `distinct` function from the dplyr package:

df %>% distinct(ID)

where `ID` is a column that uniquely identifies each row of your dataframe.
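Note that `distinct(ID)` returns only the `ID` column. To mirror `unique(clients)` over whole rows, call `distinct()` with no column arguments; to keep one full row per `ID`, newer dplyr versions accept `.keep_all = TRUE`. A minimal sketch (here `ID` is a hypothetical column name):

library(dplyr)
clients <- clients %>% distinct()                        # unique rows, all columns
one_per_id <- clients %>% distinct(ID, .keep_all = TRUE) # one full row per ID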

Pankaj Kaundal