
I have a large dataset I am reading into R. I want to apply the `unique()` function to it so I can work with it more easily, but when I try to do so, I get this error:

clients <- unique(clients)
Error: cannot allocate vector of size 27.9 Mb

So I am trying to apply this function part by part by doing this:

clientsmd <- data.frame()
n <- 7316738   # number of observations in the dataset
t <- 0
for (i in 1:200) {
  clientsm <- clients[1 + (t * round(n / 200)):(t + 1) * round(n / 200), ]
  clientsm <- unique(clientsm)
  clientsmd <- rbind(clientsm)
  t <- t + 1
}

But I get this:

 Error in `[.default`(xj, i) : subscript too large for 32-bit R

I have been told that I could do this more easily with packages such as "ff" or "bigmemory" (or any other), but I don't know how to use them for this purpose.

I'd appreciate any kind of guidance, whether it is to tell me why my code won't work or to show me how I could take advantage of these packages.

Gotey
  • If `clients` is your whole dataframe, I suppose it has a column with a unique identifier. Say this column is called `id`. It might be worthwhile trying to see if `unique(clients$id)` or preferably `duplicated(clients$id)` works. This also enables you to subset `clients` to get all duplicates, which you can then check further including other columns. – coffeinjunky Mar 14 '16 at 11:56
  • How much RAM do you have, and what's the size of your `data.frame`? It also matters whether you have a 32- or 64-bit operating system. Your problem looks like a simple memory issue; sometimes calling the `gc()` function can help, or closing R and starting it again, and you may try to free more RAM in your system by closing other running applications. And don't be scared to get familiar with the `ff` and `ffbase` packages; you can convert your `data.frame` to an `ffdf` like this: `clients_ffdf <- as.ffdf(clients)` (see the sketch below). – inscaven Mar 15 '16 at 06:09
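Two notes on the approaches above. First, the chunked loop in the question fails for a reason worth spelling out: in R, `:` binds more tightly than `*` and `+`, so the subscript expression `1+(t*round((n/200))):(t+1)*round((n/200))` builds a sequence first and only then multiplies it, producing enormous indices (hence "subscript too large for 32-bit R"). The loop also calls `rbind(clientsm)` without `clientsmd`, so nothing accumulates. A corrected sketch of the same idea:

chunks <- 200
n <- 7316738                  # number of observations
size <- ceiling(n / chunks)   # rows per chunk
clientsmd <- data.frame()
for (t in 0:(chunks - 1)) {
  from <- t * size + 1
  to <- min((t + 1) * size, n)
  clientsm <- unique(clients[from:to, ])    # dedupe within the chunk
  clientsmd <- rbind(clientsmd, clientsm)   # append, don't overwrite
}
clientsmd <- unique(clientsmd)  # duplicates can still span chunks

Note the final `unique()` can still run out of memory if few rows are actually duplicated. Second, a minimal sketch of the `ff`/`ffbase` route from the comment above, assuming `clients` holds only column types `ff` supports (numeric, integer, factor) and that your `ffbase` version provides a `unique` method for `ffdf` objects (check the package documentation):

library(ff)
library(ffbase)
clients_ffdf <- as.ffdf(clients)      # disk-backed copy of the data.frame
clients_uniq <- unique(clients_ffdf)  # assumption: ffbase's unique() for ffdf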

3 Answers


Is `clients` a data.frame or a data.table? data.table can handle quite large amounts of data compared to data.frame:

library(data.table)
clients <- data.table(clients)
clientsUnique <- unique(clients)

or

duplicateIndex <- duplicated(clients)

will give a logical vector marking the rows that are duplicates, so `clients[!duplicateIndex]` keeps only the unique rows.
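Note that `data.table(clients)` makes a full copy, which may be a problem on a machine that is already short of memory. A minimal copy-free sketch, assuming a data.table version recent enough to provide `setDT()`:

library(data.table)
setDT(clients)              # convert to data.table in place, no copy
clients <- unique(clients)  # data.table's unique() over all columns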

iboboboru

Increase your memory limit as below and then try executing again:

memory.limit(size = 4000)   ## Windows-specific command; size is in MB
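For reference, two related Windows-only calls for checking where you stand (both have since been retired in recent versions of R):

memory.limit()   # report the current limit, in MB
memory.size()    # memory currently in use, in MB

Note that on a 32-bit build of R the limit cannot be raised much beyond 4 GB; a 64-bit R on a 64-bit OS is the more durable fix.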
Sowmya S. Manian

You could use the `distinct` function from the dplyr package:

df %>% distinct(ID)

where `ID` is a column that uniquely identifies each row of your dataframe.
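Note that `distinct(ID)` returns only the `ID` column. To mirror `unique(clients)` over whole rows, call `distinct()` with no column arguments; to keep one full row per `ID`, newer dplyr versions accept `.keep_all = TRUE`. A minimal sketch (here `ID` is a hypothetical column name):

library(dplyr)
clients <- clients %>% distinct()                        # unique rows, all columns
one_per_id <- clients %>% distinct(ID, .keep_all = TRUE) # one full row per ID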

Pankaj Kaundal