I have the data as follows:

    V1   V2
1 10001 1003
2 10002 1005
3 10002 1007
4 10003 1001
5 10003 1005
...

This is edge-list data.

The V1 index is very sparse: only a few of the numbers in [1..10001] are actually used.

For example, max(V1) = 20000 but range(V1) = [10000, 20000].

I want to compress the index, i.e. map the ids onto consecutive integers starting at 1.

Here's what I've done:

sorted <- sort(data, index.return = T)

However, duplicated node ids get different sorted indices, whereas I need the same id to always map to the same index. I also need the inverse of the returned index (that is, of sorted$ix).

I'm new to R. How should I do this?

SolessChong

2 Answers

You may be able to save some memory by converting the index columns to factors.

For example:

> d <- data.frame(x = rep(c(1000, 2000), 10000), y=rep(c(100, 150), 10000)) 
> object.size(d)
320448 bytes
> d1 <- data.frame(x=as.factor(d$x), y=as.factor(d$y))
> object.size(d1)
160992 bytes
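
Beyond the memory saving, a factor also gives a compressed index directly: its integer codes run from 1 to the number of distinct values, and its levels map the codes back to the original ids. A minimal sketch, assuming numeric ids; the vector `v1` is made-up sample data patterned on the question.

```r
# Hypothetical sample ids, patterned on the V1 column in the question
v1 <- c(10001, 10002, 10002, 10003, 10003)

f <- factor(v1)                            # levels are the sorted unique ids
v1_comp <- as.integer(f)                   # compressed index: 1 2 2 3 3
v1_back <- as.numeric(levels(f))[v1_comp]  # inverse mapping to the original ids
```

Duplicated ids receive the same code, which is the behaviour the question asks for.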
Thomas
Gao Hao

I'm new to R and the code may be ugly. Please modify it if you find anything ugly.

The main idea is to take the unique values of each column and build a look-up table that maps each original id to its compressed index.

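# index compression
V1_uniq = unique(data[, 1])
V3_uniq = unique(data[, 3])  # assumes the full data set has a third column

user_n = length(V1_uniq)
ast_n = length(V3_uniq)

# Look-up table: LUT1[original id] = rank of that id among the sorted unique ids.
# rst$x holds the sorted ids, so the i-th sorted id gets compressed index i.
rst = sort(V1_uniq, index.return = TRUE)
LUT1 = integer(max(V1_uniq))
LUT1[rst$x] = seq_along(rst$x)

usr_comp = LUT1[data[, 1]]

# Same construction for the third column
rst = sort(V3_uniq, index.return = TRUE)
LUT3 = integer(max(V3_uniq))
LUT3[rst$x] = seq_along(rst$x)

ast_comp = LUT3[data[, 3]]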
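
A more compact way to get the same kind of mapping is `match()` against the sorted unique ids, which avoids allocating a table as long as the largest id. This is a sketch of an alternative, not the code above; the vector `ids` is made-up sample data.

```r
# Hypothetical sample ids with duplicates, not in sorted order
ids <- c(10003, 10001, 10002, 10003, 10002)

keys <- sort(unique(ids))  # sorted distinct ids: 10001 10002 10003
comp <- match(ids, keys)   # compressed index: 3 1 2 3 2
orig <- keys[comp]         # inverse mapping: recovers the original ids
```

Here `keys` doubles as the inverse look-up table, since `keys[k]` is the original id for compressed index `k`.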
SolessChong