Compress indices in R

Question

I have data like the following:

     V1   V2
1 10001 1003
2 10002 1005
3 10002 1007
4 10003 1001
5 10003 1005
...

This is edge-list data.

The V1 index is very sparse: only a few of the numbers in [1..10001] are occupied.

For example, it is something like max(V1) = 20000 but range(V1) = [10000, 20000].

I want to compress the index.

Here's what I've done:

sorted <- sort(data, index.return = T)

However, duplicated node indices are assigned different sorted indices. I also need the inverse of the returned index (that is, of sorted$ix).

I'm new to R. How should I do this?

The inverse of `sort` is `order`, but you would have the same problem with duplicates. Instead, you can convert the columns to factors and then use `as.numeric` to have smaller indices. — Vincent Zoonekynd, Jul 24 '13 at 08:01
Please show (a longer excerpt of) your input, the intended output and the output of `str(data)`. You should also read [this FAQ](http://stackoverflow.com/a/5963610/1412059). — Roland, Jul 24 '13 at 08:26
I agree with Roland: we need more info. For example, one way to compress could be `rle()`, but it depends on your needs... — digEmAll, Jul 24 '13 at 09:12

score 0 · Accepted Answer · edited Jul 24 '13 at 08:32

Maybe you could save some memory by converting the index columns to factors.

For example:

> d <- data.frame(x = rep(c(1000, 2000), 10000), y=rep(c(100, 150), 10000)) 
> object.size(d)
320448 bytes
> d1 <- data.frame(x=as.factor(d$x), y=as.factor(d$y))
> object.size(d1)
160992 bytes
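If the goal is a compact index rather than only lower memory use, the integer codes behind a factor already provide one, and the factor levels give the way back to the original ids. The following is a minimal sketch, assuming the ids sit in a plain numeric vector; the names `v1`, `f`, `compact` and `restored` are made up for illustration and do not come from the question.

```r
# Hypothetical sparse node ids with duplicates (values borrowed from the question's V1 column)
v1 <- c(10001, 10002, 10002, 10003, 10003)

# factor() sorts the distinct values into levels; the integer codes run from 1 to n
f <- factor(v1)
compact <- as.integer(f)   # equal ids receive the same compact index

# Inverse mapping: the levels hold the original ids as character strings
restored <- as.numeric(levels(f))[compact]

print(compact)
print(restored)
```

Because the codes follow the sorted levels, duplicated ids share one index, which is what `sort(..., index.return = TRUE)` does not give you.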

edited Jul 24 '13 at 08:32 by Thomas

answered Jul 24 '13 at 08:30 by Gao Hao

It seems you need a compact index from your solution. Because of my low reputation, I am adding my comment here. Perhaps you just need to append this line: "levels(d1$x) – Gao Hao Jul 26 '13 at 02:17
Exactly. Thanks dude. – SolessChong Jul 26 '13 at 07:47

score 0 · Answer 2 · answered Jul 24 '13 at 09:31

I'm new to R, so the code may be ugly. Please edit it if you see anything that can be improved.

The main idea is to take the unique values of each column and build a look-up table from them.

# Index compression: replace each id by its rank among the sorted distinct ids
V1_uniq <- unique(data[, 1])
V3_uniq <- unique(data[, 3])

user_n <- length(V1_uniq)  # number of distinct ids in column 1
ast_n <- length(V3_uniq)   # number of distinct ids in column 3

# Look-up table for column 1: LUT1[id] is the compact index of id
rst <- sort(V1_uniq)
LUT1 <- integer(0)
LUT1[rst] <- seq_along(rst)  # positions of unused ids are filled with NA

usr_comp <- LUT1[data[, 1]]

# Same for column 3
rst <- sort(V3_uniq)
LUT3 <- integer(0)
LUT3[rst] <- seq_along(rst)

ast_comp <- LUT3[data[, 3]]
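The same unique-plus-look-up idea can probably be written without a loop by using `match()`. This is only a sketch under stated assumptions, not necessarily how the answer's author would write it: the data frame `edges` and the names `ids1`, `ids2`, `v1_comp` and `v2_comp` are hypothetical, and the two columns follow the excerpt shown in the question.

```r
# Hypothetical edge list; column names follow the question's excerpt
edges <- data.frame(V1 = c(10001, 10002, 10002, 10003, 10003),
                    V2 = c(1003, 1005, 1007, 1001, 1005))

# The sorted distinct ids act as the look-up table: position = compact index
ids1 <- sort(unique(edges$V1))
ids2 <- sort(unique(edges$V2))

# match() returns, for each id, its position in the table
v1_comp <- match(edges$V1, ids1)
v2_comp <- match(edges$V2, ids2)

# Inverse mapping: index the table with the compact index
stopifnot(identical(ids1[v1_comp], edges$V1))
stopifnot(identical(ids2[v2_comp], edges$V2))
```

Here the table is only as long as the number of distinct ids, so nothing has to be allocated up to the largest id, and the table itself serves as the inverse index.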
