
I want to reshape a 2D scipy.sparse.csr.csr_matrix (let us call it A) into a 2D numpy.ndarray (let us call it B).

A could be

>shape(A)
(90, 10)

then B should be

>shape(B)
(9, 10)

where every 10 rows of A are collapsed into a single new value per column, namely the maximum of that 10-row window in each column. Column indexing does not work on this unhashable sparse matrix type. How can I get this B by using matrix multiplications?
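
For example, with a small dense toy array the operation I have in mind would look like this (just an illustration; the real A is a csr_matrix and the window size is 10, not 3):

import numpy as np

A_small = np.array([[1, 5],
                    [4, 2],
                    [3, 9],   # first window of 3 rows
                    [7, 0],
                    [2, 8],
                    [6, 1]])  # second window of 3 rows

# column-wise maximum of every window of 3 rows
B_small = A_small.reshape(2, 3, 2).max(axis=1)
# B_small == [[4, 9],
#             [7, 8]]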

    I don't think matrix multiplication is an option for the **maximum**, could be for the **sum**. Have you considered using CSC format instead, which does support column slicing? – Jaime Jan 23 '13 at 10:40
  • I can't think of a way of doing this fast on a sparse array. But even a `(9000, 1000)` array can be processed in about `100 ms` on my system doing `rows, cols = sparse_mat.shape` and then `np.max(sparse_mat.toarray().reshape(rows // 10, 10, cols), axis=1)`. – Jaime Jan 23 '13 at 18:50
  • Thanks for the comments. In my case it takes a few seconds, because the matrices are much larger. But I put in some prior knowledge that you didn't have: the number of rows significantly exceeds the number of columns, so I will loop over the columns, reshape each, and compute the maximum. This is not what I wanted, but it seems like there is nothing radically faster. – Milla Well Jan 24 '13 at 08:06
  • @MillaWell did you try the answer below? – Saullo G. P. Castro Oct 17 '14 at 12:52

1 Answer


Using matrix multiplication you can do efficient slicing by creating a "slicer" matrix with ones in the right places. The sliced matrix will have the same type as the "slicer", so you can control your output type in an efficient way.

Below you will see some comparisons. The most efficient option for your case is to ask for the .A attribute of the matrix and slice it, which proved to be much faster than the .toarray() method. Using multiplication is the second fastest option, when the "slicer" is created as an ndarray, multiplied by the csr matrix, and the result is then sliced.

Note: using a coo sparse matrix for A resulted in slightly slower timings, keeping the same proportions, and sol3 is not applicable; I realized later that in the multiplication it is converted to csr automatically.

import scipy
import scipy.sparse.csr as csr
test = csr.csr_matrix([
[11,12,13,14,15,16,17,18,19],
[21,22,23,24,25,26,27,28,29],
[31,32,33,34,35,36,37,38,39],
[41,42,43,44,45,46,47,48,49],
[51,52,53,54,55,56,57,58,59],
[61,62,63,64,65,66,67,68,69],
[71,72,73,74,75,76,77,78,79],
[81,82,83,84,85,86,88,88,89],
[91,92,93,94,95,96,99,98,99]])

def sol1():
    # densify the whole matrix via the .A attribute, then slice the rows
    B = test.A[2:5]
    return B

def sol2():
    # dense "slicer" with ones at the wanted rows; the matrix product copies
    # rows 2:5 of `test` into rows 2:5 of the result, which is dense because
    # the slicer is dense
    slicer = scipy.array([[0,0,0,0,0,0,0,0,0],
                          [0,0,0,0,0,0,0,0,0],
                          [0,0,1,0,0,0,0,0,0],
                          [0,0,0,1,0,0,0,0,0],
                          [0,0,0,0,1,0,0,0,0]])
    B = (slicer*test)[2:]
    return B

def sol3():
    # slice the sparse matrix first, then densify the result
    B = (test[2:5]).A
    return B

def sol4():
    # sparse "slicer": densify the product, then slice
    slicer = csr.csr_matrix( ((1,1,1),((2,3,4),(2,3,4))), shape=(5,9) )
    B = ((slicer*test).A)[2:] # just changing when we do the slicing
    return B

def sol5():
    # sparse "slicer": slice the product, then densify
    slicer = csr.csr_matrix( ((1,1,1),((2,3,4),(2,3,4))), shape=(5,9) )
    B = ((slicer*test)[2:]).A
    return B


timeit sol1()
#10000 loops, best of 3: 60.4 us per loop

timeit sol2()
#10000 loops, best of 3: 91.4 us per loop

timeit sol3()
#10000 loops, best of 3: 111 us per loop

timeit sol4()
#1000 loops, best of 3: 310 us per loop

timeit sol5()
#1000 loops, best of 3: 363 us per loop

EDIT: the answer has been updated replacing .toarray() with .A, giving much faster results; the best solutions are now placed on top.
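
For completeness, here is a sketch of how the same "slicer" idea could be applied to the original windowed question: matrix multiplication can only accumulate sums (as Jaime noted in the comments), so a block "slicer" gives per-window column sums or means, while the maximum still needs a dense reshape. The names window and n_windows and the random stand-in for A below are illustrative assumptions, not part of the code above:

import numpy as np
import scipy.sparse as sp

A = sp.random(90, 10, density=0.3, format='csr')  # stand-in for the real (90, 10) matrix
window = 10
n_windows = A.shape[0] // window

# block "slicer": row i has ones on columns i*window ... i*window + window - 1,
# so slicer * A sums every window of 10 rows, column by column
rows = np.repeat(np.arange(n_windows), window)
cols = np.arange(n_windows * window)
slicer = sp.csr_matrix((np.ones(len(cols)), (rows, cols)), shape=(n_windows, A.shape[0]))

B_sum = (slicer * A).A     # (9, 10) column-wise sums per window
B_mean = B_sum / window    # means come for free; maxima do not

# for the maximum itself, the dense reshape from the comments still applies
B_max = A.toarray().reshape(n_windows, window, A.shape[1]).max(axis=1)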

  • I cannot reproduce your timings. I don't know if your test-design gives you some overhead on function calls since you wrapped your experiments in functions, but using `%timeit test.A[2:5]` vs. `%timeit test.toarray()[2:5]` is in favor of toarray() (on my system 5 us vs 4 us). – Markus Jul 22 '19 at 10:49
  • @Markus it's been a while since I posted this answer. Most likely, with the new versions of Scipy and Numpy, things have changed quite a bit – Saullo G. P. Castro Jul 22 '19 at 11:32