
There is an existing question, How do you split a list into evenly sized chunks?, for splitting an array into chunks. Is there any way to do this more efficiently for giant arrays using NumPy?

  • Sorry, I'm still looking for an efficient answer ;). Right now I'm thinking ctypes is the only efficient way. – Eiyrioü von Kauyf Dec 30 '13 at 22:10
  • Define efficient. Give some sample data, your current method, how fast it is, and how fast you need it to be. – Prashant Kumar Dec 31 '13 at 18:19
  • Are we supposed to interpret the input to this question as a [native Python array](https://docs.python.org/3/library/array.html), or a [numpy ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html)? The first sentence seems to imply the former. The second sentence implies it's asking for a comparison between the former and the latter. 2-dimensional only, presumably. And when we say "efficiently... for giant arrays", are we more concerned with scalability for asymptotically large N, regardless of whether it's slower for small N? – smci Nov 09 '20 at 23:24

6 Answers


Try numpy.array_split.

From the documentation:

>>> x = np.arange(8.0)
>>> np.array_split(x, 3)
    [array([ 0.,  1.,  2.]), array([ 3.,  4.,  5.]), array([ 6.,  7.])]

Identical to numpy.split, but it won't raise an exception if the groups aren't of equal length.

If the number of chunks is greater than len(array), you get empty arrays nested inside the result. To address that, if your split result is saved in a, you can remove the empty arrays with:

[x for x in a if x.size > 0]

Just save that back in a if you wish.
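
For example, a quick demonstration of both behaviours (the exact dtype shown in the empty arrays may differ on your platform):

>>> import numpy as np
>>> a = np.array_split(np.arange(3), 5)   # more chunks than elements
>>> a
[array([0]), array([1]), array([2]), array([], dtype=int64), array([], dtype=int64)]
>>> [x for x in a if x.size > 0]
[array([0]), array([1]), array([2])]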

– Prashant Kumar

Just some examples of the usage of array_split, split, hsplit and vsplit:

In [9]: a = np.random.randint(0,10,[4,4])

In [10]: a
Out[10]: 
array([[2, 2, 7, 1],
       [5, 0, 3, 1],
       [2, 9, 8, 8],
       [5, 7, 7, 6]])

Some examples of using array_split:
If you give an array or list as the second argument, you are specifying the indices before which to 'cut':

# split rows into 0|1 2|3
In [4]: np.array_split(a, [1,3])
Out[4]:                                                                                                                       
[array([[2, 2, 7, 1]]),                                                                                                       
 array([[5, 0, 3, 1],                                                                                                         
       [2, 9, 8, 8]]),                                                                                                        
 array([[5, 7, 7, 6]])]

# split columns into 0| 1 2 3
In [5]: np.array_split(a, [1], axis=1)                                                                                           
Out[5]:                                                                                                                       
[array([[2],                                                                                                                  
       [5],                                                                                                                   
       [2],                                                                                                                   
       [5]]),                                                                                                                 
 array([[2, 7, 1],                                                                                                            
       [0, 3, 1],
       [9, 8, 8],
       [7, 7, 6]])]

An integer as the second argument specifies the number of (nearly) equal chunks:

In [6]: np.array_split(a, 2, axis=1)
Out[6]: 
[array([[2, 2],
       [5, 0],
       [2, 9],
       [5, 7]]),
 array([[7, 1],
       [3, 1],
       [8, 8],
       [7, 6]])]

split works the same way, but raises an exception if an equal split is not possible.
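
For instance, trying to split the 4-row array above into 3 equal parts fails (traceback abbreviated; the exact message may vary between NumPy versions):

In [7]: np.split(a, 3)
ValueError: array split does not result in an equal division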

In addition to array_split you can use the shortcuts vsplit and hsplit.
vsplit and hsplit are pretty much self-explanatory: vsplit splits along the vertical axis (rows), hsplit along the horizontal axis (columns):

In [11]: np.vsplit(a, 2)
Out[11]: 
[array([[2, 2, 7, 1],
       [5, 0, 3, 1]]),
 array([[2, 9, 8, 8],
       [5, 7, 7, 6]])]

In [12]: np.hsplit(a, 2)
Out[12]: 
[array([[2, 2],
       [5, 0],
       [2, 9],
       [5, 7]]),
 array([[7, 1],
       [3, 1],
       [8, 8],
       [7, 6]])]
– tzelleke
  • My problem with this is that if chunks > len(array), then you get blank nested arrays... how do you get rid of that? – Eiyrioü von Kauyf Jan 29 '13 at 07:48
  • Good examples, thank you. In your `np.array_split(a, [1], axis=1)` example, do you know how to prevent the first array from having every single element nested? – timgeb Jan 04 '16 at 08:11

I believe that you're looking for numpy.split, or possibly numpy.array_split if the number of sections doesn't need to divide the size of the array evenly.
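
A minimal sketch of the difference on a large 1-D array (the array length and chunk count here are just illustrative):

>>> import numpy as np
>>> big = np.arange(10_000_003)           # length not divisible by 8
>>> parts = np.array_split(big, 8)        # np.split(big, 8) would raise ValueError here
>>> [len(p) for p in parts]
[1250001, 1250001, 1250001, 1250000, 1250000, 1250000, 1250000, 1250000]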

– mgilson

Not quite an answer, but rather a long comment, with nicely formatted code, on the other (correct) answers. If you try the following, you will see that what you get are views of the original array, not copies, which was not the case for the accepted answer in the question you linked. Be aware of the possible side effects!

>>> x = np.arange(9.0)
>>> a,b,c = np.split(x, 3)
>>> a
array([ 0.,  1.,  2.])
>>> a[1] = 8
>>> a
array([ 0.,  8.,  2.])
>>> x
array([ 0.,  8.,  2.,  3.,  4.,  5.,  6.,  7.,  8.])
>>> def chunks(l, n):
...     """ Yield successive n-sized chunks from l.
...     """
...     for i in range(0, len(l), n):
...         yield l[i:i+n]
... 
>>> l = list(range(9))
>>> a,b,c = chunks(l, 3)
>>> a
[0, 1, 2]
>>> a[1] = 8
>>> a
[0, 8, 2]
>>> l
[0, 1, 2, 3, 4, 5, 6, 7, 8]
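
If you want to check explicitly whether a given chunk is a view of the original array or an independent copy, np.shares_memory (available in NumPy 1.11+) does exactly that; a minimal sketch:

>>> import numpy as np
>>> x = np.arange(9.0)
>>> [np.shares_memory(x, chunk) for chunk in np.split(x, 3)]
[True, True, True]
>>> copies = [chunk.copy() for chunk in np.split(x, 3)]   # force independent copies
>>> [np.shares_memory(x, c) for c in copies]
[False, False, False]

The same check works for np.array_split if you want to verify its behaviour on your NumPy version.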
– Jaime
  • +1) That's a good point to consider; you could extend your solution further to handle certain multidimensional cases. – tzelleke Jan 18 '13 at 20:55
  • Yes, at the moment I use that. I was wondering about a nicer way to do that using numpy, esp. with multi-dim :( – Eiyrioü von Kauyf Jan 29 '13 at 07:49
  • This is relevant for larger data. I am using `numpy.array_split` which appears to make copies of the data. Passing that to your multiprocessing pool will make yet another copy of the data... – Stefan Falk Jan 11 '18 at 18:19

How about this? Here you split the array manually, by slicing out blocks of the length you want.

a = np.random.randint(0,10,[4,4])

a
Out[27]: 
array([[1, 5, 8, 7],
       [3, 2, 4, 0],
       [7, 7, 6, 2],
       [7, 4, 3, 0]])

a[0:2,:]
Out[28]: 
array([[1, 5, 8, 7],
       [3, 2, 4, 0]])

a[2:4,:]
Out[29]: 
array([[7, 7, 6, 2],
       [7, 4, 3, 0]])
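
If you want this pattern without hard-coding the slice indices, here is a minimal sketch of a row-chunking generator (the name chunk_rows is purely illustrative); note that the slices it yields are views, not copies:

import numpy as np

def chunk_rows(arr, rows_per_chunk):
    """Yield successive blocks of rows from a 2-D array; each block is a view."""
    for start in range(0, arr.shape[0], rows_per_chunk):
        yield arr[start:start + rows_per_chunk, :]

a = np.random.randint(0, 10, [4, 4])
for block in chunk_rows(a, 2):
    print(block.shape)        # prints (2, 4) twice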
– Nilani Algiriyage

This can be achieved using NumPy's as_strided. I have put a spin on the answer by assuming that if the chunk size is not a factor of the total number of rows, then the remaining rows in the last batch will be padded with zeros.

import numpy as np
from numpy.lib.stride_tricks import as_strided

def batch_data(test, chunk_count):
    # Assumes `test` is a C-contiguous 2-D array; otherwise the strides below are wrong.
    m, n = test.shape
    S = test.itemsize
    if not chunk_count:
        chunk_count = 1
    batch_size = m // chunk_count
    # Batches which can be covered fully
    test_batches = as_strided(test, shape=(chunk_count, batch_size, n),
                              strides=(batch_size * n * S, n * S, S)).copy()
    covered = chunk_count * batch_size
    if covered < m:
        # Pad the leftover rows with zeros to form one more full-sized batch
        rest = test[covered:, :]
        rm, rn = rest.shape
        mismatch = batch_size - rm
        last_batch = np.vstack((rest, np.zeros((mismatch, rn)))).reshape(1, -1, n)
        return np.vstack((test_batches, last_batch))
    return test_batches

This is based on my answer https://stackoverflow.com/a/68238815/5462372.
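
A quick usage sketch, assuming the batch_data function (and imports) above; the shapes are just illustrative:

data = np.arange(30).reshape(10, 3)   # 10 rows that don't split evenly into 3 chunks
batches = batch_data(data, 3)         # 3 full chunks of 3 rows, plus a padded one

print(batches.shape)                  # (4, 3, 3): the last batch is zero-padded
print(batches[-1])                    # last real row [27. 28. 29.] followed by two rows of zeros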

– MSS