130

Can you think of a nice way (maybe with itertools) to split an iterator into chunks of given size?

So `l = [1,2,3,4,5,6,7]` with `chunks(l, 3)` becomes an iterator yielding `[1,2,3]`, `[4,5,6]`, `[7]`

I can write a small program to do that, but I can't come up with a nice way, maybe using itertools.

Gerenuk
  • 11,281
  • 17
  • 53
  • 87
  • 3
    @kindall: This is close, but not the same, due to the handling of the last chunk. – Sven Marnach Jan 24 '12 at 17:48
  • 5
    This is slightly different, as that question was about lists, and this one is more general, iterators. Although the answer appears to end up being the same. – recursive Jan 24 '12 at 17:48
  • @recursive: Yes, after reading the linked thread completely, I found that everything in my answer already appears somewhere in the other thread. – Sven Marnach Jan 24 '12 at 17:56
  • https://stackoverflow.com/a/312464/3798964 – johnson Oct 08 '20 at 09:52
  • VTR since [one of the linked questions](/q/434287) is about lists specifically, not iterables in general. – wjandrea Dec 17 '21 at 18:27
  • Does this answer your question? [Python generator that groups another iterable into groups of N](https://stackoverflow.com/questions/3992735/python-generator-that-groups-another-iterable-into-groups-of-n) – Tomerikoo Dec 20 '21 at 13:16

10 Answers

163

The grouper() recipe from the itertools documentation comes close to what you want:

from itertools import izip_longest  # renamed to zip_longest in Python 3

def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

It will fill up the last chunk with a fill value, though.
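For example (using the Python 2 names from the recipe above):

list(grouper(3, 'ABCDEFG', 'x'))
# [('A', 'B', 'C'), ('D', 'E', 'F'), ('G', 'x', 'x')]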

A less general solution that only works on sequences but does handle the last chunk as desired is

[my_list[i:i + chunk_size] for i in range(0, len(my_list), chunk_size)]
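For example:

my_list = [1, 2, 3, 4, 5, 6, 7]
[my_list[i:i + 3] for i in range(0, len(my_list), 3)]
# [[1, 2, 3], [4, 5, 6], [7]]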

Finally, a solution that works on general iterators and behaves as desired is

import itertools

def grouper(n, iterable):
    it = iter(iterable)
    while True:
        chunk = tuple(itertools.islice(it, n))
        if not chunk:
            return
        yield chunk
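For example:

list(grouper(3, [1, 2, 3, 4, 5, 6, 7]))
# [(1, 2, 3), (4, 5, 6), (7,)]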
zakdances
  • 19,631
  • 32
  • 97
  • 164
Sven Marnach
  • 530,615
  • 113
  • 910
  • 808
  • Thanks for this and all other ideas! Sorry that I missed the numerous threads already discussing this question. I had tried `islice` but somehow I missed that it indeed soaks up the iterator as desired. Now I'm thinking of defining a custom iterator class which provides all sorts of functionality :) – Gerenuk Jan 25 '12 at 09:52
  • Would `if chunk: yield chunk` be acceptable? it shaves a line off and is as semantic as a single `return`. – Capi Etheriel Oct 31 '14 at 16:52
  • 5
    @barraponto: No, it wouldn't be acceptable, since you would be left with an infinite loop. – Sven Marnach Oct 31 '14 at 17:57
  • 12
    I am surprised that this is such a highly-voted answer. The recipe works great for small `n`, but for large groups, is very inefficient. My n, e.g., is 200,000. Creating a temporary list of 200K items is...not ideal. – Jonathan Eunice Apr 24 '15 at 00:02
  • 5
    @JonathanEunice: In almost all cases, this is what people want (which is the reason why it is included in the Python documentation). Optimising for a particular special case is out of scope for this question, and even with the information you included in your comment, I can't tell what the best approach would be for you. If you want to chunk a list of numbers that fits into memory, you are probably best off using NumPy's `.resize()` method. If you want to chunk a general iterator, the second approach is already quite good -- it creates temporary tuples of size 200K, but that's not a big deal. – Sven Marnach Apr 26 '15 at 15:56
  • 4
    @SvenMarnach We'll have to disagree. I believe people want convenience, not gratuitous overhead. They get the overhead because the docs provide a needlessly bloated answer. With large data, temporary tuples/lists/etc. of 200K or 1M items make the program consume gigabytes of excess memory and take much longer to run. Why do that if you don't have to? At 200K, extra temp storage makes the overall program take 3.5x longer to run than with it removed. Just that one change. So it is a pretty big deal. NumPy won't work because the iterator is a database cursor, not a list of numbers. – Jonathan Eunice Apr 27 '15 at 02:24
  • @JonathanEunice: Sorry, when I said "the second approach" I actually meant the third one in my answer. There will only be a single 200K chunk at any given time, unless you store all of them (in which case you can't blame the code in this answer, but should blame your own code instead), and I can't see how this would use gigabytes of memory. That said, you are currently optimising along a very particular dimension, and all these optimisations have to be tailored to special cases. If you have a solution that you think is better for the general case, please enter an answer of your own. – Sven Marnach Apr 28 '15 at 09:19
  • @JonathanEunice: I think I still haven't understood your use case. – Sven Marnach Apr 28 '15 at 09:20
  • @JonathanEunice also, you are incorrect about the scale of the overhead. If you chunk a list using these methods, you are creating new objects for each chunk, but the cost of the underlying objects already exists, so under the hood you only have to account for the new pointers: 200,000 * 8 * 1e-6 = 1.6 megabytes of overhead for a 200K-sized list, and about 5 times that for a million. – juanpa.arrivillaga Aug 16 '19 at 19:45
  • @juanpa.arrivillaga Note that I said "200K items." 200K items does of course consume ≫ 200K bytes, especially given Python not being particularly space-efficient. – Jonathan Eunice Aug 16 '19 at 21:08
  • @JonathanEunice yes that's what I accounted for and the memory overhead is about 1.6 megabytes, which is several orders of magnitude less than gigabytes of excess memory – juanpa.arrivillaga Aug 16 '19 at 21:23
  • 2
    @SvenMarnach I found out that my problem was due to the usage of `zip` in Python 2, which loads all data in memory, as opposed to `itertools.izip`. You can delete the previous comments and I will also delete this one. – nbro Oct 03 '19 at 00:53
  • 2
    izip_longest was renamed to zip_longest in Python 3 – hojin Oct 30 '19 at 12:47
81

Although the OP asks for the function to return chunks as a list or tuple, in case you need the chunks to be iterators, Sven Marnach's solution can be modified:

import itertools

def grouper_it(n, iterable):
    it = iter(iterable)
    while True:
        chunk_it = itertools.islice(it, n)
        try:
            first_el = next(chunk_it)
        except StopIteration:
            return
        yield itertools.chain((first_el,), chunk_it)

Some benchmarks: http://pastebin.com/YkKFvm8b

It will only be slightly more efficient if your code actually iterates through the elements of every chunk.
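For example (Python 3):

for chunk in grouper_it(3, range(7)):
    print(list(chunk))
# [0, 1, 2]
# [3, 4, 5]
# [6]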

Sid
  • 5,442
  • 2
  • 13
  • 18
reclosedev
  • 9,092
  • 32
  • 50
  • 20
    I arrived at almost exactly this design today, after finding the answer in the documentation (which is the accepted, most-highly-voted answer above) *massively* inefficient. When you're grouping hundreds of thousands or millions of objects at a time--which is when you need segmentation the most--it has to be pretty efficient. THIS is the right answer. – Jonathan Eunice Apr 24 '15 at 01:36
  • This is the best solution. – Lawrence Jan 31 '18 at 09:46
  • 4
    Won't this behave wrongly if the caller doesn't exhaust `chunk_it` (by breaking the inner loop early for example)? – Tavian Barnes Dec 18 '18 at 19:01
  • @TavianBarnes: good point; if a first group is not exhausted, a second will start where the first left off. But it may be considered a feature if you want both to be looped over concurrently. Powerful, but handle with care. – loutre Mar 01 '19 at 14:11
  • @TavianBarnes: This can be made to behave correctly in that case by making a cheap iterator consumer (fastest in CPython if you create it outside the loop is `consume = collections.deque(maxlen=0).extend`), then add `consume(chunk_it)` after the `yield` line; if the caller consumed the `yield`ed `chain`, it does nothing, if they didn't, it consumes it on their behalf as efficiently as possible. Put it in the `finally` of a `try` wrapping the `yield` if you need it to advance a caller-provided iterator to the end of the chunk if the outer loop is broken early (sketched after these comments). – ShadowRanger Jan 15 '20 at 03:08
  • 1
    A little late to the party: this excellent answer could be shortened a bit by replacing the while loop with a for loop: `for x in it: yield chain((x,), islice(it, n))`, right? – Claas Feb 11 '22 at 16:58
  • @Claas that worked for me. at least so far. – kdubs May 24 '22 at 01:58
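Following ShadowRanger's suggestion above, a minimal sketch (the name `grouper_it_consuming` is mine; treat this as an untested adaptation):

import collections
import itertools

def grouper_it_consuming(n, iterable):
    # Like grouper_it above, but always advances the source iterator to the
    # end of the current chunk, even if the caller stopped reading it early.
    consume = collections.deque(maxlen=0).extend  # discards everything it is fed
    it = iter(iterable)
    while True:
        chunk_it = itertools.islice(it, n)
        try:
            first_el = next(chunk_it)
        except StopIteration:
            return
        yield itertools.chain((first_el,), chunk_it)
        consume(chunk_it)  # drain whatever the caller left unread in this chunk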
16

This will work on any iterable. It returns a generator of generators (for full flexibility). I now realize that it's basically the same as @reclosedev's solution, but without the fluff. No need for try...except as the StopIteration propagates up, which is what we want.

The next(iterable) call is needed to raise the StopIteration when the iterable is empty, since islice will continue spawning empty generators forever if you let it.

It's better because it's only two lines long, yet easy to comprehend.

import itertools

def grouper(iterable, n):
    while True:
        yield itertools.chain((next(iterable),), itertools.islice(iterable, n-1))

Note that next(iterable) is put into a tuple. Otherwise, if next(iterable) itself were iterable, then itertools.chain would flatten it out. Thanks to Jeremy Brown for pointing out this issue.

OrangeDog
  • 33,501
  • 12
  • 115
  • 195
Svein Lindal
  • 169
  • 1
  • 3
  • 3
While that may answer the question, including some explanation and description might help us understand your approach and enlighten us as to why your answer stands out – deW1 Apr 08 '15 at 20:55
  • Don't just copy your answer to another question. If you need to do that, then it suggests that one is a duplicate of the other, which they are and I voted to close. – Artjom B. Apr 08 '15 at 21:12
  • It's a duplicate. Saw this thread after. Which turns out has a variation of my answer. – Svein Lindal Apr 09 '15 at 13:03
  • 2
    iterable.next() needs to be contained or yielded by an iterator for the chain to work properly - e.g. yield itertools.chain([iterable.next()], itertools.islice(iterable, n-1)) – Jeremy Brown Dec 16 '15 at 04:56
  • 3
    `next(iterable)`, not `iterable.next()`. – Antti Haapala -- Слава Україні Apr 28 '17 at 12:05
  • 4
    It might make sense to prefix the while loop with the line `iterable = iter(iterable)` to turn your *iterable* into an *iterator* first. [***Iterables* do not have a `__next__` method.**](https://stackoverflow.com/questions/9884132/what-exactly-are-iterator-iterable-and-iteration) – Mateen Ulhaq Nov 24 '18 at 04:53
  • 3
    Raising StopIteration in a generator function is deprecated since PEP 479, so I prefer the explicit return statement of @reclosedev's solution. – loutre Mar 01 '19 at 14:04
  • 2
    @loutre indeed, in Python 3.7 it raises an exception... – drevicko Aug 28 '19 at 09:11
  • Does not work in Python 3.8. To fix, put `iterable = iter(iterable)` at the beginning and a `try-except StopIteration: return` around the while loop (see the sketch below). These modifications make the solution very similar (but not identical) to reclosedev's version. (It has roughly the same performance, but IMO is a bit cleaner.) – dlazesz Mar 13 '21 at 21:58
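A minimal sketch of that fix for Python 3.7+, rendered from the comments above (not the original answerer's code):

import itertools

def grouper(iterable, n):
    iterable = iter(iterable)  # iterables don't have __next__; make an iterator
    while True:
        try:
            yield itertools.chain((next(iterable),),
                                  itertools.islice(iterable, n - 1))
        except StopIteration:
            # PEP 479: letting StopIteration escape would raise RuntimeError
            # on Python 3.7+, so return explicitly instead.
            return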
8

I was working on something today and came up with what I think is a simple solution. It is similar to jsbueno's answer, but I believe his would yield empty groups when the length of the iterable is divisible by n. My answer does a simple check when the iterable is exhausted.

def chunk(iterable, chunk_size):
    """Generates lists of `chunk_size` elements from `iterable`.
    
    
    >>> list(chunk((2, 3, 5, 7), 3))
    [[2, 3, 5], [7]]
    >>> list(chunk((2, 3, 5, 7), 2))
    [[2, 3], [5, 7]]
    """
    iterable = iter(iterable)
    while True:
        chunk = []
        try:
            for _ in range(chunk_size):
                chunk.append(next(iterable))
            yield chunk
        except StopIteration:
            if chunk:
                yield chunk
            break
eidorb
  • 359
  • 1
  • 4
  • 8
3

Here's one that returns lazy chunks; use map(list, chunks(...)) if you want lists.

from itertools import islice, chain
from collections import deque

def chunks(items, n):
    items = iter(items)
    for first in items:
        chunk = chain((first,), islice(items, n-1))
        yield chunk
        deque(chunk, 0)  # exhaust any part of the chunk the caller didn't consume

if __name__ == "__main__":
    for chunk in map(list, chunks(range(10), 3)):
        print chunk

    for i, chunk in enumerate(chunks(range(10), 3)):
        if i % 2 == 1:
            print "chunk #%d: %s" % (i, list(chunk))
        else:
            print "skipping #%d" % i
ekhumoro
  • 107,367
  • 18
  • 208
  • 308
  • Care to comment on how this works? – Marcin Jan 24 '12 at 19:44
  • 3
    A caveat: This generator yields iterables that remain valid only until the next iterable is requested. When using e.g. `list(chunks(range(10), 3))`, all iterables will already have been consumed. – Sven Marnach Jan 25 '12 at 14:19
3

A succinct implementation is:

from itertools import ifilterfalse, izip_longest  # Python 3: filterfalse, zip_longest

chunker = lambda iterable, n: (ifilterfalse(lambda x: x == (), chunk) for chunk in (izip_longest(*[iter(iterable)]*n, fillvalue=())))

This works because [iter(iterable)]*n is a list containing the same iterator n times; zipping over that takes one item from each iterator in the list, which is the same iterator, with the result that each zip-element contains a group of n items.
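The shared-iterator effect can be seen in isolation (shown here with the Python 3 spelling, where izip is just zip):

it = iter('ABCDEF')
list(zip(it, it, it))
# [('A', 'B', 'C'), ('D', 'E', 'F')]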

izip_longest is needed to fully consume the underlying iterable, rather than iteration stopping when the first exhausted iterator is reached, which chops off any remainder from iterable. This results in the need to filter out the fill-value. A slightly more robust implementation would therefore be:

def chunker(iterable, n):
    class Filler(object): pass
    return (ifilterfalse(lambda x: x is Filler, chunk) for chunk in (izip_longest(*[iter(iterable)]*n, fillvalue=Filler)))

This guarantees that the fill value is never an item in the underlying iterable. Using the definition above:

iterable = range(1,11)

map(tuple,chunker(iterable, 3))
[(1, 2, 3), (4, 5, 6), (7, 8, 9), (10,)]

map(tuple,chunker(iterable, 2))
[(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)]

map(tuple,chunker(iterable, 4))
[(1, 2, 3, 4), (5, 6, 7, 8), (9, 10)]

This implementation almost does what you want, but it has issues:

def chunks(it, step):
  start = 0
  while True:
    end = start+step
    yield islice(it, start, end)
    start = end

(The difference is that, because islice does not raise StopIteration or anything else on calls that go beyond the end of it, this will yield forever; there is also the slightly tricky issue that the islice results must be consumed before this generator is iterated again.)

To generate the moving window functionally:

izip(count(0, step), count(step, step))

So this becomes:

(it[start:end] for (start,end) in izip(count(0, step), count(step, step)))

But, that still creates an infinite iterator. So, you need takewhile (or perhaps something else might be better) to limit it:

chunk = lambda it, step: takewhile((lambda x: len(x) > 0), (it[start:end] for (start,end) in izip(count(0, step), count(step, step))))

g = chunk(range(1,11), 3)

tuple(g)
([1, 2, 3], [4, 5, 6], [7, 8, 9], [10])

Marcin
  • 46,667
  • 17
  • 117
  • 197
  • 1. The first code snippet contains the line `start = end`, which doesn't seem to be doing anything, since the next iteration of the loop will start with `start = 0`. Moreover, the loop is infinite -- it's `while True` without any `break`. 2. What is `len` in the second code snippet? 3. All other implementations only work for sequences, not for general iterators. 4. The check `x is ()` relies on an implementation detail of CPython. As an optimisation, the empty tuple is only created once and reused later. This is not guaranteed by the language specification though, so you should use `x == ()`. – Sven Marnach Jan 25 '12 at 14:11
  • 5. The combination of `count()` and `takewhile()` is much more easily implemented using `range()`. – Sven Marnach Jan 25 '12 at 14:11
  • @SvenMarnach: I've edited the code and text in response to some of your points. Much-needed proofing. – Marcin Jan 25 '12 at 14:20
  • 1
    That was fast. :) I still have an issue with the first code snippet: It only works if the yielded slices are consumed. If the user does not consume them immediately, strange things may happen. That's why Peter Otten used `deque(chunk, 0)` to consume them, but that solution has problems as well -- see my comment to his answer. – Sven Marnach Jan 25 '12 at 14:30
  • 1
    I like the last version of `chunker()`. As a side note, a nice way to create a unique sentinel is `sentinel = object()` -- it is guaranteed to be distinct from any other object (sketched below). – Sven Marnach Jan 25 '12 at 14:33
  • I have reversed the order of my answers, so read @SvenMarnach's comments with care. – Marcin Jan 25 '12 at 14:35
  • @SvenMarnach: Nice tip on sentinels - that didn't occur to me. – Marcin Jan 25 '12 at 14:36
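For reference, a minimal sketch of the sentinel tip applied to chunker() above (the adaptation is mine, using the Python 3 names filterfalse and zip_longest):

from itertools import filterfalse, zip_longest

def chunker(iterable, n):
    sentinel = object()  # guaranteed distinct from every item in the iterable
    return (filterfalse(lambda x: x is sentinel, chunk)
            for chunk in zip_longest(*[iter(iterable)] * n, fillvalue=sentinel))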
1

"Simpler is better than complex" - a straightforward generator a few lines long can do the job. Just place it in some utilities module or so:

def grouper(iterable, n):
    iterable = iter(iterable)
    count = 0
    group = []
    while True:
        try:
            group.append(next(iterable))
            count += 1
            if count % n == 0:
                yield group
                group = []
        except StopIteration:
            yield group
            break
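For example:

list(grouper([1, 2, 3, 4, 5, 6, 7], 3))
# [[1, 2, 3], [4, 5, 6], [7]]

Note that when the length of the iterable is an exact multiple of n, this yields a trailing empty group, as eidorb's answer above points out.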
jsbueno
  • 86,446
  • 9
  • 131
  • 182
1

I forget where I found the inspiration for this. I've modified it a little to work with MSI GUIDs in the Windows Registry:

def nslice(s, n, truncate=False, reverse=False):
    """Splits s into n-sized chunks, optionally reversing the chunks."""
    assert n > 0
    while len(s) >= n:
        if reverse: yield s[:n][::-1]
        else: yield s[:n]
        s = s[n:]
    if len(s) and not truncate:
        yield s

reverse doesn't apply to your question, but it's something I use extensively with this function.

>>> [i for i in nslice([1,2,3,4,5,6,7], 3)]
[[1, 2, 3], [4, 5, 6], [7]]
>>> [i for i in nslice([1,2,3,4,5,6,7], 3, truncate=True)]
[[1, 2, 3], [4, 5, 6]]
>>> [i for i in nslice([1,2,3,4,5,6,7], 3, truncate=True, reverse=True)]
[[3, 2, 1], [6, 5, 4]]
Zach Young
  • 7,809
  • 4
  • 29
  • 48
  • This answer is close to the one I started with, but not quite: http://stackoverflow.com/a/434349/246801 – Zach Young Jan 24 '12 at 18:17
  • 1
    This only works for sequences, not for general iterables. – Sven Marnach Jan 25 '12 at 14:15
  • @SvenMarnach: Hi Sven, yes, thank you, you are absolutely correct. I saw the OP's example which used a list (sequence) and glossed over the wording of the question, assuming they meant sequence. Thanks for pointing that out, though. I didn't immediately understand the difference when I saw your comment, but have since looked it up. `:)` – Zach Young Jan 25 '12 at 16:02
1

Here you go.

def chunksiter(l, chunks):
    # Build all the slices up front and return an iterator over them.
    rl = [l[i:i + chunks] for i in range(0, len(l), chunks)]
    return iter(rl)


def chunksiter2(l, chunks):
    # Lazy variant: yield one slice at a time.
    for i in range(0, len(l), chunks):
        yield l[i:i + chunks]

Examples:

for l in chunksiter([1,2,3,4,5,6,7,8],3):
    print(l)

[1, 2, 3]
[4, 5, 6]
[7, 8]

for l in chunksiter2([1,2,3,4,5,6,7,8],3):
    print(l)

[1, 2, 3]
[4, 5, 6]
[7, 8]


for l in chunksiter2([1,2,3,4,5,6,7,8],5):
    print(l)

[1, 2, 3, 4, 5]
[6, 7, 8]
Carlos Quintanilla
  • 12,197
  • 3
  • 20
  • 25
0

Since Python 3.8, there is a simpler solution using the `:=` (walrus) operator:

import itertools
from typing import Iterator

def grouper(it: Iterator, n: int) -> Iterator[list]:
    while chunk := list(itertools.islice(it, n)):
        yield chunk

usage:

>>> list(grouper(iter('ABCDEFG'), 3))
[['A', 'B', 'C'], ['D', 'E', 'F'], ['G']]