How can I do an overlapping sequence count in Biopython?

Question

Biopython's .count() methods, like Python's str.count(), perform a non-overlapping count. How can I do an overlapping one?

For example, these code snippets return 2, but I want the answer 3:

>>> from Bio.Seq import Seq
>>> Seq('AAAA').count('AA')
2
>>> 'AAAA'.count('AA')
2

score 18 · Accepted Answer · answered Jul 11 '17 at 13:10

As of Biopython 1.70, there is a new Seq.count_overlap() method, which accepts optional start and end arguments:

>>> from Bio.Seq import Seq
>>> Seq('AAAA').count_overlap('AA')
3
>>> Seq('AAAA').count_overlap('AA', 1, 4)
2

This method is also implemented for the MutableSeq and UnknownSeq classes:

>>> from Bio.Seq import MutableSeq, UnknownSeq
>>> MutableSeq('AAAA').count_overlap('AA')
3
>>> UnknownSeq(4, character='A').count_overlap('AA')
3

Disclaimer: I co-contributed the .count_overlap() methods with Peter Cock; see 97709cc.
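
If you are stuck on a Biopython release older than 1.70, a similar count can be sketched in plain Python with repeated str.find() calls. This is only an illustration of the idea, not Biopython's actual implementation; the function name count_overlap_str is hypothetical, and it works on plain strings, so a Seq would first need converting with str(seq). It also assumes non-negative start and end values.

```python
def count_overlap_str(text, sub, start=0, end=None):
    """Count occurrences of sub in text[start:end], allowing overlaps.

    Hypothetical helper for illustration; assumes start and end are
    non-negative (no slice-style negative indexing).
    """
    if not sub:
        raise ValueError("sub must be a non-empty string")
    if end is None:
        end = len(text)
    count = 0
    pos = text.find(sub, start, end)
    while pos != -1:
        count += 1
        # Resume one character after the last hit rather than after the
        # whole match, so overlapping occurrences are also found.
        pos = text.find(sub, pos + 1, end)
    return count


print(count_overlap_str("AAAA", "AA"))
print(count_overlap_str("AAAA", "AA", 1, 4))
```

On the examples above this should agree with Seq.count_overlap(), giving 3 and 2 respectively.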

l0o0 · Answer 2 · 2017-07-12T05:17:53.587

score 9

I've encountered this problem before and used Python's re module to solve it.

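import re
# The lookahead (?=...) consumes no characters, so overlapping matches are found.
# The result is named `matches` to avoid shadowing the built-in all().
matches = re.findall(r'(?=(AA))', 'AAAA')
counts = len(matches)  # 3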

You can get more details in this thread

edited Jul 12 '17 at 05:17

answered Jul 12 '17 at 03:18


1

This is a valid approach, and you could add start and end arguments easily enough, but it's not very memory efficient for large sequences. re.finditer would be better for that. Also, UnknownSeq type objects require a rather different approach to maximize memory efficiency – Chris_Rands Aug 10 '17 at 14:34
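
To make the comment above concrete, here is a rough sketch of a lazy count with re.finditer that also takes start and end arguments. The function name count_overlap_re is made up for this example. It relies on the pos and endpos parameters of a compiled pattern's finditer() method; because the string is treated as if it ended at endpos, the lookahead cannot match past end. It assumes non-negative start and end and does not cover UnknownSeq.

```python
import re


def count_overlap_re(text, sub, start=0, end=None):
    """Count overlapping occurrences of sub in text[start:end].

    Hypothetical helper for illustration; matches are consumed one at a
    time, so no list of all matches is ever held in memory.
    """
    if end is None:
        end = len(text)
    # re.escape() keeps characters such as '.' or '*' in sub literal.
    pattern = re.compile("(?=" + re.escape(sub) + ")")
    # pos/endpos restrict the search window without copying the string.
    return sum(1 for _ in pattern.finditer(text, start, end))


print(count_overlap_re("AAAA", "AA"))
print(count_overlap_re("AAAA", "AA", 1, 4))
```

Summing over the iterator avoids building the intermediate list that re.findall returns, which is where the memory saving on long sequences comes from.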

Michael Hall · Answer 3 · 2017-08-14T08:58:49.933

score 4

You can use finditer from Python's re module. The advantage of this approach is that it also gives you the indices of the matches, which could be handy down the track.

>>> import re
>>> matches = re.finditer(r'(?=(AA))', 'AAAA')
>>> indices = [match.span(1) for match in matches]
>>> indices
[(0, 2), (1, 3), (2, 4)]
>>> num_matches = len(indices)
>>> num_matches
3

edited Aug 14 '17 at 08:58

answered Aug 14 '17 at 08:53

