13

Biopython's .count() methods, like Python's str.count(), perform a non-overlapping count. How can I do an overlapping one?

For example, these code snippets return 2, but I want the answer 3:

>>> from Bio.Seq import Seq
>>> Seq('AAAA').count('AA')
2
>>> 'AAAA'.count('AA')
2
Chris_Rands

3 Answers

18

As of Biopython 1.70, there is a new Seq.count_overlap() method, which accepts optional start and end arguments:

>>> from Bio.Seq import Seq
>>> Seq('AAAA').count_overlap('AA')
3
>>> Seq('AAAA').count_overlap('AA', 1, 4)
2

This method is also implemented for the MutableSeq and UnknownSeq classes:

>>> from Bio.Seq import MutableSeq, UnknownSeq
>>> MutableSeq('AAAA').count_overlap('AA')
3
>>> UnknownSeq(4, character='A').count_overlap('AA')
3
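
For plain strings, or for a Biopython older than 1.70, the same idea can be sketched with a str.find() loop. This is only an illustration, not Biopython's own code: the function name count_overlap_str and its argument order are hypothetical, chosen to mirror the start and end arguments shown above.

```python
def count_overlap_str(text, sub, start=0, end=None):
    """Count occurrences of sub in text[start:end], allowing overlaps.

    Hypothetical helper for illustration; not part of Biopython.
    """
    if not sub:
        raise ValueError("sub must be a non-empty string")
    if end is None:
        end = len(text)
    count = 0
    pos = text.find(sub, start, end)
    while pos != -1:
        count += 1
        # Resume one character past the last hit so overlapping hits are found.
        pos = text.find(sub, pos + 1, end)
    return count


print(count_overlap_str("AAAA", "AA"))        # 3
print(count_overlap_str("AAAA", "AA", 1, 4))  # 2
```

Stepping forward by one character after each hit, rather than by len(sub), is what makes the count overlapping. A Seq object can be passed through str() first.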

Disclaimer: I co-contributed the .count_overlap() methods with Peter Cock; see commit 97709cc.

Chris_Rands
9

I've encountered this problem before and solved it with Python's re module:

import re
# The lookahead (?=...) consumes no characters, so overlapping matches are found
matches = re.findall(r'(?=(AA))', 'AAAA')
counts = len(matches)

You can get more details in this thread

l0o0
  • 1
    This is a valid approach, and you could add start and end arguments easily enough, but it's not very memory efficient for large sequences. re.finditer would be better for that. Also, UnknownSeq type objects require a rather different approach to maximize memory efficiency – Chris_Rands Aug 10 '17 at 14:34
4

You can use finditer from Python's re module. The advantage of this approach is that it also gives you the indices of the matches, which could be handy down the track.

>>> import re
>>> matches = re.finditer(r'(?=(AA))', 'AAAA')
>>> indices = [match.span(1) for match in matches]
>>> indices
[(0, 2), (1, 3), (2, 4)]
>>> num_matches = len(indices)
>>> num_matches
3
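
If only the count is needed, the matches can be consumed lazily instead of being collected in a list, and a compiled pattern's finditer() takes pos and endpos arguments that can act as start and end. A minimal sketch, assuming a literal (non-regex) search string; the function name count_overlap_re is hypothetical:

```python
import re


def count_overlap_re(text, sub, start=0, end=None):
    """Count overlapping occurrences of sub in text[start:end] using a lookahead.

    Hypothetical helper for illustration only.
    """
    if not sub:
        raise ValueError("sub must be a non-empty string")
    if end is None:
        end = len(text)
    # re.escape() makes sub match literally; the lookahead consumes nothing,
    # so the scan advances one character at a time and overlaps are counted.
    pattern = re.compile("(?=" + re.escape(sub) + ")")
    # The generator yields one match at a time, so no list of matches is built.
    return sum(1 for _ in pattern.finditer(text, start, end))


print(count_overlap_re("AAAA", "AA"))        # 3
print(count_overlap_re("AAAA", "AA", 1, 4))  # 2
```

Because endpos makes the pattern treat the string as if it ended there, the lookahead cannot match past end.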
Michael Hall