-2

I have a string without spaces (shown below split into groups of three).

ATG AGC TAA CTC AGG TGA TGG GGA ATG CCC CGC TAA

I need to extract every substring that starts with ATG and ends just before one of TAG, TGA, or TAA (the match should not include the end marker). How do I extract from the string to get

ATGAGC and ATGCCCCGCTAA using regular expressions?

What I have tried:

pattern = re.compile(r'(?=(ATG(?:...)*?)(?=TAG|TGA|TAA))')

It does not work as expected.

Soviut
Krishna Kalyan
  • What result do you get? – Soviut Sep 18 '16 at 04:45
  • I feel like this same question gets asked around this time every year... For instance [here](http://stackoverflow.com/q/18731894/) and [here](http://stackoverflow.com/q/16260794/) and [here](http://stackoverflow.com/q/19761908/) and [here](http://stackoverflow.com/q/31757876) and... – Dan Sep 18 '16 at 04:46
  • Why does `ATGCCCCGCTAA` contain `TAA` when `ATGAGC` doesn't? – Mazdak Sep 18 '16 at 04:47
  • @Dan - Haha. Perhaps you could link to the search page instead. – OneCricketeer Sep 18 '16 at 04:52
  • @cricket_007 Probably all I need to do is link to a search for the tags [tag:regex] AND [tag:python] AND [tag:bioinformatics], and every question would be some variation of regex patterns for genome patterns. I'd need to exclude the [tag:pandas] tag, though. – Dan Sep 18 '16 at 04:54

3 Answers

1

Use the following regex:

In [14]: regex = re.compile(r'(ATG.*?)(?:TAG|TGA|TAA)')

In [15]: regex.findall(s)
Out[15]: ['ATGAGC', 'ATGGGGAATGCCCCGC']

Note that these matches do not contain the trailing stop codon (TAG, TGA or TAA).
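As a quick check of what this pattern does on the question's sequence, here is a minimal sketch. It assumes `s` is the sequence from the question with the display spaces removed (building it with `replace` is my assumption; the session above does not show how `s` was defined). Because `.*?` advances one character at a time rather than three, the second match starts at an `ATG` that does not sit on a triplet boundary of the sequence.

```python
import re

# Assumption: the question's sequence, with the display spaces stripped.
s = "ATG AGC TAA CTC AGG TGA TGG GGA ATG CCC CGC TAA".replace(" ", "")

regex = re.compile(r'(ATG.*?)(?:TAG|TGA|TAA)')

# finditer exposes match positions, which findall hides.
matches = [(m.start(1), m.group(1)) for m in regex.finditer(s)]

for start, text in matches:
    # start % 3 == 0 means the match begins on a triplet boundary of s.
    print(start, start % 3 == 0, text)
```

If only in-frame matches are wanted, filtering on `start % 3 == 0` or consuming three characters at a time in the pattern are two possible options.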

Mazdak
0
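import re

# Capture from ATG up to, but not including, the stop codon.
pattern = re.compile(r'(ATG[A-Z]+)(?:TAG|TGA|TAA)')
results = pattern.search('ATGCCCCGCTAA')

# groups(0) returns all captured groups as a tuple.
print(results.groups(0))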

Results in

('ATGCCCCGC',)
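One thing to keep in mind with this pattern: `[A-Z]+` is greedy, so on a longer sequence it runs to the last stop codon it can reach rather than the first. A minimal sketch, assuming the full sequence from the question with the spaces removed (the names `seq`, `greedy` and `lazy` are mine, for illustration only):

```python
import re

# Assumption: the question's full sequence without spaces.
seq = "ATGAGCTAACTCAGGTGATGGGGAATGCCCCGCTAA"

# Greedy: [A-Z]+ grabs as much as it can, then backtracks to the LAST stop.
greedy = re.findall(r'(ATG[A-Z]+)(?:TAG|TGA|TAA)', seq)

# Lazy: [A-Z]+? stops at the FIRST stop it meets after each ATG.
lazy = re.findall(r'(ATG[A-Z]+?)(?:TAG|TGA|TAA)', seq)

print(greedy)  # one match spanning everything except the final TAA
print(lazy)    # two shorter matches
```

The lazy form still moves one character at a time, so it does not keep matches aligned to whole codons.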
Soviut
0

This works, given that endings are not included:

>>> re.findall(r'(ATG(?:...)*?)(?:TAG|TGA|TAA)', seq)
['ATGAGC', 'ATGCCCCGC']

?: marks a group as non-capturing, so the text it matches is not returned in the result.

...: matches exactly three characters. Alternatives are .{3} or the more restrictive [ACTG]{3}

*?: matches as few repetitions as possible (lazy). Without the ?, the longest possible match is returned.
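Putting those pieces together, here is a small sketch that uses the stricter `[ACGT]{3}` form and strips whitespace first, since the question displays the sequence with spaces. The helper name `find_orfs`, the whitespace stripping and the upper-casing are my assumptions, not part of the answer above.

```python
import re

# ATG, then as few whole codons as possible (counted from the ATG),
# up to but not including a stop codon.
ORF = re.compile(r'(ATG(?:[ACGT]{3})*?)(?:TAG|TGA|TAA)')

def find_orfs(raw):
    """Return ATG-to-stop stretches, stop codon excluded (hypothetical helper)."""
    seq = "".join(raw.split()).upper()  # drop spaces/newlines, normalise case
    return ORF.findall(seq)

print(find_orfs("ATG AGC TAA CTC AGG TGA TGG GGA ATG CCC CGC TAA"))
```

Using `[ACGT]` instead of `.` means stray characters such as spaces or `N` cannot be swallowed as part of a codon; whether that is desirable depends on the input data.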

coder.in.me