I just need to know HOW to search for TWO strings in a line of my file.

Example: I need the line to include both "protein_coding" and "exon". If it does include both, I will print certain columns of that line. I know how to print them, but I cannot figure out how to search for TWO strings using a regex. Thank you in advance.

Is this correct?: if re.match("protein_coding" & "exon" in line:

dahlia
  • Please see http://stackoverflow.com/questions/24656131/regex-for-existience-of-some-words-whose-order-doesnt-matter/24656216#24656216 – Unihedron Jul 25 '14 at 14:46
  • I am hopeful this question is already asked and has an answer... Duplicate questions are not sign ... – Aditya Jul 25 '14 at 14:48

3 Answers

This regex matches lines that contain both "protein_coding" and "exon" as whole words, provided "protein_coding" appears before "exon".

^.*?\bprotein_coding\b.*?\bexon\b.*$

>>> import re
>>> data = """protein_coding exon foo bar
... foo
... protein_coding
... """
>>> m = re.findall(r'^.*?\bprotein_coding\b.*?\bexon\b.*$', data, re.M)
>>> for i in m:
...     print(i)
... 
protein_coding exon foo bar
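
Because the pattern lists "protein_coding" first, it only matches when that word comes before "exon". A minimal sketch of this behaviour follows; the two sample lines are made up for illustration, and the snippet assumes Python 3.

```python
import re

# Pattern from the answer above: "protein_coding" must appear before "exon".
pattern = re.compile(r'^.*?\bprotein_coding\b.*?\bexon\b.*$', re.M)

# Hypothetical sample lines, invented for this illustration.
forward = "gene1 protein_coding exon 100 200"
reverse = "gene2 exon protein_coding 300 400"

forward_hit = pattern.search(forward) is not None
reverse_hit = pattern.search(reverse) is not None
print(forward_hit, reverse_hit)  # True False
```

If the two words can appear in either order, a different approach is needed, such as separate substring tests or lookaheads.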
Avinash Raj

If the test strings do not require a regular expression, recall that you can also use Python's in operator for plain substring tests:

>>> line='protein_coding other stuff exon more stuff'
>>> "protein_coding" in line and "exon" in line
True

Or, if you want to test an arbitrary number of words, use all with a tuple of target words:

>>> line='protein_coding other stuff exon more stuff'
>>> all(s in line for s in ("protein_coding", "exon", "words"))
False
>>> all(s in line for s in ("protein_coding", "exon", "stuff"))
True
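
To connect this back to the question (printing certain columns of each matching line), here is a rough sketch. It assumes the file is tab-separated and that columns 0, 3 and 4 are the ones wanted; the function name matching_columns, the sample data and the column indices are all hypothetical, and the snippet assumes Python 3.

```python
import io

def matching_columns(lines, targets, columns, sep="\t"):
    """Yield the selected columns of every line containing all target substrings."""
    for line in lines:
        if all(t in line for t in targets):
            fields = line.rstrip("\n").split(sep)
            yield [fields[i] for i in columns]

# Hypothetical tab-separated data standing in for the real file;
# with a real file, pass the object returned by open(...) instead.
sample = io.StringIO(
    "chr1\tprotein_coding\texon\t100\t200\n"
    "chr1\tprotein_coding\tCDS\t100\t200\n"
    "chr2\tlincRNA\texon\t300\t400\n"
)

rows = list(matching_columns(sample, ("protein_coding", "exon"), (0, 3, 4)))
for row in rows:
    print("\t".join(row))
```

Note that in is a substring test, so "exon" would also match a line containing "exons"; if that matters, a regex with word boundaries is the safer choice.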

And if the matches do require a regex and you want to test several unrelated patterns, use all with a list comprehension:

>>> import re
>>> p1=re.compile(r'\b[a-z]+_coding\b')
>>> p2=re.compile(r'\bexon\b')
>>> li=[p.search(line) for p in [p1, p2]]
>>> li
[<_sre.SRE_Match object at 0x10856d988>, <_sre.SRE_Match object at 0x10856d9f0>]
>>> all(e for e in li)
True 
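
The same idea can be wrapped in a small helper so it can be reused per line. This is only a sketch: the name matches_all is made up here, and the snippet assumes Python 3. Passing a generator to all lets it stop at the first pattern that fails.

```python
import re

# The same two patterns as above, compiled once.
patterns = [re.compile(r'\b[a-z]+_coding\b'), re.compile(r'\bexon\b')]

def matches_all(line, patterns):
    """Return True if every pattern is found somewhere in the line."""
    return all(p.search(line) for p in patterns)

both = matches_all('protein_coding other stuff exon more stuff', patterns)
# "exons" does not satisfy \bexon\b, so this line fails the second pattern.
partial = matches_all('protein_coding other stuff exons', patterns)
print(both, partial)  # True False
```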
dawg

Using anchors and lookahead assertions:

>>> re.findall(r'(?m)^(?=.*protein_coding)(?=.*exon).+$', data)

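The inline (?m) modifier enables multi-line mode, so ^ and $ anchor at the start and end of each line. Each lookahead checks for one substring starting from the beginning of the line, so a line matches regardless of the order in which the two substrings appear.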
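
A self-contained sketch of the lookahead approach, assuming Python 3. The sample data is hypothetical and includes one line with the words in reverse order to show that order does not matter.

```python
import re

# Hypothetical sample data; the second line has the words in reverse order.
data = """protein_coding exon foo bar
exon foo protein_coding
foo
protein_coding
"""

# The inline (?m) flag is kept at the very start of the pattern,
# which newer Python versions require for inline global flags.
hits = re.findall(r'(?m)^(?=.*protein_coding)(?=.*exon).+$', data)
print(hits)  # ['protein_coding exon foo bar', 'exon foo protein_coding']
```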

hwnd