Using ^ to match beginning of line in Python regex

Question

I'm trying to extract publication years from ISI-style data from the Thomson-Reuters Web of Science. The line for "Publication Year" looks like this (at the very beginning of a line):

PY 2015

For the script I'm writing I have defined the following regex function:

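import re

# Read the whole export into a single string
with open('savedrecs.txt') as f:
    wosrecords = f.read()

def findyears():
    # Unanchored pattern: matches "PY dddd" anywhere in the text
    result = re.findall(r'PY (\d\d\d\d)', wosrecords)
    print(result)

findyears()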

This, however, gives false positive results because the pattern may appear elsewhere in the data.

So, I want to match the pattern only at the beginning of a line. Normally I would use ^ for this purpose, but r'^PY (\d\d\d\d)' fails to match my results. On the other hand, using \n seems to do what I want, but that might lead to further complications for me.

Use [`re.MULTILINE`](https://docs.python.org/2/library/re.html#re.MULTILINE) to change semantics of `^`: `re.findall(r'^PY (\d\d\d\d)', wosrecords, re.MULTILINE)` — Amadan, Jul 14 '15 at 07:33

score 39 · Accepted Answer · edited Mar 09 '21 at 09:40

re.findall(r'^PY (\d\d\d\d)', wosrecords, flags=re.MULTILINE)

should work
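To see why the flag matters, here is a minimal sketch on invented sample text (the record contents and the variable names are made up for illustration and are not real Web of Science data). Without re.MULTILINE, ^ anchors only at the very start of the string, so the pattern finds nothing unless the data happens to begin with a PY line.

```python
import re

# Invented sample text; a real export contains many more fields.
sample = "PT J\nTI Notes on PY 1999 data\nPY 2015\nER\n\nPT J\nPY 2017\nER\n"

# Without the flag, ^ anchors only at the start of the whole string.
without_flag = re.findall(r'^PY (\d\d\d\d)', sample)

# With re.MULTILINE, ^ also anchors right after every newline.
with_flag = re.findall(r'^PY (\d\d\d\d)', sample, flags=re.MULTILINE)

# The unanchored pattern also picks up "PY 1999" inside the title line.
unanchored = re.findall(r'PY (\d\d\d\d)', sample)

print(without_flag)  # []
print(with_flag)     # ['2015', '2017']
print(unanchored)    # ['1999', '2015', '2017']
```

The third result reproduces the false positive described in the question, and the second shows the anchored pattern skipping it.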

edited Mar 09 '21 at 09:40

Neuron - Freedom for Ukraine

answered Jul 14 '15 at 07:35

sinhayash

Wiktor Stribiżew · Answer 2 · 2015-07-14T07:43:09.420

10

Use re.findall with re.M:

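import re

# re.M makes ^ match at the start of every line, not only at the start of the string
p = re.compile(r'^PY\s+(\d{4})', re.M)
test_str = "PY123\nPY 2015\nPY 2017"
print(p.findall(test_str))  # ['2015', '2017']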

EXPLANATION:

^ - Start of a line (due to re.M)
PY - Literal PY
\s+ - 1 or more whitespace
(\d{4}) - Capture group holding 4 digits
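The same breakdown can be kept next to the pattern itself with re.VERBOSE (re.X), which ignores whitespace in the pattern and allows comments. This is only a sketch of an alternative way to write the pattern; the name year_pattern is made up for illustration.

```python
import re

year_pattern = re.compile(r"""
    ^        # start of a line (because of re.M)
    PY       # literal PY
    \s+      # one or more whitespace characters
    (\d{4})  # capture group holding four digits
""", re.M | re.X)

print(year_pattern.findall("PY123\nPY 2015\nPY 2017"))  # ['2015', '2017']
```

One caveat: \s also matches a newline, so a bare "PY" line followed by a line starting with four digits would match as well. If that matters for your data, [ \t]+ restricts the gap to spaces and tabs.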

edited Jul 14 '15 at 07:43

answered Jul 14 '15 at 07:34

Wiktor Stribiżew

Yes, this should work too. What I had missed was the re.M or re.MULTILINE flag, which I didn't know affected the ^. – chrisk Jul 14 '15 at 07:56
Actually, that is the only function of `re.M`: to force `^` and `$` to match at the beginning and the end of line (before the `\n`) respectively. – Wiktor Stribiżew Jul 14 '15 at 07:57
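A short sketch of that point, on invented sample text: with re.M, ^ and $ can be combined to require that a whole line has the expected shape, and the inline flag (?m) at the start of the pattern has the same effect as passing re.M.

```python
import re

# Invented sample; the middle line has trailing text after the year.
text = "PY 2015\nPY 2017 (reprint)\nPY 2019"

# ^ and $ together require the line to be exactly "PY " plus four digits.
whole_lines = re.findall(r'^PY (\d{4})$', text, flags=re.M)

# The inline flag (?m) is equivalent to passing re.M.
inline = re.findall(r'(?m)^PY (\d{4})$', text)

# Without re.M, ^ and $ anchor only at the ends of the whole string.
no_flag = re.findall(r'^PY (\d{4})$', text)

print(whole_lines)  # ['2015', '2019']
print(inline)       # ['2015', '2019']
print(no_flag)      # []
```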

score 2 · Answer 3 · answered Sep 02 '17 at 18:17

In this particular case there is no need to use regular expressions, because the searched string is always 'PY' and is expected to be at the beginning of the line, so one can use str.find for this job. The find method returns the index at which the substring first occurs in the given string or line, so if it is found at the start of the string the returned value is 0 (-1 if not found at all), i.e.:

In [12]: 'PY 2015'.find('PY')
Out[12]: 0

In [13]: ' PY 2015'.find('PY')
Out[13]: 1

It may be a good idea to strip the surrounding whitespace first, i.e.:

In [14]: '  PY 2015'.find('PY')
Out[14]: 2

In [15]: '  PY 2015'.strip().find('PY')
Out[15]: 0

Then, if only the year is of interest, it can be extracted with split, i.e.:

In [16]: '  PY 2015'.strip().split()[1]
Out[16]: '2015'
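Applied to a whole file, the idea might look like the sketch below. Note that it uses str.startswith instead of find, since startswith checks the beginning of the line directly, and it does not strip leading whitespace because the question says the tag sits at the very beginning of the line. The function name find_years and the sample text are invented for illustration.

```python
def find_years(text):
    """Collect the second field of every line that starts with 'PY '."""
    years = []
    for line in text.splitlines():
        # Only lines that begin with the tag are considered.
        if line.startswith('PY '):
            parts = line.split()
            # Skip lines that carry the tag but no value.
            if len(parts) > 1:
                years.append(parts[1])
    return years

sample = "PT J\nTI Notes on PY 1999 data\nPY 2015\nER\nPY 2017\n"
print(find_years(sample))  # ['2015', '2017']
```

Unlike the regex versions, this does not check that the value is four digits; an isdigit() test on parts[1] would be one way to add that.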
