
I'm trying to parse a string containing a name and a degree. I have a long list of these. Some contain no degrees, some contain one, and some contain multiple.

Example strings:

Sam da Man J.D.
Green Eggs Jr. Ed.M.
Argle Bargle Sr. MA
Cersei Lannister M.A. Ph.D. 

As far as I can tell, the degrees come in the following patterns:

x.x.
x.x.x.
x.x.xx.
x.xx.
xx.x.
x.xxx.
two caps (ex: 'MA')

How would I parse this?

I'm new to regex and breaking down this problem has proved very time-consuming. I've been using this post and tried split = re.split('\s+|([.])',s) and split = re.split('\s+|\.',s) but these still split on the first space.

In response to the first comment, I have thought about the degree designations. I've been trying to write a regex that recognizes 'x.x' followed by a wildcard, because several of the degree patterns look like x.x(something): x.x., x.x.x., x.x.xx.

and then I'd have a few more to classify.

Alternatively, classifying the name might be easier?

Or even listing the degrees in a collection and searching for them?

{'M.A.T.', 'Ph.D.', 'MA', 'J.D.', 'Ed.M.', 'M.A.', 'M.B.A.', 'Ed.S.', 'M.Div.', 'M.Ed.', 'RN', 'B.S.Ed.'}
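A minimal sketch of this last idea, assuming the degrees always sit at the end of the string and are separated by whitespace. The function name split_name_and_degrees is hypothetical and only for illustration:

```python
# Known degree designations (the set listed above).
DEGREES = {'M.A.T.', 'Ph.D.', 'MA', 'J.D.', 'Ed.M.', 'M.A.', 'M.B.A.',
           'Ed.S.', 'M.Div.', 'M.Ed.', 'RN', 'B.S.Ed.'}


def split_name_and_degrees(s):
    """Return (name, degrees) by peeling known degrees off the right end.

    Assumption: degrees only appear as trailing whitespace-separated tokens.
    """
    tokens = s.split()
    degrees = []
    while tokens and tokens[-1] in DEGREES:
        # Insert at the front so the original left-to-right order is kept.
        degrees.insert(0, tokens.pop())
    return ' '.join(tokens), degrees


for line in ['Sam da Man J.D.', 'Green Eggs Jr. Ed.M.',
             'Argle Bargle Sr. MA', 'Cersei Lannister M.A. Ph.D. ']:
    print(split_name_and_degrees(line))
```

Because "Jr." and "Sr." are not in the set, they stay with the name. The drawback is that any degree missing from the set is silently treated as part of the name.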
goldisfine

2 Answers


Try replacing the suffixes "Jr.", "Sr.", ... with placeholders such as "Jr~", "Sr~", ... so that their dots cannot be mistaken for part of a degree. This is the regular expression for doing that:

/ (Jr|Sr)\. / $1~ /g


You obtain these strings:

Sam da Man J.D.
Green Eggs Jr~ Ed.M.
Argle Bargle Sr~ MA
Cersei Lannister M.A. Ph.D. 

Now you can easily capture degrees with this regular expression:

/ (MA|RN|([A-Z][a-z]?[a-z]?\.)+) /g

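Since the question uses Python's re module, the two steps might be sketched as follows. This is a variation rather than a literal translation: the literal spaces around the degree pattern are swapped for the lookarounds (?<!\S) and (?!\S), because a pattern that consumes the trailing space cannot match a second degree that directly follows the first (as in "M.A. Ph.D."), nor a degree at the very end of a string. The names SUFFIX, DEGREE and find_degrees are hypothetical.

```python
import re

# Step 1: "Jr." / "Sr." -> "Jr~" / "Sr~" so the dot is not read as a degree.
SUFFIX = re.compile(r'\b(Jr|Sr)\.')

# Step 2: "MA", "RN", or one or more dotted groups such as "Ph." + "D.".
# The lookarounds require whitespace (or a string boundary) on both sides
# without consuming it.
DEGREE = re.compile(r'(?<!\S)(?:MA|RN|(?:[A-Z][a-z]?[a-z]?\.)+)(?!\S)')


def find_degrees(s):
    """Return the list of degree tokens found in s."""
    protected = SUFFIX.sub(r'\1~', s)
    return DEGREE.findall(protected)


for line in ['Sam da Man J.D.', 'Green Eggs Jr. Ed.M.',
             'Argle Bargle Sr. MA', 'Cersei Lannister M.A. Ph.D. ']:
    print(find_degrees(line))
```

Note that a single dotted initial in a name (for example "J.") would still be reported as a degree by this pattern.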

Yossarian
fazen

You can use this:

'[ ](MA|RN|([A-Z][a-z]?[a-z]?\.){2,3})'

Because of the {2,3} quantifier, it does not match any word with only one dot, so "Jr." and "Sr." are skipped.
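A rough sketch of how this pattern could be used from Python, assuming everything before the first match is the name. The inner group is made non-capturing here so that re.findall returns plain strings instead of tuples; the names DEGREE and parse are hypothetical.

```python
import re

# Leading space, then "MA", "RN", or two to three dotted groups.
DEGREE = re.compile(r'[ ](MA|RN|(?:[A-Z][a-z]?[a-z]?\.){2,3})')


def parse(s):
    """Return (name, degrees); the name is whatever precedes the first degree."""
    degrees = DEGREE.findall(s)
    first = DEGREE.search(s)
    name = s[:first.start()] if first else s.strip()
    return name, degrees


for line in ['Sam da Man J.D.', 'Green Eggs Jr. Ed.M.',
             'Argle Bargle Sr. MA', 'Cersei Lannister M.A. Ph.D. ']:
    print(parse(line))
```

Since the pattern has no trailing boundary, " MA" would also match at the start of a longer all-caps word; appending (?!\S) to the pattern would be one way to guard against that.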

MIE