Split a string at uppercase letters, but only if a lowercase letter follows in Python

Question

I am using pdfminer.six in Python to extract long text data. Unfortunately, pdfminer does not always handle paragraphs and text wrapping well. For example, I got the following output:

"2018Annual ReportInvesting for Growth and Market LeadershipOur CEO will provide you with all further details below."

--> "2018 Annual Report Investing for Growth and Market Leadership Our CEO will provide you with all further details below."

Now I would like to insert a space whenever a lowercase letter (or a digit) is followed by a capital letter and then a lowercase letter, so that "2018Annual" becomes "2018 Annual" and "ReportInvesting" becomes "Report Investing", while "...CEO..." remains "...CEO...".

I only found the question "Split a string at uppercase letters" and the answer at https://stackoverflow.com/a/3216204/14635557, but could not adapt them. Unfortunately, I am completely new to Python.

Even if you are new to Python, you should still try some code and post what you have tried before asking for a solution. - Flavio Moraes, Nov 14 '20 at 14:55

score 5 · Accepted Answer · answered Nov 14 '20 at 14:08

We can try using re.sub here for a regex approach:

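import re

inp = "2018Annual ReportInvesting for Growth and Market LeadershipOur CEO will provide you with all further details below."
# Insert a space before each capital letter that follows a lowercase letter, digit, or underscore
inp = re.sub(r'(?<![A-Z\W])(?=[A-Z])', ' ', inp)
print(inp)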

This prints:

2018 Annual Report Investing for Growth and Market Leadership Our CEO will provide you with all further details below.

The regex inserts a space at every position where:

(?<![A-Z\W])  what precedes is neither a capital letter nor a
              non-word character, i.e. it is a lowercase letter,
              digit, or underscore (the start of the string also
              passes this check)
(?=[A-Z])     and what follows is a capital letter
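
As a rough check of how this pattern behaves at the edges, here is a small sketch. The function names and the stricter second pattern are my own assumptions, not part of the answer above. The stricter one only splits when the capital is itself followed by a lowercase letter, which is closer to what the question literally asks for.

```python
import re

# Pattern from the answer above.
LOOSE = re.compile(r'(?<![A-Z\W])(?=[A-Z])')
# Hypothetical stricter variant: a lowercase letter or digit before,
# and a capital followed by a lowercase letter after.
STRICT = re.compile(r'(?<=[a-z0-9])(?=[A-Z][a-z])')

def split_loose(text):
    return LOOSE.sub(' ', text)

def split_strict(text):
    return STRICT.sub(' ', text)

sample = "2018Annual ReportInvesting for Growth and Market LeadershipOur CEO will provide you with all further details below."

# Both patterns give the same result on the sample from the question.
print(split_loose(sample) == split_strict(sample))

# They differ at the start of the string and before an all-caps word.
print(repr(split_loose("Annual")), repr(split_strict("Annual")))
print(repr(split_loose("theCEO")), repr(split_strict("theCEO")))
```

On the sample text the two agree. The loose pattern adds a leading space when the string starts with a capital and splits "theCEO" into "the CEO", while the stricter variant leaves both alone. Which behavior is preferable depends on the extracted text.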

score 0 · Answer 2 · answered Nov 14 '20 at 14:19

Try splitting with regex:

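import re

# string holds the extracted text; put a space before each capitalized word
temp = re.sub(r"([A-Z][a-z]+)", r" \1", string).split()
# split() and join() collapse the doubled spaces this creates
string = ' '.join(temp)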
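
To see why the split() and join() step matters, here is a minimal sketch. It assumes the replacement string is meant to carry a leading space (r" \1"), and the function name is hypothetical.

```python
import re

def space_capitalized_words(string):
    # Put a space in front of every capital letter that starts a run
    # of lowercase letters.
    spaced = re.sub(r"([A-Z][a-z]+)", r" \1", string)
    # Words that already had a space in front now have two, so split on
    # whitespace and rejoin with single spaces.
    return ' '.join(spaced.split())

print(space_capitalized_words("2018Annual ReportInvesting for Growth"))

# An all-caps word is left alone, but a trailing lowercase letter
# changes that.
print(space_capitalized_words("CEO"))
print(space_capitalized_words("CEOs"))
```

One limitation of this approach: "CEO" stays intact, but "CEOs" becomes "CE Os", because "Os" looks like a capitalized word to the pattern.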

answered by conjectures

score 0 · Answer 3 · answered Nov 14 '20 at 14:40

I believe the code below gives the required result.

import re
# text holds the extracted string: split lowercase|uppercase first, then digit|letter
temp = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
temp = re.sub(r"(\d)([A-Za-z])", r"\1 \2", temp)

I still find complex regular expressions a bit challenging, hence the need to split the process into two expressions. Perhaps someone better at regular expressions can improve on this to show how it can be achieved in a more elegant way.
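
One possible way to fold the two substitutions into a single pass is to use lookarounds joined by alternation. This is only a sketch: the function name is made up, and it is intended to behave like the two calls above, not to be the definitive answer.

```python
import re

# Either: a lowercase letter before and a capital after,
# or: a digit before and any ASCII letter after.
BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])|(?<=\d)(?=[A-Za-z])")

def add_missing_spaces(text):
    # The pattern matches empty positions only, so nothing is consumed
    # and no backreferences are needed in the replacement.
    return BOUNDARY.sub(" ", text)

text = "2018Annual ReportInvesting for Growth and Market LeadershipOur CEO will provide you with all further details below."
print(add_missing_spaces(text))
```

Because the lookarounds match positions and not characters, the replacement is just a single space. "CEO" is untouched since none of its letters is preceded by a lowercase letter or a digit.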
