Regular Expressions in Word

Question

I've used optical character recognition (OCR) on a historic directory, and I am trying to clean up the text with Microsoft Word. Specifically, I need some help writing a regular expression to combine two lines into one. For example, something that reads

John Smith, 87 Bank

Bldg

should actually be

John Smith, 87 Bank Bldg

I've tried several approaches, but haven't been successful at all. Can anyone help me with this?

Sounds good. My solution should work then. Could you please show us a glimpse of your data? — untitledprogrammer, Jun 24 '15 at 20:48
Employers' Liability Assurance Corporation (Limited) of London, England, Lawfford & McK3m, General Agents, 19 and 21 Chamber of Commerce Fidelity and Casualty Company of New York , BLrekhead & Son, Agents-,. 306 Water
Fidelity and Casualty Oo of New York ,. Robt Schaefer,, res mngr, 22 s Holliday Frankfort Marine Accident and Pl ate-Glass Insurance Company of Frankfort-on-the-Main, Germany, Spear & Burbank, General Agents, 10-12 s Holliday . General Accident Assurance Corporation of Scotland S03 Merchants* Nat’l Bank Bldg GREAT EASTERN CASUALTY and Indemnity Company of NewYork, — user10322, Jun 24 '15 at 20:59
This won't help much because the formatting goes away when you comment. Attach screenshots of the data in hand and of how you want it to look after processing. — untitledprogrammer, Jun 24 '15 at 21:02
http://imgur.com/ztYCaDU,BoZYHo6#0 The first pic is what it looks like now, the second is what I'd like it to look like — user10322, Jun 24 '15 at 21:11
What programming language are you trying to use to perform your combine transformation? — eliasah, Sep 23 '15 at 16:00

score 1 · Answer 1 · answered Jun 24 '15 at 20:26

I have a solution that may not be very standard but should suffice for your needs. Copy all of your data into an advanced text editor such as Notepad++ or Sublime Text. Next, press Ctrl+H to open the find-and-replace feature, with the editor's regular-expression (or, in Notepad++, extended) search mode enabled so that \n is treated as a line break. Find: '\n' and replace with: ''.
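For anyone who would rather script this than use an editor, the same find-and-replace can be sketched in Python. This is only an illustration: the helper name remove_newlines and its joiner argument are assumptions, not part of the answer above. The joiner is there because replacing a line break with an empty string fuses "Bank" and "Bldg" together.

```python
def remove_newlines(text, joiner=""):
    """Replace every line break in text with joiner (hypothetical helper)."""
    # splitlines() handles both "\n" and Windows "\r\n" line endings.
    return joiner.join(text.splitlines())


sample = "John Smith, 87 Bank\nBldg\n"
print(remove_newlines(sample))       # same effect as replacing '\n' with ''
print(remove_newlines(sample, " "))  # joins the lines with a space instead
```

Note that, like the editor approach, this removes every line break in the text, so all entries end up on a single line.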

score 0 · Answer 2 · answered Oct 20 '15 at 23:27

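Complete code in Python:

import re

with open('path/to/file', 'r') as f:
    contents = f.read()
# re.MULTILINE makes ^ and $ match at every line, not only at the ends of the string.
matches = re.findall(r'^(.*?)\n(.*?)$', contents, re.MULTILINE)
# Join each pair of lines with a space, as in the example in the question.
joined = [' '.join(pair) for pair in matches]
print('\n'.join(joined))

I tested the pattern on Rubular, and it works on your example. It should generalize to other entries that are split across exactly two lines. To walk you through the regular expression: ^ and $ match the beginning and end of a line, respectively (in Python this requires the re.MULTILINE flag; Rubular's Ruby engine behaves this way by default), while (.*?) uses non-greedy wildcard matching to capture, first, everything up to the newline character, and then everything up to the next end of line.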
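As a quick check of the walkthrough, the sketch below applies the same pattern to an in-memory string instead of a file. The helper name join_line_pairs and the second sample entry are made up for illustration; the re.MULTILINE flag is what lets ^ and $ match at each line in Python.

```python
import re

# Same pattern as in the answer; re.MULTILINE lets ^ and $ match at every line.
PAIR = re.compile(r'^(.*?)\n(.*?)$', re.MULTILINE)


def join_line_pairs(text, joiner=" "):
    """Join lines 1+2, 3+4, ... of text (hypothetical helper)."""
    return [joiner.join(pair) for pair in PAIR.findall(text)]


sample = "John Smith, 87 Bank\nBldg\nJane Doe, 12 Main\nSt"
print(join_line_pairs(sample))
```

The pattern pairs lines strictly two at a time, so it assumes every entry is split across exactly two lines. If the text has blank lines between the fragments, or entries that were never split, the pairs will be misaligned and the input would need to be cleaned up first.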

score -1 · Answer 3 · answered Jul 25 '15 at 13:27

Simple: use an advanced text editor such as Sublime Text, Notepad, Vim, or Atom.

Paste the whole bunch of data into any of these editors, then remove every '\n' by replacing it with "".

Use the find-and-replace function (Ctrl+H): in the find box type "\n", and leave the replace box empty ("").

done :) ;)

tnx


Rajesh Satvara


This answer is the same as the previous answer. It also doesn't do a very good job of explaining the solution, uses poor formatting, and lots of abbreviations, emoticons, and text message style vernacular. If you have content not contained in the previous answer, then please edit your answer with a proper explanation and proper copy editing. If not, then consider deleting this answer. – AN6U5 Jul 25 '15 at 20:30
