3

I've used optical character recognition (OCR) on a historic directory, and am trying to clean up the text with Microsoft Word. Specifically, I need some help writing a regular expression to combine two lines into one. For example, text that reads

John Smith, 87 Bank

Bldg

should actually be

John Smith, 87 Bank Bldg

I've tried several approaches, but haven't been successful at all. Can anyone help me with this?

Kyle.
user10322
  • Do you need to join texts that are separated by new lines? – untitledprogrammer Jun 24 '15 at 20:41
  • Yes but only in certain cases. Some lines are perfectly ok – user10322 Jun 24 '15 at 20:46
  • Sounds good. My solution should work then. Could you please show us a glimpse of your data? – untitledprogrammer Jun 24 '15 at 20:48
  • Sure, here you go: – user10322 Jun 24 '15 at 20:58
  • Employers' Liability Assurance Corporation (Limited) of London, England, Lawfford & McK3m, General Agents, 19 and 21 Chamber of Commerce Fidelity and Casualty Company of New York , BLrekhead & Son, Agents-,. 306 Water
    Fidelity and Casualty Oo of New York ,. Robt Schaefer,, res mngr, 22 s Holliday Frankfort Marine Accident and Pl ate-Glass Insurance Company of Frankfort-on-the-Main, Germany, Spear & Burbank, General Agents, 10-12 s Holliday . General Accident Assurance Corporation of Scotland S03 Merchants* Nat’l Bank Bldg GREAT EASTERN CASUALTY and Indemnity Company of NewYork,
    – user10322 Jun 24 '15 at 20:59
  • If you want it in another format let me know – user10322 Jun 24 '15 at 21:00
  • This won't help much because the formatting goes away when you comment. Attach screenshots of the data in hand and of how you want it to be processed. – untitledprogrammer Jun 24 '15 at 21:02
  • http://imgur.com/ztYCaDU,BoZYHo6#0 The first pic is what it looks like now, the second is what I'd like it to look like – user10322 Jun 24 '15 at 21:11
  • In what programming language are you trying to perform this combine transformation? – eliasah Sep 23 '15 at 16:00

3 Answers

1

I have a solution that may not be very standard but should meet your need. Copy all of your data into an advanced text editor such as Notepad++ or Sublime Text. Next, press CTRL+H to open the find-and-replace dialog. Find: '\n' and Replace with: ''.
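For comparison, the same find-and-replace can be sketched in Python. This is only an illustration: the sample text is built from the question's example (the second entry is made up), and replacing with a space rather than an empty string is an assumption added here to keep words from running together.

```python
# Sample text based on the example in the question; "Jane Doe" is a hypothetical entry.
text = "John Smith, 87 Bank\nBldg\nJane Doe, 12 Main\n"

# Equivalent of Find '\n' / Replace '' in the editor: every line break is removed.
joined_all = text.replace("\n", "")

# Assumption: replacing with a space instead keeps "Bank" and "Bldg" apart.
joined_with_space = text.replace("\n", " ").strip()

print(joined_all)
print(joined_with_space)
```

Either way, every line break in the text is removed, including the ones between lines that were already correct.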

0

Complete code in python:

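import re
# Read the whole OCR'd file into one string.
with open('path/to/file', 'r') as f:
    contents = f.read()
# re.MULTILINE lets ^ and $ match at every line boundary, not only at the ends of the file.
matches = re.findall(r'^(.*?)\n(.*?)$', contents, re.MULTILINE)
# Join each captured pair of lines with a space, as in the desired output in the question.
joined = [' '.join([i[0], i[1]]) for i in matches]
print('\n'.join(joined))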

I tested this out on Rubular, and it works on your example. It should generalize to others as well. To walk you through the regular expression: ^ and $ match the beginning and end of a line, respectively (in Python, this requires the re.MULTILINE flag; without it they match only at the start and end of the whole string), while (.*?) uses non-greedy wildcard matching to capture, first, everything up to the newline character, then everything up to the next end of line.
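As a quick check of the explanation above, here is a minimal sketch that runs the same pattern on a small string. The sample text is an assumption built from the question's example (the "Jane Doe" entry is made up), and joining with a space is chosen to match the desired output.

```python
import re

# Sample text built from the question's example; the second entry is hypothetical.
sample = "John Smith, 87 Bank\nBldg\nJane Doe, 12 Main\nSt"

# re.MULTILINE makes ^ and $ match at the start and end of every line.
pairs = re.findall(r'^(.*?)\n(.*?)$', sample, flags=re.MULTILINE)

# Each match is a (first_line, second_line) tuple; join the two halves with a space.
joined = [' '.join(pair) for pair in pairs]
print(joined)
```

Note that the pattern pairs up every two consecutive lines, so a line that is already complete would still be merged with the one after it. Deciding which lines actually need joining is a separate step that this sketch does not cover.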

Kyle.
-1

simple... use an advanced text editor like Sublime Text, Notepad, Vim or Atom.

paste the whole bunch of data into any of these editors and just remove '\n' by replacing it with ""

use the FIND AND REPLACE function

in Notepad, Sublime Text, Vim and Atom

use Ctrl+H

in the find box type "\n" and in the replace box ""

done :) ;)

tnx

  • This answer is the same as the previous answer. It also doesn't do a very good job of explaining the solution, uses poor formatting, and lots of abbreviations, emoticons, and text message style vernacular. If you have content not contained in the previous answer, then please edit your answer with a proper explanation and proper copy editing. If not, then consider deleting this answer. – AN6U5 Jul 25 '15 at 20:30