4

I've been given a stack of demographic data to look over by the small nonprofit I work for, though as the original electronic files have been misplaced by a predecessor I'm having to reverse-engineer it from a strangely-formatted hardcopy. The data is currently of the form;

  • 00001
  • Name
  • Address 1
  • Postcode
  • Etc

  • 00002

  • Name
  • Address 1
  • Postcode
  • Etc

with each individual a paragraph and each line a field.

While I've managed to scan and OCR the text into Notepad++ as well as strip out the printer's markings and other non-data, I'm struggling to get it transformed into excel (ie each para from the txt forms a new excel row, with lines within each para seperated horizontally into adjacent cells). There are enough entries that copy-pasting them across line by line is infeasible.

Could anyone suggest a better way of doing this?

  • I think such a question is only marginally on-topic here - it is straying a bit into the territory of "how to do X in Excel" (or some other software), but the issue of how to deal with poorly formatted data is one that crops up in statistical analysis. When you say "with each individual a paragraph and each line a field", what is splitting the lines? And is there any difference between how lines and paragraphs are separated? – Silverfish Aug 11 '16 at 12:19
  • 1
    Issues like these sometimes require creativity to resolve. The optimal solution can depend on how much hardcopy you have, how consistently it is formatted, how accurately OCR can perform, whether there are additional data, how redundant the information might be, and much more. As @Silverfish indicates, the Excel issues are not on topic here (and arguably Excel is one of the last tools you should consider using for this task), but there might be interest in helping out with the overall problem of converting your hardcopy reliably into electronic data--can you supply some specifics? – whuber Aug 11 '16 at 12:55

1 Answers1

0

Assuming you have a regular arrangement of rows, i.e. every set of four rows should be split across 4 columns, then put your data in one column. The following VBA code: Transpose every 5 or n rows from one column to multiple columns does the job, changing 5 to 4, or the number of lines required. Here is the input and output:

Reshaped data

Alternatively, you can use How to convert multiple rows and columns to columns and rows in Excel, which I did not try.