I'm looking to parse a large number of lines of repetitive but unstructured data. In my experience this task comes up at least once per project, so I'm looking for a tool that transforms fairly standard text into structured data. Right now I just use a combination of regex find-and-replace and one-off Python scripts.
Here's a clean example:
Distance: 25.903 miles*
Morgan Road Middle School Extension
HEPHZIBAH GA, 30815
Telephone: 706.504.4071
A unit of: Boys & Girls Clubs of Augusta
http://www.bgcaugusta.org
And here's a slightly messier example:
Maria Teresa’s Babies Early Enrichment Center/Daycare 825 23rd Street South Arlington, VA 22202 703-979-BABY (2229) 22. Maria Teresa Desaba, Owner/Director; Tony Saba, Org. Director. Website: www.mariateresasbabies.com Serving children 6 wks to 5yrs full-time. National Science Foundation Child Development Center 23. 4201 Wilson Blvd., Suite 180 22203 703-292-4794 Website: www.brighthorizons.com 112 children, ages 6 wks - 5 yrs. 7:00 a.m. – 6:00 p.m. Summer Camp for children 5 - 9 years.
These are only examples. Parsing data that is unstructured but nonetheless repetitive is something I run into fairly often, especially when I receive text or Word documents in response to FOIA requests. Mostly I'm wondering whether someone has written a tool or library that's good at converting these documents into structured data, or whether I should be thinking about how to write something myself.
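For reference, a minimal sketch of the kind of one-off regex script described above, applied to the clean example, might look like the following. The field names (`distance_miles`, `parent_org`, and so on), the `parse_record` helper, and the rule that the first unlabeled line is the name are all assumptions made for illustration, not part of any existing tool.

```python
import re

# Sample record copied from the clean example above.
RECORD = """Distance: 25.903 miles*
Morgan Road Middle School Extension
HEPHZIBAH GA, 30815
Telephone: 706.504.4071
A unit of: Boys & Girls Clubs of Augusta
http://www.bgcaugusta.org"""

# Hypothetical field names; each pattern is tried against one whole line.
LINE_PATTERNS = [
    ("distance_miles", re.compile(r"^Distance:\s*([\d.]+)\s*miles")),
    ("phone", re.compile(r"^Telephone:\s*(.+)$")),
    ("parent_org", re.compile(r"^A unit of:\s*(.+)$")),
    ("url", re.compile(r"^(https?://\S+)$")),
    ("city_state_zip", re.compile(r"^(.+?)\s+([A-Z]{2}),\s*(\d{5})$")),
]


def parse_record(text):
    """Turn one multi-line record into a dict of named fields."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        for field, pattern in LINE_PATTERNS:
            match = pattern.match(line)
            if match:
                if field == "city_state_zip":
                    record["city"], record["state"], record["zip"] = match.groups()
                else:
                    record[field] = match.group(1)
                break
        else:
            # Assumption: the first line matching no pattern is the name.
            record.setdefault("name", line)
    return record


if __name__ == "__main__":
    for key, value in parse_record(RECORD).items():
        print(f"{key}: {value}")
```

This works only because every field in the clean example sits on its own line with a recognizable label or shape. The messier example, where records run together on one line, would need a different splitting strategy, and that is the part a general-purpose tool would have to handle.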