I have nine spreadsheets describing a total of 33,401 unique events in the greater Chicago area, and I have been asked to geocode as many of them as possible. I am no stranger to geocoding, but the location information here is the worst I have ever seen: each location is written as free text in a single field, with no consistent convention.
I have no ZIP codes or city names, but I do have county names in almost every case. When street names are included, they are often missing their suffix ("Ave", "St", "Rd"). State and US highways are both frequently written as just "Rte" or "Rt", with nothing to distinguish one from the other. The majority of locations are written as intersections (often with extra, irrelevant information), such as:
SB Pulaski & 162nd St.
I-55 @ Rt.30
Devon and Cicero (Il 50) NW corner TS
NB Rt.41 @ Half Day Rd. Exit Ramp.
In the case of Interstates, these "intersections" often do not actually exist; the record is just referencing a street that the Interstate passes over. A fair number of records have (relatively) proper addresses:
1800 s Wolf rd. south of Oakton, north of Touhy.
1010 S. Rt. 14 - in front of Thunderbird Country C
Grayslake Maintenance Yard, 217 N. Baron, Grayslak
Some are vaguer, but still reasonably well specified:
South bound Busse rd south of Oakton and Higgins
EB Elgin-O'Hare W of Rohlwing Rd
NB IL-59, 1 mile north of IL-132
And some are almost certainly impossible to locate without additional context:
EB Elgin O'Hare expresway
Prairie View Rest Area
Comm Center/Stevenson Yard
My question: given the wide range of formats and the anything-goes approach to specifying locations in these datasets, are there any recommended methods for parsing at least some of these records into a reasonably clean set of geocodable addresses? So far I have been stumped, and I have been going through the painful process of making sense of individual records in Google Maps. I want to cut out as much manual work as possible, as I'd prefer not to spend the next three years on this.
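To make the problem concrete, here is a minimal sketch of the kind of first-pass triage that might apply: strip a leading travel-direction prefix, then bucket each record by rough shape so that each bucket can be handled (or set aside) separately. The category names, the regular expressions, and the `classify` function are all illustrative assumptions based only on the sample records above, not a tested rule set for the full 33,401 rows.

```python
import re
from collections import Counter

# All patterns below are hypothetical, derived only from the sample records.
# Leading travel direction: "NB", "SB", "South bound", "Eastbound", ...
DIRECTION_PREFIX = re.compile(
    r"^\s*(NB|SB|EB|WB|(?:north|south|east|west)\s*bound)\b\s*", re.IGNORECASE
)
# House number followed by a directional, e.g. "1800 s Wolf", "217 N. Baron".
# Addresses without a directional are deliberately not matched here.
ADDRESS = re.compile(r"\b\d{1,5}\s+[NSEW]\.?\s+\w+", re.IGNORECASE)
# Offset from a reference street, e.g. "south of Oakton", "W of Rohlwing".
RELATIVE = re.compile(r"\b(north|south|east|west|[NSEW])\s+of\b", re.IGNORECASE)
# Two roads joined by "&", "@" or "and".
INTERSECTION = re.compile(r"\s(?:&|@|and)\s", re.IGNORECASE)


def classify(raw: str) -> str:
    """Return a rough bucket name for one free-text location record."""
    text = DIRECTION_PREFIX.sub("", raw.strip())
    if ADDRESS.search(text):
        return "address"
    if RELATIVE.search(text):
        return "relative"
    if INTERSECTION.search(text):
        return "intersection"
    return "unknown"


samples = [
    "SB Pulaski & 162nd St.",
    "I-55 @ Rt.30",
    "1800 s Wolf rd. south of Oakton, north of Touhy.",
    "NB IL-59, 1 mile north of IL-132",
    "Prairie View Rest Area",
]

# Count how many records land in each bucket.
print(Counter(classify(s) for s in samples))
```

The order of the checks matters: a record such as "1800 s Wolf rd. south of Oakton" contains both an address and a relative description, and testing for the address first keeps the more geocodable reading. Anything that falls through to "unknown" would still need manual review.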