I have two data frames in R, each with a character column holding an address. In most cases these strings are not an exact match but are fairly close, for example "510 East Bonham St" and "510 e bonham". I'd like to merge these datasets on partially matched strings (or full matches, if any) between the two columns.
I used gsub to make the columns somewhat more comparable, i.e., removing commas and periods, lowercasing, etc.
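A minimal sketch of this kind of normalization, for illustration only. The helper name `clean_address` and the abbreviation map are assumptions, not the exact code used; the map in particular would need extending for real data.

```r
# Hypothetical helper: normalize address strings before comparing them.
clean_address <- function(x) {
  x <- tolower(x)
  # Replace commas, periods and other punctuation with a space.
  x <- gsub("[[:punct:]]", " ", x, perl = TRUE)
  # Illustrative abbreviation map (an assumption; extend as needed).
  abbr <- c(east = "e", west = "w", north = "n", south = "s",
            street = "st", avenue = "ave", road = "rd", drive = "dr")
  for (full in names(abbr)) {
    # \\b restricts the replacement to whole words.
    x <- gsub(paste0("\\b", full, "\\b"), abbr[[full]], x, perl = TRUE)
  }
  # Collapse repeated whitespace and trim the ends.
  x <- gsub("\\s+", " ", x, perl = TRUE)
  trimws(x)
}

clean_address(c("510 East Bonham St.", "510 e bonham"))
```

Even after this step the two example strings still differ by the trailing "st", which is why an exact merge on the cleaned column is not enough.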
I have tried %like%, which works fairly well but often picks up a seemingly trivial pattern in the first string and matches it to nearly every string in the column from the second dataset.
Additionally, both data frames have columns for the city, county, and state of these observations, and I used these columns in conditional statements for some additional robustness. For example, %like% might match a string to, say, 10 address observations from the second data frame; I would then filter so that the (cleaned) city, county, and state variables were equal (%in%). This gave me a very small number of matched observations, which I know shouldn't be the case.
I also tried fuzzyjoin, with similar results, and am unsure what I am doing wrong.
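To make the idea concrete, here is a rough sketch of the same logic in the opposite order: first restrict candidates to rows that agree on location, then score each candidate pair by edit distance instead of pattern matching. It uses only base R (`merge` and `utils::adist`). The toy data, the column names (`address`, `city`, `state`, `id1`, `id2`) and the 0.4 threshold are all assumptions for illustration, not values from my actual data.

```r
# Toy data; all values and column names are made up for illustration.
df1 <- data.frame(id1 = 1:2,
                  address = c("510 e bonham st", "12 main st"),
                  city = c("san antonio", "austin"),
                  state = c("tx", "tx"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(id2 = 1:3,
                  address = c("510 e bonham", "12 main street", "510 w bonham st"),
                  city = c("san antonio", "austin", "dallas"),
                  state = c("tx", "tx", "tx"),
                  stringsAsFactors = FALSE)

# 1. Block: only pair rows that agree exactly on city and state.
pairs <- merge(df1, df2, by = c("city", "state"), suffixes = c("_1", "_2"))

# 2. Score each candidate pair with Levenshtein distance,
#    scaled by the length of the longer string.
pairs$dist <- mapply(function(a, b) adist(a, b)[1, 1],
                     pairs$address_1, pairs$address_2, USE.NAMES = FALSE)
pairs$norm_dist <- pairs$dist /
  pmax(nchar(pairs$address_1), nchar(pairs$address_2))

# 3. Keep the closest candidate per df1 row, if it is under the threshold.
pairs <- pairs[order(pairs$id1, pairs$norm_dist), ]
best <- pairs[!duplicated(pairs$id1), ]
matched <- best[best$norm_dist <= 0.4, ]  # threshold is an arbitrary assumption
matched[, c("id1", "id2", "address_1", "address_2", "norm_dist")]
```

Blocking this way requires the location columns to match exactly, so any inconsistency in how city or county is spelled between the two data frames would silently drop candidates, which may be related to the small match counts I am seeing.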
Is there any function or package that can help with this task, or perhaps another way to go about this?
Thanks!
Another post asked a similar question; however, its context was deciding which of two address columns was the better one.