15

I have this text:

'''Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New York, NY 12345. Can you contact him now? If you need any help, call me on 12345678'''

How can the address part be extracted from the above text using NLTK? I have tried the Stanford NER Tagger, which gives me only New York as a location. How can I solve this?

Akshat Zala
ngrj
  • Most people would give regular [expressions](https://docs.python.org/2/howto/regex.html) a try. Besides that, a short search on SO will give you plenty of [inspiration](http://stackoverflow.com/questions/14087116/extract-address-from-string). – patrick Jun 10 '16 at 21:22
  • Thanks ! That gave me something to start with. – ngrj Jun 13 '16 at 10:56
  • Accept the answer please – Alex Jun 26 '16 at 11:29
  • patrick, that one's in php – Rohmer Jun 13 '17 at 20:38
  • here's a pretty solid [python, nltk write up](https://medium.com/@acrosson/extracting-names-emails-and-phone-numbers-5d576354baa). i'll type it into an answer here with the summary after i implement it myself. – Rohmer Jun 13 '17 at 21:30

3 Answers

14

Definitely regular expressions :)

Something like

import re

txt = ...  # placeholder: the text from the question
regexp = r"[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
address = re.findall(regexp, txt)

# address == ['44 West 22nd Street, New York, NY 12345']

Explanation:

[0-9]{1,3}: 1 to 3 digits, the address number

(space): a space between the number and the street name

.+: street name, any character for any number of occurrences

,: a comma and a space before the city

.+: city, any character for any number of occurrences

,: a comma and a space before the state

[A-Z]{2}: exactly 2 uppercase chars from A to Z

[0-9]{5}: 5 digits

re.findall(regexp, txt) returns a list of all non-overlapping matches found in the string.
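If you also want the individual parts of the address, a variation of the pattern above with named groups might look like the sketch below. This is only an illustration, not part of the original answer: the group names and the `extract_addresses` helper are made up for the example, `[^,]+` is swapped in for `.+` so each field stops at the next comma, and it still assumes the same "number street, city, ST 12345" layout.

```python
import re

# Sample text copied from the question.
text = (
    "Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New York, "
    "NY 12345. Can you contact him now? If you need any help, call me on 12345678"
)

# Same layout as the pattern above, with named groups (names are illustrative).
# [^,]+ matches up to the next comma instead of the greedy .+, and \b keeps
# the pattern from matching inside a longer run of digits.
ADDRESS_RE = re.compile(
    r"\b(?P<number>[0-9]{1,3}) (?P<street>[^,]+), (?P<city>[^,]+), "
    r"(?P<state>[A-Z]{2}) (?P<zip>[0-9]{5})\b"
)


def extract_addresses(s):
    """Return one dict of address components per match found in s."""
    return [m.groupdict() for m in ADDRESS_RE.finditer(s)]


addresses = extract_addresses(text)
print(addresses)
```

Addresses that do not follow this layout (no state code, a ZIP+4 code, a house number longer than three digits) will not match, so treat it as a starting point rather than a general solution.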

Alex
6

pyap works well not only for this particular example but also for other addresses embedded in text.

import pyap

text = ...  # placeholder: the text from the question
addresses = pyap.parse(text, country='US')
Bhio
3

Check out libpostal, a library dedicated to parsing and normalizing street addresses.

It cannot extract an address from raw text, but it may help in related tasks.

jujule
  • Libpostal is used for normalising strings that have already been identified as addresses, which is a completely different task. – Boris Jul 16 '20 at 10:05