
Possible Duplicate:
Decode HTML entities in Python string?

I have a malformed string in Python:

Muhammad Ali&#39;s fight with Larry Holmes

where &#39; is an apostrophe.

Firstly, what representation is this: &#39;? Secondly, how can I parse the string in Python so that it replaces &#39; with '?

Bruce
  • This looks like an HTML entity for the character with code 39 (which would make it easy to parse and reassemble using `chr()`). However, there is also a large number of symbolic HTML entities like `&amp;` (`&`), which you'd probably also want to consider. – Kos Nov 13 '11 at 20:17
  • @All: I did not know how to search for an answer because I did not know what to search for. – Bruce Nov 13 '11 at 20:20

2 Answers


The Python 2 standard library's `HTMLParser` module can decode HTML entities in strings:

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> s = h.unescape('&copy; 2010')
>>> s
u'\xa9 2010'
>>> print s
© 2010
>>> s = h.unescape('&#169; 2010')
>>> s
u'\xa9 2010'

A range of solutions is described here: http://fredericiana.com/2010/10/08/decoding-html-entities-to-text-in-python/
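
On Python 3.4 and later, the same decoding is available as `html.unescape` in the standard library; the `HTMLParser.unescape` method was deprecated and then removed in Python 3.9. A minimal sketch, with variable names chosen purely for illustration:

```python
import html

# html.unescape decodes named (&copy;) and numeric (&#169;, &#39;) references alike
title = html.unescape("Muhammad Ali&#39;s fight with Larry Holmes")
copyright_named = html.unescape("&copy; 2010")
copyright_numeric = html.unescape("&#169; 2010")

print(title)
print(copyright_named == copyright_numeric)
```

Both copyright strings decode to the same text, since `&copy;` and `&#169;` refer to the same character.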

Acorn

`&#CHAR-CODE;` is the syntax for special characters in HTML (maybe elsewhere, but I'm not sure). There may be a more complete way to do this, but you could simply replace it with:

mystring = "Muhammad Ali&#39;s fight with Larry Holmes"
# Replace the numeric entity for the apostrophe with a literal apostrophe
print(mystring.replace("&#39;", "'"))

Yields:

Muhammad Ali's fight with Larry Holmes
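
If other decimal character codes may appear, the same idea can be generalized with a regular expression and `chr()`, as the comment on the question suggests. This is only a sketch for Python 3: `replace_numeric_entities` is a hypothetical helper name, and it deliberately leaves named entities such as `&amp;` and hexadecimal references untouched.

```python
import re

def replace_numeric_entities(text):
    """Replace decimal references such as &#39; with the character they encode."""
    # Hypothetical helper: only the &#<digits>; form is handled here.
    # chr() raises ValueError for codes outside the valid Unicode range.
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)

print(replace_numeric_entities("Muhammad Ali&#39;s fight with Larry Holmes"))
```

For anything beyond this narrow case, a full entity decoder from the standard library is likely the safer choice.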

Adam Wagner