Parsing malformed string in python

Question

Possible Duplicate:
Decode HTML entities in Python string?

I have a malformed string in Python:

Muhammad Ali&#39;s fight with Larry Holmes

where &#39; is an apostrophe.

Firstly, what representation is this: &#39;? Secondly, how can I parse the string in Python so that it replaces &#39; with '?

This looks like an HTML entity for the character with code 39 (which would make it easy to parse and reassemble using `chr()`). However, there is also a large number of symbolic HTML entities like `&amp;` (`&`) which you'd probably also want to consider. — Kos, Nov 13 '11 at 20:17
@All: I did not know how to search for an answer because I did not know what to search. — Bruce, Nov 13 '11 at 20:20
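To make Kos's comment concrete, here is a minimal sketch (Python 3) of decoding by hand: numeric references go through `chr()`, and symbolic ones are looked up in the standard library's `html.entities.name2codepoint` table. The names `decode_entities` and `_ENTITY_RE` are made up for this illustration, and the pattern only covers well-formed references that end in a semicolon, so treat it as a way to see the mechanism rather than a complete decoder.

```python
import re
from html.entities import name2codepoint

# Matches decimal (&#39;), hexadecimal (&#x27;) and named (&amp;) references.
_ENTITY_RE = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def decode_entities(text):
    """Replace HTML character references in text with the characters they name."""
    def _replace(match):
        body = match.group(1)
        try:
            if body.startswith(("#x", "#X")):
                return chr(int(body[2:], 16))   # hexadecimal code point
            if body.startswith("#"):
                return chr(int(body[1:]))       # decimal code point
        except (ValueError, OverflowError):
            return match.group(0)               # code point out of range: leave as-is
        if body in name2codepoint:
            return chr(name2codepoint[body])    # symbolic entity such as amp or copy
        return match.group(0)                   # unknown name: leave as-is

    return _ENTITY_RE.sub(_replace, text)


print(decode_entities("Muhammad Ali&#39;s fight with Larry Holmes"))
```

Unknown names such as `&bogus;` are left untouched, which seems the safer choice when the input may contain literal ampersands.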

score 5 · Accepted Answer · answered Nov 13 '11 at 20:20

The Python Standard Library's HTMLParser is able to decode HTML entities in strings.

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> s = h.unescape('&copy; 2010')
>>> s
u'\xa9 2010'
>>> print s
© 2010
>>> s = h.unescape('&#169; 2010')
>>> s
u'\xa9 2010'

A range of solutions are described here: http://fredericiana.com/2010/10/08/decoding-html-entities-to-text-in-python/
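The session above is Python 2. On Python 3 the module was renamed to `html.parser`, and `HTMLParser.unescape()` was deprecated in 3.4 and removed in 3.9. The equivalent call there is `html.unescape()` (available since Python 3.4). A short sketch, assuming Python 3.4 or later; the variable names are only for illustration:

```python
import html

# Numeric character reference, as in the question.
title = html.unescape("Muhammad Ali&#39;s fight with Larry Holmes")
print(title)

# Named and numeric forms of the same character decode to the same string.
named = html.unescape("&copy; 2010")
numeric = html.unescape("&#169; 2010")
```

Here `title` comes out as the plain sentence with a real apostrophe, and both `named` and `numeric` hold the same value as `u'\xa9 2010'` in the Python 2 session.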

Adam Wagner · Answer 2 · 2011-11-13T20:24:31.647

1

The &#CHAR-CODE; form is a syntax for special characters in HTML (maybe elsewhere, but I'm not sure). There may be a more complete way to do this, but you could simply replace it with:

mystring = "Muhammad Ali&#39;s fight with Larry Holmes"
# Only handles this one entity; the parentheses make it run on Python 2 and 3.
print(mystring.replace("&#39;", "'"))

Yields:

Muhammad Ali's fight with Larry Holmes

edited Nov 13 '11 at 20:24

answered Nov 13 '11 at 20:17

Adam Wagner
