
I have some HTML text: If I&#39;m reading lots of articles

I am trying to replace &#39; and other such character references with the corresponding Unicode characters, such as '. I tried

rawtxt.encode('utf-8').encode('ascii','ignore') 

but it fails with this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2

kennytm
Harshit
  • It looks like this is not really the code that produces the error because the error comes from trying to decode the string as ascii. Where does rawtxt come from? – Sarien May 16 '13 at 12:01
  • @Sarien: it is the code that produces the error. You can get a decode error in a call to `encode`. See: http://chat.stackoverflow.com/rooms/10/conversation/python2-decode-error-when-encoding – R. Martinho Fernandes May 16 '13 at 13:04
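
The mechanism described in the comment above can be sketched roughly as follows. In Python 2, calling `.encode()` on a byte string (`str`) first decodes it implicitly with the default ASCII codec, and that hidden step is what raises. The sketch is written for Python 3, where the implicit step has to be spelled out. The helper `py2_style_encode` and the sample bytes are assumptions made for illustration (a UTF-8 right single quotation mark, whose first byte is 0xe2), not the asker's actual data.

```python
# Assumed sample: UTF-8 bytes containing U+2019 (right single quotation mark),
# which encodes to the three bytes 0xe2 0x80 0x99.
raw_bytes = "If I\u2019m reading".encode("utf-8")


def py2_style_encode(data, encoding):
    """Hypothetical helper that mimics Python 2's str.encode on a byte string."""
    text = data.decode("ascii")  # the hidden decode step that fails on byte 0xe2
    return text.encode(encoding)


try:
    py2_style_encode(raw_bytes, "ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

This is why a call that looks like pure encoding can still report a decode error.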

1 Answer


You're having problems with HTML entities, not Unicode or UTF-8. Try this (Python 2):

import HTMLParser

# unescape() turns HTML character references such as &#39; into the characters they stand for
h = HTMLParser.HTMLParser()
s = h.unescape("If I&#39;m reading lots of articles")
print s

This prints If I'm reading lots of articles.
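
For readers on Python 3, where the `HTMLParser` module was renamed, a minimal sketch of the same idea uses the standard library function `html.unescape` (available since Python 3.4). The input string is an assumption: it is the question's sentence with the apostrophe written as the numeric character reference `&#39;`.

```python
import html

# Assumed input: the apostrophe is written as the character reference &#39;.
escaped = "If I&#39;m reading lots of articles"

# html.unescape converts named and numeric character references to Unicode text.
unescaped = html.unescape(escaped)
print(unescaped)
```

The result is an ordinary `str`, so no further `encode` call is needed unless bytes are required for output.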

likeitlikeit