I am trying to take a user's input, given as octal-escaped UTF-8 bytes, and convert it to normal UTF-8 characters. The input is taken from a text field (field) in tkinter, and this is how I am processing it:

lines = self.field.get('1.0', END).split('\n')
print(bytes(lines[0], 'utf-8').decode('unicode_escape'))

For the example input \350\260\242 this prints "è°¢" when it should print 谢.

b'\350\260\242'.decode('utf-8')

returns the correct character, but this is useless to me because I am working with a user's input rather than a bytes literal. Is there any way to take a user's input directly as bytes, or is there a better way to do the decoding? Any help is appreciated.

User9123

1 Answer

Yeah, unicode_escape is a bit weird in that it converts from a bytestring of escape sequences to a unicode string (which makes sense, since that's what it's for), so each octal escape comes out as its own code point instead of a raw byte. You could use the "round-trip through latin-1 mojibake" trick:

>>> br'\350\260\252'.decode('unicode_escape')
'è°ª'
>>> _.encode('l1').decode('u8')
'谪'

(Which works because latin-1 is a 1-to-1 mapping of the first 256 code points.)
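
Applied to a string read from a widget rather than a bytes literal, the same round trip might look like the sketch below. The function name decode_escaped_utf8 is made up for illustration, and it assumes the typed text is plain ASCII (backslashes and digits), which is why it encodes with 'ascii' first.

```python
def decode_escaped_utf8(text: str) -> str:
    """Turn text such as r'\\350\\260\\242' into the character it encodes.

    Hypothetical helper; assumes `text` contains only ASCII characters.
    """
    # Step 1: interpret the backslash escapes. Each octal escape becomes
    # one code point in the range 0-255 (this is the mojibake stage).
    mojibake = text.encode('ascii').decode('unicode_escape')
    # Step 2: latin-1 maps code points 0-255 to the bytes of the same value,
    # so this recovers the raw bytes the user meant.
    raw = mojibake.encode('latin-1')
    # Step 3: decode those raw bytes as UTF-8.
    return raw.decode('utf-8')


print(decode_escaped_utf8(r'\350\260\242'))
```

In the question's code, the argument would be the line taken from the field instead of the raw-string literal used here.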

And there's also the undocumented codecs.escape_decode:

>>> import codecs
>>> codecs.escape_decode(br'\350\260\252')[0].decode()
'谪'

Naturally, both of these codecs are tailored to Python's string-literal syntax in particular, so they will also interpret other escapes such as \n and \x41; you'll have to roll your own if you want to handle just octal escapes.
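
One way to roll your own is a regular expression that only touches backslash-plus-octal-digits and leaves everything else alone. This is a rough sketch, not a vetted implementation: decode_octal_utf8 and _OCTAL_ESCAPE are names invented here, and it assumes an escape is a backslash followed by one to three octal digits with no special handling of doubled backslashes.

```python
import re

# A backslash followed by 1-3 octal digits, matched on bytes.
_OCTAL_ESCAPE = re.compile(rb'\\([0-7]{1,3})')


def decode_octal_utf8(text: str) -> str:
    """Replace octal escapes with the bytes they name, then decode as UTF-8.

    Hypothetical helper; other escapes such as \\n are left untouched.
    """
    def to_byte(match):
        value = int(match.group(1), 8)
        if value > 0xFF:
            # \400 through \777 do not fit in a single byte.
            raise ValueError('octal escape out of byte range: %r' % match.group(0))
        return bytes([value])

    # Work on bytes so that literal non-ASCII text and escaped bytes can mix.
    raw = _OCTAL_ESCAPE.sub(to_byte, text.encode('utf-8'))
    return raw.decode('utf-8')


print(decode_octal_utf8(r'\350\260\242'))
```

Invalid byte sequences still raise UnicodeDecodeError from the final decode, which is probably what you want for user input.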

Josh Lee