Convert unicode string into byte string

Question

I have a sequence of raw unicode that was saved into a str variable:

s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

I need to get the byte literal of that unicode (for pickle.loads):

s_bytes: bytes = b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'

Elsewhere, the following solution was posted:

s_new: bytes = bytes(s_str, encoding="raw_unicode_escape")

but it does not work for me. Instead of the desired s_bytes, I get

s_not_bytes = b'\\x00\\x01\\x00\\xc0\\x01\\x00\\x00\\x00\\x04'

that has two backslashes (actually representing only one) for each one that it should have.

Similar solutions are proposed here and here, but they do not work for me either; I end up with the double backslashes again. Does anyone have an idea of why this might be happening?

Thank you.

Other than the solution in the answers, [this answer](https://stackoverflow.com/a/49990817/16153744) also works. — Tereso del Río Almajano, Sep 07 '21 at 11:53
What does "byte literal of that unicode" mean? Unicode has only code points, with no byte representation (it is abstract). It also defines a few encodings, so you should specify which encoding you want. Note: your initial string is already problematic: what do you mean by `\x`? What do you mean by `\xc0`? `\x` should not be used in Unicode strings (only in encoded strings or binary data). For Unicode, just use code points (\u and \U). I think your main problem is that you are mixing too many concepts (in a non-recommended way), so it is easy to get it wrong. — Giacomo Catenazzi, Sep 07 '21 at 12:27
It is not possible to get `s_not_bytes` (the result of `s_new`) from `s_str` as you have shown. `print(repr(s_str))` and post that. — Mark Tolonen, Sep 07 '21 at 16:19

Mark Tolonen · Answer 1 · 2021-09-07T16:41:27.817

You do not have byte escape codes as shown below (length 9) or you wouldn't get the s_not_bytes result:

s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

You have literal escape codes (length 36), and note the r for raw string that prevents interpreting the escape codes as bytes:

s_str: str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

Note the difference. \\ is an escape code indicating a literal, single backslash:

>>> '\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
'\x00\x01\x00À\x01\x00\x00\x00\x04'
>>> r'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
'\\x00\\x01\\x00\\xc0\\x01\\x00\\x00\\x00\\x04'
>>> len('\x00\x01\x00\xc0\x01\x00\x00\x00\x04')
9
>>> len(r'\x00\x01\x00\xc0\x01\x00\x00\x00\x04')
36

The following gets the desired byte string in three steps. First it converts each code point to a byte using the latin1 codec, which maps 1:1 between the first 256 code points (U+0000 to U+00FF) and the byte values 0x00 to 0xFF. Then it decodes the literal escape codes with unicode_escape, which yields a Unicode string again, so it encodes once more with latin1 to convert 1:1 back to bytes:

s_bytes: bytes = s_str.encode('latin1').decode('unicode_escape').encode('latin1')
print(s_bytes)

Output:

b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
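To make the chain above easier to follow, here is a minimal sketch that splits it into its three steps and prints the length after each one. The names step1, step2 and step3 are only illustrative and are not part of the original answer.

```python
# Literal backslash-x text, as it would arrive from a text file (36 characters).
s_str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

# Step 1: the text itself as bytes; still 36 bytes of backslashes, x's and digits.
step1 = s_str.encode("latin1")

# Step 2: interpret the escape codes; this yields a str of 9 code points.
step2 = step1.decode("unicode_escape")

# Step 3: map each code point (all below U+0100) to exactly one byte.
step3 = step2.encode("latin1")

print(len(s_str), len(step1), len(step2), len(step3))  # 36 36 9 9
print(step3)
```

Only step 2 changes the length; the two latin1 calls are plain 1:1 conversions between str and bytes.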

If you did have s_str as posted, a simple .encode('latin1') would convert it:

>>> s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
>>> s_str.encode('latin1')
b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'

Thanks, this solves the issue. I was reading this from a file using ```open(file,'r')``` and I guess that creates a raw string. — Tereso del Río Almajano, Sep 08 '21 at 09:44
And is there a way of reading from a file containing (as raw text) either ```b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'``` or ```\x00\x01\x00\xc0\x01\x00\x00\x00\x04``` so that it is read directly as a byte string of length 9? — Tereso del Río Almajano, Sep 08 '21 at 09:53
@TeresodelRíoAlmajano Reading a file doesn’t create a raw string. Raw strings are a way of creating string literals in code without interpreting escape codes. Your file had text with escape-code-like text. You can `open(file,encoding='unicode_escape')` if needed, but it would be better to post an actual sample of the file in case there is a better solution. — Mark Tolonen, Sep 08 '21 at 11:45
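For the two file layouts asked about in the comment above, one possible approach is sketched below. It assumes the file holds nothing but the escape-code text; parse_escaped is a hypothetical helper name, and the temporary files only stand in for the real file, whose exact contents were never posted.

```python
import ast
import os
import tempfile


def parse_escaped(text: str) -> bytes:
    """Hypothetical helper: turn escape-code text into real bytes."""
    text = text.strip()
    if text[:2] in ("b'", 'b"'):
        # The text looks like a Python bytes literal; literal_eval parses it
        # without executing arbitrary code.
        value = ast.literal_eval(text)
        if not isinstance(value, bytes):
            raise ValueError("expected a bytes literal")
        return value
    # Bare escape codes: interpret them, then map code points 1:1 to bytes.
    return text.encode("latin1").decode("unicode_escape").encode("latin1")


# The two layouts from the comment, written as they would appear in the file.
samples = [
    r"b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'",
    r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04",
]

results = []
for sample in samples:
    # Create a throwaway file that stands in for the real one.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="ascii"
    ) as f:
        f.write(sample + "\n")
        path = f.name
    try:
        with open(path, "r", encoding="latin1") as f:
            results.append(parse_escaped(f.read()))
    finally:
        os.remove(path)

print(results)
```

If the file can be regenerated, writing the real bytes with open(file, 'wb') and reading them back with 'rb' avoids the escape-code round trip entirely.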

score 0 · Answer 2 · answered Sep 07 '21 at 10:47

I was about to post the question when I encountered a valid solution almost by chance. The combination that works for me is:

s_new: bytes = bytes(s_str.encode('utf-8').decode('unicode-escape'), encoding="oem")

As I said, I have no idea why this works, so feel free to explain it if you know why.
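One possible explanation, offered as a sketch rather than a definitive account: the first two calls turn the literal escape text into 9 code points, and the final encode maps those code points to bytes. The "oem" codec exists only on Windows and follows the system's OEM code page, so what it produces depends on the machine. Assuming cp437 as a stand-in for that code page (an assumption; yours may differ), U+00C0 has no byte at all under a strict encode, whereas latin1 always gives 0xC0:

```python
# Literal escape-code text, as in the question's actual data (36 characters).
s_str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

# Same first two steps as above: interpret the escape codes.
decoded = s_str.encode("utf-8").decode("unicode_escape")

# latin1 maps U+00C0 to the single byte 0xC0 on every platform.
via_latin1 = decoded.encode("latin1")

# cp437 stands in for "oem" here (assumption); it has no byte for U+00C0.
try:
    via_cp437 = decoded.encode("cp437")
except UnicodeEncodeError:
    via_cp437 = None

print(via_latin1)
print(via_cp437)
```

For code points below U+0080 the two codecs agree, so the difference only shows up for values such as \xc0.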

score 0 · Answer 3 · answered Sep 07 '21 at 11:01

You might simply use .encode("utf-8") to get the desired result, i.e.:

s_1 = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
s_2 = s_1.encode("utf-8")
print(s_2)

output

b'\x00\x01\x00\xc3\x80\x01\x00\x00\x00\x04'

Daweo


No, that does not solve the problem of the double backslash. At least for me. – Tereso del Río Almajano Sep 07 '21 at 11:45
@TeresodelRíoAlmajano what version of python are you using? – Daweo Sep 07 '21 at 11:47
I am using Python 3.9. But the other answer explains why I was not getting the desired result. – Tereso del Río Almajano Sep 08 '21 at 09:37
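As a side note on the output shown in this answer, the sketch below compares utf-8 with latin1 on the 9-character string from the question. It is only an illustration; the variable names are not from the thread.

```python
# The 9-code-point string exactly as written in the question.
s_1 = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

as_utf8 = s_1.encode("utf-8")     # U+00C0 becomes the two bytes 0xC3 0x80
as_latin1 = s_1.encode("latin1")  # U+00C0 becomes the single byte 0xC0

print(len(as_utf8), as_utf8)      # 10 b'\x00\x01\x00\xc3\x80\x01\x00\x00\x00\x04'
print(len(as_latin1), as_latin1)  # 9 b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
```

The utf-8 result is one byte longer than the s_bytes value asked for, because every code point from U+0080 upward takes at least two bytes in utf-8.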
