1

I have a sequence of raw unicode that was saved into a str variable:

s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

I need to get the equivalent bytes object (for pickle.loads):

s_bytes: bytes = b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'

The solution posted here,

s_new: bytes = bytes(s_str, encoding="raw_unicode_escape")

does not work for me. Instead of the desired s_bytes, I get

s_not_bytes = b'\\x00\\x01\\x00\\xc0\\x01\\x00\\x00\\x00\\x04'

which has a double backslash (representing one literal backslash character) wherever the desired result has a single one.

Similar solutions are proposed here and here, but they do not work for me either: I end up with the double backslashes again. Does anyone have an idea of why this might be happening?

Thank you.

  • Besides the solutions in the answers, [this answer](https://stackoverflow.com/a/49990817/16153744) also works. – Tereso del Río Almajano Sep 07 '21 at 11:53
  • What does "byte literal of that unicode" mean? Unicode has only code points, with no byte representation (it is abstract). It also defines a few encodings, so you should specify which encoding you mean. Note: your initial string is already problematic: what do you mean by `\x`? What do you mean by `\xc0`? `\x` should not be used in Unicode strings (only in encoded strings or binary data). For Unicode, just use code points (\u and \U). I think your main problem is that you are mixing too many concepts (in a non-recommended way), so it is easy to get it wrong. – Giacomo Catenazzi Sep 07 '21 at 12:27
  • It is not possible to get `s_not_bytes` (the result of `s_new`) from `s_str` as you have shown. Run `print(repr(s_str))` and post the output. – Mark Tolonen Sep 07 '21 at 16:19

3 Answers

1

You do not have byte escape codes as shown below (length 9) or you wouldn't get the s_not_bytes result:

s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

You have literal escape codes (length 36). Note the r prefix for a raw string, which prevents the escape codes from being interpreted:

s_str: str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

Note the difference. \\ is an escape code indicating a literal, single backslash:

>>> '\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
'\x00\x01\x00À\x01\x00\x00\x00\x04'
>>> r'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
'\\x00\\x01\\x00\\xc0\\x01\\x00\\x00\\x00\\x04'
>>> len('\x00\x01\x00\xc0\x01\x00\x00\x00\x04')
9
>>> len(r'\x00\x01\x00\xc0\x01\x00\x00\x00\x04')
36

The following gets the desired byte string. It first converts each code point to a byte using the latin1 codec, which maps 1:1 between the first 256 code points (U+0000 to U+00FF) and the byte values 0x00 to 0xFF. It then decodes the literal escape codes with unicode_escape, which yields a Unicode string again, so a final latin1 encode converts 1:1 back to bytes:

s_bytes: bytes = s_str.encode('latin1').decode('unicode_escape').encode('latin1')
print(s_bytes)

Output:

b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
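To see what each stage of that chain produces, here is a minimal sketch that splits it into separate steps. The function name `escaped_text_to_bytes` and the `sample` variable are hypothetical, chosen only for illustration; the codecs are the ones used above. It assumes the input holds nothing but escape-like text whose values fit in one byte.

```python
def escaped_text_to_bytes(text: str) -> bytes:
    """Turn text containing literal escapes such as \\x00 into the bytes they describe."""
    # Step 1: assumes every character is below U+0100, so latin1 maps it 1:1 to a byte.
    raw = text.encode("latin1")
    # Step 2: unicode_escape interprets the backslash sequences, giving a str again.
    unescaped = raw.decode("unicode_escape")
    # Step 3: assumes the decoded code points are below U+0100; map them 1:1 to bytes.
    return unescaped.encode("latin1")


# Hypothetical sample: 36 characters of escape-like text, as it might be read from a file.
sample = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
result = escaped_text_to_bytes(sample)
print(len(sample), len(result))
print(result)
```

If the text contains a character above U+00FF, step 1 raises UnicodeEncodeError, which is a useful signal that the input is not the kind of data this chain is meant for.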

If you did have s_str as posted, a simple .encode('latin1') would convert it:

>>> s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
>>> s_str.encode('latin1')
b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
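Since the question mentions pickle.loads, here is a small sketch of why the latin1 round trip is safe for that purpose. The pickled dictionary is a made-up example, not the asker's data.

```python
import pickle

payload = pickle.dumps({"answer": 42})  # real pickle bytes
as_text = payload.decode("latin1")      # one code point per byte, nothing is lost
restored = as_text.encode("latin1")     # back to the identical bytes

print(restored == payload)
print(pickle.loads(restored))
```

Because latin1 assigns a code point to every byte value, decoding and re-encoding with it cannot alter the data, which is not true of codecs such as UTF-8.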
Mark Tolonen
  • Thanks, this solves the issue. I was reading this from a file using ```open(file,'r')``` and I guess that creates a raw string. – Tereso del Río Almajano Sep 08 '21 at 09:44
  • And is there a way of reading from a file containing (raw text) either ```b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'``` or ```\x00\x01\x00\xc0\x01\x00\x00\x00\x04``` so that it will be considered directly a string of bytes of length 9? – Tereso del Río Almajano Sep 08 '21 at 09:53
  • @TeresodelRíoAlmajano Reading a file doesn’t create a raw string. Raw strings are a way of creating string literals in code without interpreting escape codes. Your file had text with escape-code-like text. You can `open(file,encoding='unicode_escape')` if needed, but it would be better to post an actual sample of the file in case there is a better solution. – Mark Tolonen Sep 08 '21 at 11:45
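For the two file layouts asked about in the comments, one possible approach is sketched below. The helper name `parse_escaped` is hypothetical, and the two strings stand in for whatever `open(path).read()` would return, since the actual file was never posted. The `b'...'` form is a valid Python bytes literal, so `ast.literal_eval` can parse it; the bare form goes through the codec chain from the answer.

```python
import ast


def parse_escaped(text: str) -> bytes:
    """Convert escape-like text (with or without a b'...' wrapper) to bytes."""
    text = text.strip()
    if text[:2] in ("b'", 'b"'):
        # The text is a bytes literal; literal_eval parses it without running code.
        value = ast.literal_eval(text)
        if not isinstance(value, bytes):
            raise ValueError("expected a bytes literal")
        return value
    # Bare escapes: same codec chain as in the answer.
    return text.encode("latin1").decode("unicode_escape").encode("latin1")


# Stand-ins for the result of open(path).read() on each kind of file.
wrapped = r"b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'" + "\n"
bare = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04" + "\n"

print(parse_escaped(wrapped))
print(parse_escaped(bare))
```

Note that `strip()` also removes leading and trailing whitespace that might be part of the data, so this sketch only suits files where the escape text is the whole content.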
0

I was about to post the question when I came across a valid solution almost by chance. The combination that works for me is:

s_new: bytes = bytes(s_str.encode('utf-8').decode('unicode-escape'), encoding="oem")

I have no idea why this works, so feel free to explain it if you know.
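One thing worth knowing about that combination: the "oem" codec exists only on Windows and encodes with the system's OEM code page, so the final step depends on the machine. The sketch below compares latin1 with cp850 and cp437, which are picked here only as examples of OEM code pages and are not necessarily what the asker's system uses. The helper `try_encode` is hypothetical.

```python
def try_encode(text: str, codec: str):
    """Return text encoded with codec, or None if the codec cannot represent it."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


char = "\xc0"  # U+00C0, the one character above U+007F in the question's data

for codec in ("latin1", "cp850", "cp437"):
    print(codec, try_encode(char, codec))
```

Only latin1 is guaranteed to turn U+00C0 into the single byte 0xC0; the other code pages either place that character elsewhere or do not contain it, so the latin1-based chain is the more portable choice.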

0

You might simply use .encode("utf-8"), though this yields 10 bytes rather than the desired 9, because U+00C0 is encoded as the two bytes \xc3\x80:

s_1 = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
s_2 = s_1.encode("utf-8")
print(s_2)

Output:

b'\x00\x01\x00\xc3\x80\x01\x00\x00\x00\x04'
Daweo
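As a quick check of how the two codecs differ on this input, the following sketch encodes the same string both ways and compares lengths. The variable names `as_utf8` and `as_latin1` are illustrative only.

```python
s_1 = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

as_utf8 = s_1.encode("utf-8")     # U+00C0 becomes the two bytes C3 80
as_latin1 = s_1.encode("latin1")  # U+00C0 becomes the single byte C0

print(len(s_1), len(as_utf8), len(as_latin1))
print(as_latin1)
```

UTF-8 matches latin1 only for code points up to U+007F, so any byte value from 0x80 upward needs latin1 to survive unchanged.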