3337

I'm using this code to get standard output from an external program:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]

The communicate() method returns an array of bytes:

>>> command_stdout
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n'

However, I'd like to work with the output as a normal Python string, so that I could print it like this:

>>> print(command_stdout)
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2

I thought that's what the binascii.b2a_qp() method is for, but when I tried it, I got the same byte array again:

>>> binascii.b2a_qp(command_stdout)
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n'

How do I convert the bytes value back to string? I mean, using the "batteries" instead of doing it manually. And I'd like it to be OK with Python 3.

Zoe stands with Ukraine
  • 25,310
  • 18
  • 114
  • 149
Tomas Sedovic
  • 39,061
  • 9
  • 37
  • 30
  • 128
    why doesn't `str(text_bytes)` work? This seems bizarre to me. – Charlie Parker Mar 14 '19 at 22:25
  • 49
    @CharlieParker Because `str(text_bytes)` can't specify the encoding. Depending on what's in text_bytes, `text_bytes.decode('cp1250')` might result in a very different string to `text_bytes.decode('utf-8')`. – Craig Anderson Mar 31 '19 at 17:32
  • 13
    so the `str` function does not convert to a real string anymore. One HAS to give an encoding explicitly, for some reason I am too lazy to read through. Just convert it to `utf-8` and see if your code works, e.g. `var = var.decode('utf-8')` – Charlie Parker Apr 22 '19 at 23:32
  • 13
    @CraigAnderson: `unicode_text = str(bytestring, character_encoding)` works as expected on Python 3. Though `unicode_text = bytestring.decode(character_encoding)` is more preferable to avoid confusion with just `str(bytes_obj)` that produces a text representation for `bytes_obj` instead of decoding it to text: `str(b'\xb6', 'cp1252') == b'\xb6'.decode('cp1252') == '¶'` and `str(b'\xb6') == "b'\\xb6'" == repr(b'\xb6') != '¶'` – jfs Apr 12 '20 at 05:11

23 Answers

5155

You need to decode the bytes object to produce a string:

>>> b"abcde"
b'abcde'

# utf-8 is used here because it is a very common encoding, but you
# need to use the encoding your data is actually in.
>>> b"abcde".decode("utf-8") 
'abcde'

See: https://docs.python.org/3/library/stdtypes.html#bytes.decode
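
As a small illustration of why the encoding matters, here is a sketch that decodes one byte sequence with several codecs. The sample bytes and variable names are made up for this example.

```python
# Hypothetical sample: the UTF-8 encoding of 'café' (é is the two bytes C3 A9).
raw = b"caf\xc3\xa9"

as_utf8 = raw.decode("utf-8")      # 'café': the encoding the data is really in
as_cp1252 = raw.decode("cp1252")   # 'cafÃ©': wrong codec, no error, wrong text

# A codec that cannot map the bytes at all raises instead of guessing.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(len(as_utf8), len(as_cp1252), ascii_failed)
```

The wrong single-byte codec succeeds silently, which is why knowing the real encoding matters more than merely avoiding an exception.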

Aaron Maenpaa
  • 113,855
  • 10
  • 93
  • 108
  • Yes, but given that this is the output from a windows command, shouldn't it instead be using ".decode('windows-1252')" ? – mcherm Jul 18 '11 at 19:48
  • 84
    Using `"windows-1252"` is not reliable either (e.g., for other language versions of Windows), wouldn't it be best to use `sys.stdout.encoding`? – nikow Jan 03 '12 at 15:20
  • 20
    Maybe this will help somebody further: Sometimes you use byte array for e.x. TCP communication. If you want to convert byte array to string cutting off trailing '\x00' characters the following answer is not enough. Use b'example\x00\x00'.decode('utf-8').strip('\x00') then. – Wookie88 Apr 16 '13 at 13:27
  • 3
    I've filed a bug about documenting it at http://bugs.python.org/issue17860 - feel free to propose a patch. If it is hard to contribute, comments on how to improve it are welcome. – anatoly techtonik Apr 28 '13 at 14:40
  • 1
    what other decoding options does the binary object possess? – CMCDragonkai Apr 16 '14 at 02:59
  • 66
    In Python 2.7.6 doesn't handle `b"\x80\x02\x03".decode("utf-8")` -> `UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte`. – martineau May 18 '14 at 20:12
  • 19
    If the content is random binary values, the `utf-8` conversion is likely to fail. Instead see @techtonik answer (below) http://stackoverflow.com/a/27527728/198536 – wallyk May 27 '15 at 21:21
  • 1
    @AaronMaenpaa : This won’t work on an array like it worked in python2. – user2284570 Oct 20 '15 at 23:02
  • 1
    @Profpatsch: it's kinda hidden. See answer below for a reference to documentation. It's also in the bytes-docstring (`help(command_stdout)`). – serv-inc Nov 13 '15 at 10:25
  • 1
    @nikow: small update on using sys.stdout.encoding - this is allowed to be None which will cause encode() to fail. – Kevin Shea Oct 09 '17 at 12:03
  • 1
    I have some code for networking program. and its [def dataReceived(self, data): print(f"Received quote: {data}")] its printing out "received quote: b'\x00&C:\\Users\\.pycharm2016.3\\config\x00&C:\\users\\pycharm\\system\x00\x03--' how would i change my code to fix this. WHen i write print(f"receivedquote: {data}".decode('utf-8') that does not do the trick. – Jessica Warren Jan 01 '18 at 21:20
  • 3
    While this is generally the way to go, you need to be certain you've got the encoding right, or your code might end up vomiting all over itself. To make it worse, data from the outside world can contain unexpected encodings. The chardet library at https://pypi.org/project/chardet/ can help you with this, but again, always program defensively, sometimes even chardet can get it wrong, so wrap your junk with some appropriate Exception handling. – Shayne Jul 04 '18 at 17:39
  • UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 168: invalid start byte – Shihabudheen K M Jul 27 '18 at 06:46
  • is this expected? I get `AttributeError: 'str' object has no attribute 'decode'` but the string has a b at the beggining: `b'(Answer 1 Ack)\n'` hu?! – Charlie Parker Mar 14 '19 at 22:29
  • Had to use decode("Latin") for result of a PHP script. Had been getting charset-related errors otherwise when using print_r or var_dump. – Íhor Mé Jun 26 '20 at 22:32
  • 1
    **Official documentation for this:** for all `bytes` and `bytearray` operations (methods which can be called on these objects), see here: https://docs.python.org/3/library/stdtypes.html#bytes-methods. For `bytes.decode()` in particular, see here: https://docs.python.org/3/library/stdtypes.html#bytes.decode. – Gabriel Staples Mar 25 '21 at 04:12
  • what's it's vice versa? – Ali Nov 14 '21 at 06:08
  • This answer gives me: AttributeError: 'str' object has no attribute 'decode' in Python 3.9.13 – Seth May 27 '22 at 17:35
354

You need to decode the byte string and turn it into a character (Unicode) string.

On Python 2

encoding = 'utf-8'
'hello'.decode(encoding)

or

unicode('hello', encoding)

On Python 3

encoding = 'utf-8'
b'hello'.decode(encoding)

or

str(b'hello', encoding)
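
A short sketch of how the two Python 3 spellings compare, and of what happens when the encoding argument is left out of str(). The variable names are only for illustration.

```python
data = b"hello"

decoded = data.decode("utf-8")     # 'hello'
constructed = str(data, "utf-8")   # 'hello': same result via the str constructor
representation = str(data)         # "b'hello'": the repr of the bytes, not a decode

print(decoded == constructed)
print(representation)
```

Without an encoding, str() only produces a printable representation of the bytes object, which is rarely what you want.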
Peter Mortensen
  • 30,030
  • 21
  • 100
  • 124
dF.
  • 71,061
  • 29
  • 127
  • 135
  • 5
    On Python 3, what if the string is in a variable? – Alaa M. Feb 27 '20 at 14:47
  • 2
    @AlaaM.: the same. If you have `variable = b'hello'`, then `unicode_text = variable.decode(character_encoding)` – jfs Apr 12 '20 at 05:03
  • 5
    for me, `variable = variable.decode()` automagically got it into a string format I wanted. – Alex Hall Jul 19 '20 at 03:41
  • 5
    @AlexHall> fwiw, you might be interested to know that automagic uses utf8, which is the default value for `encoding` arg if you do not supply it. See [`bytes.decode`](https://docs.python.org/3/library/stdtypes.html#bytes.decode) – spectras Apr 17 '21 at 11:12
  • Using any decoding gives me: AttributeError: 'str' object has no attribute 'decode' – Seth May 27 '22 at 17:43
247

I think this way is easy:

>>> bytes_data = [112, 52, 52]
>>> "".join(map(chr, bytes_data))
'p44'
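
For comparison, a sketch showing that joining chr() over a list of integers in the range 0-255 gives the same string as building a bytes object and decoding it as latin-1. The variable names are made up here.

```python
byte_values = [112, 52, 52]

joined = "".join(map(chr, byte_values))
decoded = bytes(byte_values).decode("latin-1")

print(joined, decoded, joined == decoded)

# The equivalence holds for every possible byte value, not just ASCII.
all_values = list(range(256))
same_for_all = "".join(map(chr, all_values)) == bytes(all_values).decode("latin-1")
```

In other words, the chr() approach is an implicit latin-1 decode, so it is only correct if the data really is latin-1 (or plain ASCII).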
Zohnannor
  • 5
  • 3
  • 5
Sisso
  • 2,839
  • 1
  • 14
  • 13
  • 6
    Thank you, your method worked for me when none other did. I had a non-encoded byte array that I needed turned into a string. Was trying to find a way to re-encode it so I could decode it into a string. This method works perfectly! – leetNightshade May 10 '14 at 00:28
  • 6
    @leetNightshade: yet it is terribly inefficient. If you have a byte array you only need to decode. – Martijn Pieters Sep 01 '14 at 16:25
  • 20
    @Martijn Pieters I just did a simple benchmark with these other answers, running multiple 10,000 runs http://stackoverflow.com/a/3646405/353094 And the above solution was actually much faster every single time. For 10,000 runs in Python 2.7.7 it takes 8ms, versus the others at 12ms and 18ms. Granted there could be some variation depending on input, Python version, etc. Doesn't seem too slow to me. – leetNightshade Sep 01 '14 at 17:06
  • @leetNightshade: yet the OP here is using Python 3. – Martijn Pieters Sep 01 '14 at 17:11
  • @Martijn Pieters Fair enough. In Python 3.4.1 x86 this method takes 17.01ms, the others 24.02ms, and 11.51ms for the bytearray to string cast. So it's not the fastest in that case. – leetNightshade Sep 01 '14 at 17:13
  • @leetNightshade: you also appear to be talking about integers and bytearrays, not a `bytes` value (as returned by `Popen.communicate()`). – Martijn Pieters Sep 01 '14 at 17:20
  • 5
    @Martijn Pieters Yes. So with that point, this isn't the best answer for the body of the question that was asked. And the title is misleading, isn't it? He/she wants to convert a byte string to a regular string, not a byte array to a string. This answer works okay for the title of the question that was asked. – leetNightshade Sep 01 '14 at 17:28
  • It can convert bytes read from a file with `"rb"` to string, and It's handy when you don't know the encoding – Sasszem Oct 01 '16 at 22:53
  • 9
    @Sasszem: this method is a perverted way to express: `a.decode('latin-1')` where `a = bytearray([112, 52, 52])` (["There Ain't No Such Thing as Plain Text"](http://www.joelonsoftware.com/articles/Unicode.html). If you've managed to convert bytes into a text string then you used some encoding—`latin-1` in this case) – jfs Nov 16 '16 at 03:16
  • 8
    For python 3 this should be equivalent to [`bytes([112, 52, 52])`](https://stackoverflow.com/a/35219695/281545) - btw bytes is a bad name for a local variable exactly because it's a p3 builtin – Mr_and_Mrs_D Oct 11 '17 at 15:14
  • 4
    @leetNightshade: For completeness sake: `bytes(list_of_integers).decode('ascii')` is about 1/3rd faster than `''.join(map(chr, list_of_integers))` on Python 3.6. – Martijn Pieters Jul 03 '18 at 12:01
123

If you don't know the encoding, then to read binary input into a string in a way that is compatible with both Python 3 and Python 2, use the ancient MS-DOS CP437 encoding:

import sys

PY3K = sys.version_info >= (3, 0)

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('cp437'))

Because encoding is unknown, expect non-English symbols to translate to characters of cp437 (English characters are not translated, because they match in most single byte encodings and UTF-8).
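
A quick sketch of the property this relies on: in Python's cp437 codec every one of the 256 byte values maps to some character, so decoding cannot fail and the original bytes can be recovered. The variable names are illustrative only.

```python
every_byte = bytes(range(256))

text = every_byte.decode("cp437")      # never raises: all 256 values are mapped
round_tripped = text.encode("cp437")   # the mapping is reversible

# Non-ASCII bytes become cp437 characters, e.g. 0x80 is 'Ç' in this code page.
sample = b"\x80abc".decode("cp437")

print(len(text), round_tripped == every_byte)
```

The result is lossless but not necessarily meaningful: bytes above 0x7f show up as whatever cp437 assigns to them, regardless of what the producer intended.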

Decoding arbitrary binary input to UTF-8 is unsafe, because you may get this:

>>> b'\x00\x01\xffsd'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid
start byte

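latin-1, which was popular with Python 2, behaves differently: it maps every one of the 256 byte values to a code point, so decoding with it never raises, although the result is mojibake if the data was not really latin-1. The infamous "ordinal not in range" error comes from the ascii codec, which was the default in Python 2.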

UPDATE 20150604: Python 3 has the surrogateescape error handler, which lets undecodable bytes pass through a str and back to binary data without data loss or crashes, but it needs conversion tests, [binary] -> [str] -> [binary], to validate both performance and reliability.
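
A minimal sketch of such a [binary] -> [str] -> [binary] round trip with surrogateescape on Python 3. It only checks reliability on one sample, not performance.

```python
raw = b"\x00\x01\xffsd"

# The undecodable byte 0xff becomes the lone surrogate U+DCFF in the str.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler turns the surrogate back into byte 0xff.
restored = text.encode("utf-8", "surrogateescape")

print(restored == raw)
```

Note that a string containing lone surrogates cannot be printed or encoded with the default strict handler, so this is mainly useful when the text has to be turned back into bytes later.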

UPDATE 20170116: Thanks to comment by Nearoo - there is also a possibility to slash escape all unknown bytes with backslashreplace error handler. That works only for Python 3, so even with this workaround you will still get inconsistent output from different Python versions:

import sys

PY3K = sys.version_info >= (3, 0)

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('utf-8', 'backslashreplace'))

See Python’s Unicode Support for details.
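
To compare the built-in error handlers side by side, here is a small Python 3 sketch (backslashreplace for decoding needs Python 3.5 or later). The variable names are made up for the example.

```python
raw = b"\x80abc"

ignored = raw.decode("utf-8", "ignore")             # drops the bad byte
replaced = raw.decode("utf-8", "replace")           # inserts U+FFFD instead
escaped = raw.decode("utf-8", "backslashreplace")   # keeps it as the text \x80

# ascii() keeps the output printable on any terminal.
print(ascii(ignored), ascii(replaced), ascii(escaped))
```

Only backslashreplace keeps enough information to see which byte was there; ignore and replace both lose it.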

UPDATE 20170119: I decided to implement slash escaping decode that works for both Python 2 and Python 3. It should be slower than the cp437 solution, but it should produce identical results on every Python version.

# --- preparation

import codecs

def slashescape(err):
    """codecs error handler. err is a UnicodeDecodeError instance. Return
    a tuple with a replacement for the undecodable part of the input
    and a position where decoding should continue."""
    # The failing span may cover several bytes; bytearray() yields
    # integers on both Python 2 and Python 3.
    thebytes = bytearray(err.object[err.start:err.end])
    repl = u''.join(u'\\x%02x' % b for b in thebytes)
    return (repl, err.end)

codecs.register_error('slashescape', slashescape)

# --- processing

stream = [b'\x80abc']

lines = []
for line in stream:
    lines.append(line.decode('utf-8', 'slashescape'))
Peter Mortensen
  • 30,030
  • 21
  • 100
  • 124
anatoly techtonik
  • 18,796
  • 8
  • 119
  • 137
  • 6
    I really feel like Python should provide a mechanism to replace missing symbols and continue. – anatoly techtonik Feb 20 '15 at 09:04
  • @techtonik : This won’t work on an array like it worked in python2. – user2284570 Oct 20 '15 at 23:02
  • @user2284570 do you mean list? And why it should work on arrays? Especially arrays of floats.. – anatoly techtonik Oct 22 '15 at 07:25
  • 3
    You can also just ignore unicode errors with `b'\x00\x01\xffsd'.decode('utf-8', 'ignore')` in python 3. – Antonis Kalou Jul 06 '16 at 12:14
  • 3
    @anatolytechtonik There is the possibility to leave the escape sequence in the string and move on: `b'\x80abc'.decode("utf-8", "backslashreplace")` will result in `'\\x80abc'`. This information was taken from the [unicode documentation page](https://docs.python.org/3/howto/unicode.html#python-s-unicode-support) which seems to have been updated since the writing of this answer. – Nearoo Jan 16 '17 at 10:40
  • @Nearoo updated the answer. Unfortunately it doesn't work with Python 2 - see https://stackoverflow.com/questions/25442954/how-should-i-decode-bytes-using-ascii-without-losing-any-junk-bytes-if-xmlch – anatoly techtonik Jan 16 '17 at 14:53
  • "Decoding arbitrary binary input to UTF-8 is unsafe... The same applies to latin-1". Can you elaborate on this? `b'\x00\x01\xffsd'.decode("latin-1")` runs without crashing on my machine (tested in 2.7.11 and 3.7.3). Can you give an example of a bytes object that crashes with "ordinal not in range" when you try to latin1-decode it? – Kevin Jun 03 '19 at 13:58
  • "Decoding arbitrary binary input to UTF-8 is unsafe, because you may get this [error.]" Often, throwing an exception is considered *safer* than silently producing incorrect characters. It's considered safer to know that your data has been corrupted, than not to know. That's why Python 3 conversion from byte to string is designed the way it is. Your application may value resilience over correctness, but we can't assume that in general. – LarsH Nov 22 '19 at 02:32
  • Sorry I am not seeing \x80 off from the final output with print(line) b'\x80abc' . I have data like below not sure how can strip off first weird characters : bytearray(b'\x00\xfc\x01{"seq":4,"firstname":"Maria ","middlename":"Anne","lastname":"Jones","dob_year":2005,"dob_month":5,"gender":"F","salary":4000}') – pauldx Apr 22 '20 at 23:05
  • @Kevin can you try with the version of Python that was active on `Dec 17 '14 at 14:23` when I was writing this answer? That was some of the Python 2 versions with Windows, probably Vista. Most likely the error was fixed in Python 2.7 before freezing it – anatoly techtonik Jun 26 '20 at 11:51
  • Ok. Using 2.7.11 again, `b'\x00\x01\xffsd'.decode('utf-8')` crashes with `UnicodeDecodeError`, and `b'\x00\x01\xffsd'.decode('latin-1')` returns `u'\x00\x01\xffsd'`. Same as last time, I believe. – Kevin Jun 27 '20 at 15:41
  • As I understand it with the ISO 8859 series, the ISO definition only defines the printable characters, not the control codes, this is why you see gaps in the tables on wikipedia. However in practice codes 0-31 and 127-159 are mapped to the corresponding unicode control codes. So decoding arbitary bytes with ISO-8859-1 (aka latin1) is safe (this also applies to some but not all of the other ISO-8859 series encodings). – plugwash Oct 24 '20 at 04:40
116

In Python 3, the default encoding is "utf-8", so you can directly use:

b'hello'.decode()

which is equivalent to

b'hello'.decode(encoding="utf-8")

On the other hand, in Python 2, encoding defaults to the default string encoding. Thus, you should use:

b'hello'.decode(encoding)

where encoding is the encoding you want.

Note: support for keyword arguments was added in Python 2.7.

Peter Mortensen
  • 30,030
  • 21
  • 100
  • 124
lmiguelvargasf
  • 51,786
  • 40
  • 198
  • 203
49

I think you actually want this:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]
>>> command_text = command_stdout.decode(encoding='windows-1252')

Aaron's answer was correct, except that you need to know which encoding to use. And I believe that Windows uses 'windows-1252'. It will only matter if you have some unusual (non-ASCII) characters in your content, but then it will make a difference.

By the way, the fact that it does matter is the reason that Python moved to using two different types for binary and text data: it can't convert magically between them, because it doesn't know the encoding unless you tell it! The only way YOU would know is to read the Windows documentation (or read it here).

Peter Mortensen
  • 30,030
  • 21
  • 100
  • 124
mcherm
  • 22,452
  • 10
  • 43
  • 50
  • 4
    `open()` function for text streams or `Popen()` if you pass it `universal_newlines=True` do magically decide character encoding for you (`locale.getpreferredencoding(False)` in Python 3.3+). – jfs Feb 21 '14 at 17:00
  • 2
    `'latin-1'` is a verbatim encoding with all code points set, so you can use that to effectively read a byte string into whichever type of string your Python supports (so verbatim on Python 2, into Unicode for Python 3). – tripleee Feb 17 '17 at 07:32
  • @tripleee: `'latin-1'` is a good way to get mojibake. Also there are magical substitution on Windows: it is surprisingly hard to pipe data from one process to another unmodified e.g., [`dir`: `\xb6` -> `\x14` (the example at the end of my answer)](https://stackoverflow.com/a/40628661/4279) – jfs Apr 12 '20 at 05:00
40

Since this question is actually asking about subprocess output, you have more direct approaches available. The most modern would be using subprocess.check_output and passing text=True (Python 3.7+) to automatically decode stdout using the system default encoding:

text = subprocess.check_output(["ls", "-l"], text=True)

For Python 3.6, Popen accepts an encoding keyword:

>>> from subprocess import Popen, PIPE
>>> text = Popen(['ls', '-l'], stdout=PIPE, encoding='utf-8').communicate()[0]
>>> type(text)
str
>>> print(text)
total 0
-rw-r--r-- 1 wim badger 0 May 31 12:45 some_file.txt

The general answer to the question in the title, if you're not dealing with subprocess output, is to decode bytes to text:

>>> b'abcde'.decode()
'abcde'

With no argument, sys.getdefaultencoding() will be used, which is always 'utf-8' on Python 3. If your data is in a different encoding, then you must specify the encoding explicitly in the decode call:

>>> b'caf\xe9'.decode('cp1250')
'café'
wim
  • 302,178
  • 90
  • 548
  • 690
  • Decoding `ls` output using `utf-8` encoding may fail (see example in [my answer from 2016](https://stackoverflow.com/a/40628661/4279)). – jfs Nov 27 '19 at 17:18
  • 1
    @Boris: if `encoding` parameter is given, then the `text` parameter is ignored. – jfs Nov 27 '19 at 17:18
  • This is the proper answer for `subprocess`. Maybe still emphasize how `Popen` is almost always the wrong tool if you just want to wait for the subprocess and get its result; like the documentation says, use `subprocess.run` or one of the legacy functions `check_call` or `check_output`. – tripleee Dec 16 '21 at 07:24
38

Set universal_newlines to True, i.e.

command_stdout = Popen(['ls', '-l'], stdout=PIPE, universal_newlines=True).communicate()[0]
Borislav Sabev
  • 4,701
  • 1
  • 20
  • 30
ContextSwitch
  • 381
  • 3
  • 2
  • 6
    I've been using this method and it works. Although, it's just guessing at the encoding based on user preferences on your system, so it's not as robust as some other options. This is what it's doing, referencing docs.python.org/3.4/library/subprocess.html: "If universal_newlines is True, [stdin, stdout and stderr] will be opened as text streams in universal newlines mode using the encoding returned by locale.getpreferredencoding(False)." – twasbrillig Mar 01 '14 at 22:43
  • 3
    [On 3.7](https://docs.python.org/3/whatsnew/3.7.html#subprocess) you can (and should) do `text=True` instead of `universal_newlines=True`. – Boris Verkhovskiy Jan 13 '19 at 17:02
34

To interpret a byte sequence as a text, you have to know the corresponding character encoding:

unicode_text = bytestring.decode(character_encoding)

Example:

>>> b'\xc2\xb5'.decode('utf-8')
'µ'

ls command may produce output that can't be interpreted as text. File names on Unix may be any sequence of bytes except slash b'/' and zero b'\0':

>>> open(bytes(range(0x100)).translate(None, b'\0/'), 'w').close()

Trying to decode such byte soup using utf-8 encoding raises UnicodeDecodeError.

It can be worse. The decoding may fail silently and produce mojibake if you use a wrong incompatible encoding:

>>> '—'.encode('utf-8').decode('cp1252')
'â€”'

The data is corrupted but your program remains unaware that a failure has occurred.
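
A sketch of that silent failure, and of the one case where it can be undone: if every byte happened to map to a character in the wrong codec, re-encoding with that codec restores the original bytes. Escapes are used so the example does not depend on the source file's encoding.

```python
original = "\u2014"  # an em dash, written as an escape for clarity

# Encode as UTF-8 (bytes E2 80 94), then decode with the wrong codec.
mojibake = original.encode("utf-8").decode("cp1252")

# Undo the mistake: back to the original bytes, then decode correctly.
recovered = mojibake.encode("cp1252").decode("utf-8")

print(len(original), len(mojibake), recovered == original)
```

This repair only works if you know which wrong codec was applied and nothing was dropped or replaced along the way, so it is a rescue technique rather than a substitute for decoding correctly.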

In general, what character encoding to use is not embedded in the byte sequence itself. You have to communicate this info out-of-band. Some outcomes are more likely than others and therefore chardet module exists that can guess the character encoding. A single Python script may use multiple character encodings in different places.


ls output can be converted to a Python string using os.fsdecode() function that succeeds even for undecodable filenames (it uses sys.getfilesystemencoding() and surrogateescape error handler on Unix):

import os
import subprocess

output = os.fsdecode(subprocess.check_output('ls'))

To get the original bytes, you could use os.fsencode().
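
A small sketch of that round trip, assuming a Unix system (where the surrogateescape handler is used). The filename bytes are a made-up example that is not valid UTF-8.

```python
import os

# Hypothetical filename bytes containing a byte that is invalid in UTF-8.
raw_name = b"report-\xff.txt"

name = os.fsdecode(raw_name)   # a str, even though the bytes are not valid text
restored = os.fsencode(name)   # the exact original bytes again

print(type(name).__name__, restored == raw_name)
```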

If you pass universal_newlines=True parameter then subprocess uses locale.getpreferredencoding(False) to decode bytes e.g., it can be cp1252 on Windows.

To decode the byte stream on-the-fly, io.TextIOWrapper() could be used: example.
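
A minimal sketch of on-the-fly decoding with io.TextIOWrapper. Here io.BytesIO stands in for a real binary stream such as a subprocess pipe, and the sample data is invented for the example.

```python
import io

# A stand-in for a binary stream (for instance, a process's stdout pipe).
binary_stream = io.BytesIO("first line\nsecond \u00b5 line\n".encode("utf-8"))

# Wrap it so that iterating yields decoded str lines instead of bytes.
text_stream = io.TextIOWrapper(binary_stream, encoding="utf-8")
lines = [line.rstrip("\n") for line in text_stream]

print(len(lines))
```

The wrapper decodes incrementally, so a multi-byte character split across two reads is still handled correctly.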

Different commands may use different character encodings for their output e.g., dir internal command (cmd) may use cp437. To decode its output, you could pass the encoding explicitly (Python 3.6+):

output = subprocess.check_output('dir', shell=True, encoding='cp437')

The filenames may differ from os.listdir() (which uses Windows Unicode API) e.g., '\xb6' can be substituted with '\x14'—Python's cp437 codec maps b'\x14' to control character U+0014 instead of U+00B6 (¶). To support filenames with arbitrary Unicode characters, see Decode PowerShell output possibly containing non-ASCII Unicode characters into a Python string

jfs
  • 374,366
  • 172
  • 933
  • 1,594
28

While @Aaron Maenpaa's answer just works, a user recently asked:

Is there any more simply way? 'fhand.read().decode("ASCII")' [...] It's so long!

You can use:

command_stdout.decode()

decode() has default arguments:

codecs.decode(obj, encoding='utf-8', errors='strict')

Felipe Augusto
  • 6,879
  • 9
  • 34
  • 69
serv-inc
  • 32,612
  • 9
  • 143
  • 165
  • `.decode()` that uses `'utf-8'` may fail (command's output may use a different character encoding or even return an undecodable byte sequence). Though if the input is ascii (a subset of utf-8) then `.decode()` works. – jfs Apr 12 '20 at 04:39
19

If calling decode() gives you the following:

AttributeError: 'str' object has no attribute 'decode'

then the object is already a str and needs no decoding. For a bytes object, you can also specify the encoding directly in the str() constructor:

>>> my_byte_str
b'Hello World'

>>> str(my_byte_str, 'utf-8')
'Hello World'
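
To see what that AttributeError is telling you, this sketch (with a made-up variable name) triggers it on purpose and shows that the str() form does not accept an already-decoded string either.

```python
already_text = "Hello World"

# A str has no decode() method on Python 3.
try:
    already_text.decode("utf-8")
except AttributeError as exc:
    decode_error = str(exc)

# str(obj, encoding) only accepts bytes-like objects, not another str.
try:
    str(already_text, "utf-8")
except TypeError as exc:
    constructor_error = str(exc)

print(decode_error)
print(constructor_error)
```

Both errors mean the same thing: the value is already text, so the decoding step can simply be skipped.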
Felipe Augusto
  • 6,879
  • 9
  • 34
  • 69
Broper
  • 1,610
  • 1
  • 12
  • 15
18

If you have had this error:

utf-8 codec can't decode byte 0x8a,

then you can tell decode() to skip the undecodable bytes. Note that "ignore" silently discards them, so the result may be missing data:

data = b"abcdefg"
text = data.decode("utf-8", "ignore")
LinFelix
  • 550
  • 1
  • 11
  • 21
Yasser M
  • 391
  • 4
  • 7
9

I made a function to clean a list

def cleanLists(self, lista):
    lista = [x.strip() for x in lista]
    lista = [x.replace('\n', '') for x in lista]
    lista = [x.replace('\b', '') for x in lista]
    lista = [x.encode('utf8') for x in lista]
    lista = [x.decode('utf8') for x in lista]

    return lista
Tshilidzi Mudau
  • 6,512
  • 6
  • 35
  • 45
eafloresf
  • 146
  • 1
  • 7
  • 6
    You can actually chain all of the `.strip`, `.replace`, `.encode`, etc calls in one list comprehension and only iterate over the list once instead of iterating over it five times. – Taylor D. Edmiston Jun 11 '17 at 19:04
  • 1
    @TaylorEdmiston Maybe it saves on allocation but the number of operations would remain the same. – JulienD Jul 28 '17 at 07:13
9

For Python 3, this is a much safer and Pythonic approach to convert from byte to string:

def byte_to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes): # Check if it's in bytes
        print(bytes_or_str.decode('utf-8'))
    else:
        print("Object not of byte type")

byte_to_str(b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n')

Output:

total 0
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2
Peter Mortensen
  • 30,030
  • 21
  • 100
  • 124
Taufiq Rahman
  • 5,436
  • 2
  • 35
  • 43
  • 6
    1) As @bodangly said, type checking is not pythonic at all. 2) The function you wrote is named "`byte_to_str`" which implies it will return a str, but it only prints the converted value, *and* it prints an error message if it fails (but doesn't raise an exception). This approach is also unpythonic and obfuscates the `bytes.decode` solution you provided. – cosmicFluke May 25 '18 at 19:51
9

When working with data from Windows systems (with \r\n line endings), my answer is

String = Bytes.decode("utf-8").replace("\r\n", "\n")

Why? Try this with a multiline Input.txt:

Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8")
open("Output.txt", "w").write(String)

On Windows, all your line endings will be doubled (to \r\r\n), leading to extra empty lines, because writing in text mode translates each \n to \r\n. Python's text-read functions usually normalize line endings so that strings use only \n. If you receive binary data from a Windows system, Python does not have a chance to do that. Thus,

Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8").replace("\r\n", "\n")
open("Output.txt", "w").write(String)

will replicate your original file.
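
An alternative sketch, in case you would rather keep the \r\n sequences in the string: opening the output file with newline="" switches off newline translation on write. The temporary path and sample bytes are assumptions made for this example.

```python
import os
import tempfile

windows_bytes = b"line one\r\nline two\r\n"
text = windows_bytes.decode("utf-8")

path = os.path.join(tempfile.mkdtemp(), "Output.txt")

# newline="" disables newline translation, so "\r\n" is written as-is.
with open(path, "w", encoding="utf-8", newline="") as outfile:
    outfile.write(text)

with open(path, "rb") as infile:
    written = infile.read()

print(written == windows_bytes)
```

Which variant fits depends on whether the rest of your code expects strings with plain \n or with the original line endings.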

bers
  • 3,714
  • 1
  • 32
  • 44
  • I was looking for `.replace("\r\n", "\n")` addition so long. This is the answer if you want to render HTML properly. – mhlavacka Feb 20 '19 at 09:45
5

From sys — System-specific parameters and functions:

To write or read binary data from/to the standard streams, use the underlying binary buffer. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc').

Peter Mortensen
  • 30,030
  • 21
  • 100
  • 124
Zhichang Yu
  • 343
  • 3
  • 7
  • 3
    The pipe to the subprocess is *already* a binary buffer. Your answer fails to address how to get a string value from the resulting `bytes` value. – Martijn Pieters Sep 01 '14 at 17:34
5

For your specific case of "run a shell command and get its output as text instead of bytes", on Python 3.7+, you should use subprocess.run and pass in text=True (as well as capture_output=True to capture the output):

command_result = subprocess.run(["ls", "-l"], capture_output=True, text=True)
command_result.stdout  # is a `str` containing your program's stdout

text used to be called universal_newlines, and was changed (well, aliased) in Python 3.7. If you want to support Python versions before 3.7, pass in universal_newlines=True instead of text=True, and stdout=subprocess.PIPE instead of capture_output=True, which was also added in 3.7.
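
A minimal sketch of the same call with an explicit encoding instead of the locale default. The child command is an assumption chosen to keep the example portable: a Python one-liner run through sys.executable stands in for ls -l.

```python
import subprocess
import sys

# Run a child Python process so the example does not depend on `ls` existing.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from child')"],
    capture_output=True,   # Python 3.7+
    encoding="utf-8",      # passing an encoding also switches on text mode
    check=True,            # raise CalledProcessError on a non-zero exit status
)
output = result.stdout     # a str, not bytes

print(output.strip())
```

Passing encoding explicitly is useful when you know what the child writes and do not want the result to depend on the locale of the machine running the script.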

Boris Verkhovskiy
  • 10,733
  • 7
  • 77
  • 79
4

Call .decode() on the bytes object to turn it into a string. Pass the encoding, for example 'utf-8', as the argument.

Aarav Dave
  • 79
  • 1
  • 7
3
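def toString(value):
    try:
        # bytes has decode(); a str does not on Python 3
        return value.decode("utf-8")
    except AttributeError:
        # value is already a str, so return it unchanged
        return value

b = b'97.080.500'
s = '97.080.500'
print(toString(b))
print(toString(s))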
Leonardo Filipe
  • 1,139
  • 12
  • 8
  • 1
    While this code may answer the question, providing additional [context](https://meta.stackexchange.com/q/114762) regarding _how_ and/or _why_ it solves the problem would improve the answer's long-term value. Remember that you are answering the question for readers in the future, not just the person asking now! Please [edit] your answer to add an explanation, and give an indication of what limitations and assumptions apply. It also doesn't hurt to mention why this answer is more appropriate than others. – Dev-iL Jun 04 '18 at 05:37
3

If you want to convert any bytes, not just string converted to bytes:

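import base64
import json

with open("bytesfile", "rb") as infile:
    # b85encode returns bytes made of ASCII characters; decode them to get a str
    text = base64.b85encode(infile.read()).decode("ascii")

with open("bytesfile", "rb") as infile:
    # Alternatively, store the bytes as a JSON list of integers
    text2 = json.dumps(list(infile.read()))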

This is not very efficient, however. It will turn a 2 MB picture into 9 MB.
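
If the string has to be turned back into the original bytes later, a sketch along these lines with standard Base64 may help. The sample data is arbitrary and the variable names are invented for the example.

```python
import base64

raw = bytes(range(256))  # arbitrary binary data that is not valid UTF-8 text

# b64encode returns ASCII-only bytes, so decoding them as ASCII cannot fail.
as_text = base64.b64encode(raw).decode("ascii")

# b64decode accepts the str directly and returns the original bytes.
restored = base64.b64decode(as_text)

print(len(raw), len(as_text), restored == raw)
```

Base64 grows the data by about one third, which is considerably less than a JSON list of integers.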

Peter Mortensen
  • 30,030
  • 21
  • 100
  • 124
HCLivess
  • 941
  • 1
  • 11
  • 20
3

try this

bytes.fromhex('c3a9').decode('utf-8') 
Victor Choy
  • 3,730
  • 23
  • 32
3

We can decode the bytes object to produce a string using bytes.decode(encoding='utf-8', errors='strict'). See the bytes.decode documentation for details.

Python3 example:

byte_value = b"abcde"
print("Initial value = {}".format(byte_value))
print("Initial value type = {}".format(type(byte_value)))
string_value = byte_value.decode("utf-8")
# utf-8 is used here because it is a very common encoding, but you need to use the encoding your data is actually in.
print("------------")
print("Converted value = {}".format(string_value))
print("Converted value type = {}".format(type(string_value)))

Output:

Initial value = b'abcde'
Initial value type = <class 'bytes'>
------------
Converted value = abcde
Converted value type = <class 'str'>

NOTE: In Python3 by default encoding type is utf-8. So, <byte_string>.decode("utf-8") can be also written as <byte_string>.decode()

Shubhank Gupta
  • 390
  • 2
  • 7
  • 18
2

Try using this one; this function ignores every byte sequence that is not valid in the given encoding (utf-8 by default) and returns a clean string. It is tested for Python 3.6 and above.

def bin2str(text, encoding='utf-8'):
    """Convert bytes to a Unicode string, dropping all undecodable bytes.
    text: bytes to work on
    encoding: encoding of the input bytes (default 'utf-8')"""

    return text.decode(encoding, 'ignore')

Here, the function takes the bytes and decodes them (converts binary data to characters using the given character set); the 'ignore' argument skips all data that is not valid in that character set, and the function finally returns your desired string value.

If you are not sure about the encoding, use sys.getdefaultencoding() to get the default encoding of your device.

Ratul Hasan
  • 419
  • 5
  • 12