I'm trying to scrape a website, but it gives me an error.

I'm using the following code:

import urllib.request
from bs4 import BeautifulSoup

get = urllib.request.urlopen("https://www.website.com/")
html = get.read()

soup = BeautifulSoup(html)

print(soup)

And I'm getting the following error:

File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 70924-70950: character maps to <undefined>

What can I do to fix this?

SstrykerR

8 Answers

I was getting the same UnicodeEncodeError when saving scraped web content to a file. To fix it I replaced this code:

with open(fname, "w") as f:
    f.write(html)

with this:

with open(fname, "w", encoding="utf-8") as f:
    f.write(html)

If you need to support Python 2, then use this:

import io
with io.open(fname, "w", encoding="utf-8") as f:
    f.write(html)

If the file needs to be in an encoding other than UTF-8, pass that encoding as the encoding argument instead.
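As a minimal sketch of the fix above (the sample string and the temporary file name are made up for illustration), writing with an explicit encoding and reading back with the same one round-trips characters that cp1252 cannot represent:

```python
import os
import tempfile

# Hypothetical sample standing in for scraped HTML; it contains
# characters that the cp1252 codec cannot encode.
html = "<p>Price: 100 \u20bd, arrow: \u2192, text: \u4e2d\u6587</p>"

# Illustrative path inside a fresh temporary directory.
fname = os.path.join(tempfile.mkdtemp(), "page.html")

# Pass the encoding explicitly instead of relying on the platform default.
with open(fname, "w", encoding="utf-8") as f:
    f.write(html)

# Read with the same encoding to get the original text back.
with open(fname, "r", encoding="utf-8") as f:
    restored = f.read()
```

If another program later opens the file assuming a different encoding, the text can still show up as gibberish even though the write succeeded.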

twasbrillig
  • On Mac (Python 3) it works perfectly with plain open and no encoding, but on Windows (W10, Python 3) that is not an option. It only works with the encoding="utf-8" param. – xtornasol512 Apr 30 '17 at 01:14
  • Thank you. It worked for me; I was working with XML files and writing the result of xml.toprettyxml() to a new file. – Luis Cabrera Benito Jan 16 '18 at 17:06
  • This should be the accepted answer because it will eventually write a string to the output, and not a string representation of bytes. – Shirkan Feb 14 '19 at 13:55
  • OP requested to read the file however, not write the file. The issue seems to be console-related. – NaturalBornCamper Sep 20 '19 at 14:01
  • This works. But you didn't have to use io; all you had to do is include `encoding="utf-8"` in the open function. – Ecks Dee Jul 05 '20 at 15:35
  • When I save a file with Russian characters, it prints out some gibberish. – Petr L. Aug 15 '21 at 19:31

I fixed it by adding .encode("utf-8") to soup.

That means that print(soup) becomes print(soup.encode("utf-8")).
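To illustrate what this change does, here is a small sketch (the sample string and the helper name `to_console_safe` are hypothetical, not part of BeautifulSoup). Encoding returns a `bytes` object, so `print` shows its `b'...'` representation rather than the text itself. One possible alternative, if losing unsupported characters is acceptable, is to replace them for the target encoding:

```python
def to_console_safe(text, encoding):
    """Return text with characters that `encoding` cannot represent replaced by '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)

sample = "price \u2192 \u4e2d"  # hypothetical scraped text

# .encode() produces bytes; printing them shows \x escapes, not the characters.
as_bytes = sample.encode("utf-8")
print(as_bytes)

# Assumes the console uses cp1252, as in the traceback from the question.
safe = to_console_safe(sample, "cp1252")
print(safe)
```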

twasbrillig
SstrykerR
  • Don't hardcode the character encoding of your environment (e.g., console) inside your script; [print Unicode directly instead](http://stackoverflow.com/a/32176732/4279) – jfs Sep 07 '15 at 04:09
  • This is just printing the repr of a `bytes` object, which will print as a mess of `\x` sequences if there's a lot of UTF-8 encoded text. I recommend using `win_unicode_console`, as @J.F.Sebastian suggests. – Eryk Sun May 23 '16 at 20:48
  • I used the above solution but am still getting issues: class MyStreamListener(tweepy.StreamListener): def on_status(self, status): print(str(status.encode("utf-8"))) UnicodeEncodeError: 'charmap' codec can't encode character '\u2019' in position 87: character maps to – Vivek Sep 26 '16 at 23:49
  • This makes it print out `b'\x02x\xc2\xa9'` (a bytes object) instead – MilkyWay90 Jun 16 '19 at 16:20
  • `print(soup.encode("utf-8"))` worked for me, but before that I also had to add `with open("f_name", encoding="utf-8") as f: soup = BeautifulSoup(f, "html.parser")` – TheWalkingData Nov 13 '19 at 22:57

On Python 3.7 running on Windows 10, this worked (I am not sure whether it works on other platforms or other versions of Python).

Replacing this line:

with open('filename', 'w') as f:

With this:

with open('filename', 'w', encoding='utf-8') as f:

This works because the file is now written as UTF-8, which can encode every Unicode character. Without the argument, Python uses the platform's default encoding (cp1252 in the traceback above) and raises an error as soon as it meets a character that this encoding does not support.
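The difference between the two codecs can be seen directly. This is a small sketch; the sample character is only an illustration of one that cp1252 lacks:

```python
ch = "\u4e2d"  # a CJK character, chosen as an example missing from cp1252

try:
    ch.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    # Same exception type as in the question's traceback.
    cp1252_ok = False

# UTF-8 has a byte sequence for this character, so encoding succeeds.
utf8_bytes = ch.encode("utf-8")
```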

MilkyWay90
Sabbir Ahmed
Set these environment variables in the Windows command prompt before running the script:

set PYTHONIOENCODING=utf-8
set PYTHONLEGACYWINDOWSSTDIO=utf-8

You may or may not need the second variable, PYTHONLEGACYWINDOWSSTDIO. Python only checks whether it is set to a non-empty string, so its value is not interpreted as an encoding.

Alternatively, this can be done in code (although it seems that doing it through env vars is recommended):

import sys
# reconfigure() requires Python 3.7 or later
sys.stdin.reconfigure(encoding='utf-8')
sys.stdout.reconfigure(encoding='utf-8')
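Some environments replace the standard streams with objects that have no `reconfigure` method, in which case the calls above raise `AttributeError`. A defensive variant might look like the sketch below; the helper name `force_utf8` is hypothetical, and the in-memory stream only stands in for a console that starts out as cp1252:

```python
import io
import sys

def force_utf8(stream):
    """Switch a text stream to UTF-8 if it supports reconfigure(); report success."""
    # reconfigure() exists on io.TextIOWrapper (Python 3.7+); replacement
    # streams installed by some tools may not provide it.
    if hasattr(stream, "reconfigure"):
        stream.reconfigure(encoding="utf-8")
        return True
    return False

# Demonstration on an in-memory stream that starts out as cp1252.
raw = io.BytesIO()
stream = io.TextIOWrapper(raw, encoding="cp1252")
changed = force_utf8(stream)
stream.write("\u4e2d")  # would raise UnicodeEncodeError under cp1252
stream.flush()

if __name__ == "__main__":
    # In a real script, apply the helper to the standard output stream.
    force_utf8(sys.stdout)
```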

Additionally: Reproducing this error was a bit of a pain, so leaving this here too in case you need to reproduce it on your machine:

set PYTHONIOENCODING=windows-1252
set PYTHONLEGACYWINDOWSSTDIO=windows-1252
Voy
  • This is perfect; I was getting this error while using the Python Debugger (pdb) on a Windows system looking at source code that used utf-8 and had lots of emoji in it. Every time I did a "list" command to see where I was, the "charmap" error appeared. Setting these two environment variables made my debugging as smooth as silk. – nutjob Oct 09 '20 at 18:31
  • `sys.stdin.reconfigure` is invalid on Python 3.9.0; it throws `AttributeError: 'StdInputFile' object has no attribute 'reconfigure'` – Suncatcher Dec 24 '20 at 10:13
  • On Windows 10, using Git Bash, setting the env variables mentioned above did NOT work; however, setting the two lines in the actual Python code file DID work: `sys.stdin.reconfigure(encoding='utf-8') sys.stdout.reconfigure(encoding='utf-8')` – Henrik Carlström Apr 12 '21 at 09:39
  • @Suncatcher Try running this Python script in a different IDE – Petr L. Aug 15 '21 at 19:38
  • @PetrL. Why should I use an IDE at all? All valid Python commands should be interpretable in the Python shell, otherwise they are not valid – Suncatcher Aug 16 '21 at 10:33
  • @Suncatcher This error in particular happens with IDLE (Python's default IDE) – Petr L. Aug 17 '21 at 12:26

I got the same error on Python 3.7 on Windows 10 while saving the response of a GET request. The response from the URL was encoded as UTF-8, so it is worth checking the response's encoding and passing it to open(). Trivial issues like this can waste a lot of time in production.

import requests
resp = requests.get('https://en.wikipedia.org/wiki/NIFTY_50')
print(resp.encoding)
with open('NiftyList.txt', 'w') as f:
    f.write(resp.text)

After I added encoding="utf-8" to the open() call, the file was saved with the correct content:

with open('NiftyList.txt', 'w', encoding="utf-8") as f:
    f.write(resp.text)
Suraj Rao
Abhishek Jain

I faced the same encoding issue, which can occur when you print the data, read or write it, or open it. As others mentioned above, adding .encode("utf-8") will help if you are trying to print it.

soup.encode("utf-8")

If you are opening a file to read or write the scraped data, pass encoding="utf-8" to open():

with open(filename_csv, 'w', newline='', encoding="utf-8") as csv_file:
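A complete version of that pattern might look like the following sketch. The sample rows and the temporary file path are assumptions for illustration; only the `open` arguments come from the line above:

```python
import csv
import os
import tempfile

# Hypothetical scraped rows containing non-ASCII text.
rows = [["name", "city"], ["Zo\u00eb", "\u6771\u4eac"]]

# Illustrative path inside a fresh temporary directory.
filename_csv = os.path.join(tempfile.mkdtemp(), "scraped.csv")

# newline='' lets the csv module control line endings itself.
with open(filename_csv, "w", newline="", encoding="utf-8") as csv_file:
    csv.writer(csv_file).writerows(rows)

# Read the file back with the same encoding to check the round trip.
with open(filename_csv, "r", newline="", encoding="utf-8") as csv_file:
    restored = list(csv.reader(csv_file))
```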

Pardhu Gopalam

For those still getting this error, adding encode("utf-8") to soup will also fix this.

soup = BeautifulSoup(html_doc, 'html.parser').encode("utf-8")
print(soup)
MilkyWay90
Pseudo Sudo

If you are using Windows, try passing encoding='latin1', encoding='iso-8859-1' or encoding='cp1252'. For example:

import pandas as pd
csv_data = pd.read_csv(csvpath, encoding='iso-8859-1')
print(soup.encode('iso-8859-1'))
Karim Sherif