
I have a problem: I'm trying to get a string to be equal in Python 3 and in MySQL. I expect it to be UTF-8 in both places, but the two values are not the same.

I have this string:

station√¶r pc > station√¶r pc

and what I want is for it to look like this:

stationr pc > stationr pc

I have tried `bytes(string, 'utf-8').decode('utf-8')` and a lot of other things.

I hope someone here can help me strip all the weird characters out of my strings so I can use them more easily. The data comes from external files and I can't control the encoding.

ParisNakitaKejser
  • Shouldn't this actually be "stationær pc"? This looks exactly like mojibake for interpreting UTF-8 data with the Mac Roman codec. I can reproduce it with `'stationær'.encode('utf8').decode('macroman')`. – lenz Jan 10 '18 at 13:09
  • In general, there's no need to control the encoding of input data. It's important to *know* what encoding was used, then you can always decode accordingly. – lenz Jan 10 '18 at 13:16
  • If you really want to convert "stationær pc" to "stationr pc", you can do `"stationær pc".encode('ascii', errors='ignore').decode('ascii')`. – lenz Jan 10 '18 at 13:20
  • Thanks, yeah, it's working this way. I need to ignore it by using `bytes(cat['Title'],'utf-8').decode('utf8').encode('ascii', errors='ignore').strip()`. Thanks a lot :) Will you make an answer? – ParisNakitaKejser Jan 10 '18 at 13:23
  • I'm sure there are dozens of duplicates of this question, no need for another duplicate answer. Searching for "python remove non-ascii characters", I found [this answer](https://stackoverflow.com/a/18430817/1698431), for example. – lenz Jan 10 '18 at 13:43
  • Btw, `bytes(x, 'utf8').decode('utf8') == x` for any x, so you can skip that. – lenz Jan 10 '18 at 13:44
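
For reference, the approach from the comments can be sketched as below. This is only a minimal sketch: the helper name `strip_non_ascii` and the variable names are made up for illustration, and it throws the non-ASCII characters away rather than repairing them.

```python
def strip_non_ascii(text: str) -> str:
    """Drop every character that cannot be encoded as ASCII (hypothetical helper)."""
    # errors="ignore" silently skips characters outside ASCII;
    # decoding again turns the resulting bytes back into a str.
    return text.encode("ascii", errors="ignore").decode("ascii")


title = "stationær pc > stationær pc"
cleaned = strip_non_ascii(title)
print(cleaned)  # stationr pc > stationr pc

# Encoding to UTF-8 and decoding again gives back the same string,
# so that step can be left out.
roundtrip = bytes(title, "utf-8").decode("utf-8")
print(roundtrip == title)  # True
```

Note that stopping after `.encode('ascii', errors='ignore')` leaves a `bytes` object; the final `.decode('ascii')` is what turns it back into a `str`.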

1 Answer

As lenz found out, you have "Mojibake" with CHARACTER SET macroman versus utf8.

See this for ways that Mojibake can happen. (It reads "latin1" instead of "macroman".)
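
If the goal is to get the original text back instead of dropping characters, the mix-up can probably be reversed in Python. The following is a rough sketch that assumes the text really was UTF-8 read as Mac Roman; the function name `fix_macroman_mojibake` is hypothetical, and strings that do not fit that assumption are returned unchanged.

```python
def fix_macroman_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were wrongly decoded as Mac Roman (hypothetical helper)."""
    try:
        # Re-encode with the wrong codec to recover the raw bytes,
        # then decode them with the codec that was actually used.
        return text.encode("mac_roman").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake; leave the text as it is.
        return text


original = "stationær pc"

# Reproduce the problem: UTF-8 bytes misread as Mac Roman.
garbled = original.encode("utf-8").decode("mac_roman")
print(garbled)  # station√¶r pc

# Reverse it.
repaired = fix_macroman_mojibake(garbled)
print(repaired == original)  # True
```

The better fix is still to read the external files with the right encoding in the first place, so the garbling never happens.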

Rick James