I've imported a CSV into a pandas DataFrame using encoding='latin1' because UTF-8 caused errors. The data imports without error, but I end up with ? characters instead of a more meaningful replacement.

How can I clean up the imported data with more accurate replacement characters? For example, the string 'Mořic' comes through as 'Mo?ic' instead of 'Moric' when I use pd.read_csv('data.csv',delimiter=',',encoding='latin1').

Using this article I've managed to get

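import unicodedata

# this also works with 'niña'
example = 'Mořic'
# NFD splits 'ř' into a plain 'r' plus a combining caron
nfd_example = unicodedata.normalize("NFD", str(example))
print('original: ',nfd_example)
# 'ignore' drops the combining mark, which latin1 cannot encode
print('cleaned: ',nfd_example.encode('latin1', 'ignore'))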

Out:

original:  Mořic
cleaned:  b'Moric'

So I tried to apply this to my dataset using an adapted version of the code in this answer:

with open('data.csv', 'r', encoding='latin1') as f, open('data-fixed.csv', 'wb') as g:
    content = unicodedata.normalize("NFD",f.read())
    g.write(content.encode('latin1','ignore'))

df = pd.read_csv('data-fixed.csv',delimiter=',',encoding='latin1')

This works for 'niña', which comes through as 'nina', but others, e.g. 'Mořic', still come through as 'Mo?ic'.

Pandas 1.3.0 added an encoding_errors parameter to read_csv, but I have an older version and can't use it. Hence I am trying to apply the encode('latin1','ignore') approach from above, but I'm not sure how to apply it to pd.read_csv(). Perhaps there is a better way that doesn't use encode()?
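One possible way to get a similar effect on an older pandas is to do the decoding with Python's built-in open(), which has accepted an errors argument for a long time, and hand the cleaned text to read_csv as a file-like object. The following is only a minimal sketch: the helper name load_clean_text is made up for illustration, and the default of cp1250 is an assumption about the file's real encoding, not something confirmed for data.csv.

```python
import io
import unicodedata


def load_clean_text(path, encoding="cp1250"):
    """Hypothetical helper: decode a file, strip accents, return a text buffer.

    The cp1250 default is an assumption; pass whatever encoding the file
    was really written in.
    """
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising
    with open(path, "r", encoding=encoding, errors="replace") as f:
        text = f.read()
    # NFD splits accented letters into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base letters
    cleaned = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return io.StringIO(cleaned)


# Usage with pandas (read_csv accepts file-like objects):
# df = pd.read_csv(load_clean_text("data.csv"), delimiter=",")
```

Because the accents are removed on the decoded text rather than on latin1-decoded bytes, 'ř' is still 'ř' at the point where NFD is applied, so it can be reduced to 'r'. This only helps if the chosen encoding matches the bytes on disk; if the file already contains literal question marks, the original characters cannot be recovered.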

Maitiu
  • `latin1` contains character `ñ` (U+00F1) _Latin Small Letter N With Tilde_ unlike character `ř` (U+0159) _Latin Small Letter R With Caron_ (`ř` is present in Central European codepages `cp1250` and `cp852`). You see _Question Mark_ as a replacement character… – JosefZ Oct 17 '21 at 18:51
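The point in the comment can be checked directly in Python. This is a small sketch that assumes, purely for illustration, that the bytes on disk were written as cp1250; the real encoding of data.csv is not known from the question.

```python
name = "Mořic"

# Assumption for illustration: the file was saved as cp1250
raw = name.encode("cp1250")

# Decoding the same bytes with the wrong and the right codec
as_latin1 = raw.decode("latin1")   # byte 0xF8 is a different letter in latin1
as_cp1250 = raw.decode("cp1250")   # round-trips to the original string

# latin1 has no slot for U+0159, so encoding to it fails outright
try:
    name.encode("latin1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False

print(repr(raw), as_latin1, as_cp1250, fits_latin1)

# 'ñ' (U+00F1) does exist in latin1, which is why 'niña' survives
print("niña".encode("latin1"))
```

Since latin1 maps every byte to some character, decoding with it never raises, which is why the import "works" while silently producing the wrong letters.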
