I've imported a CSV into a pandas DataFrame using encoding='latin1' because UTF-8 caused errors. The data imports without error, but I end up with ? characters instead of a more meaningful replacement.
How can I clean up the imported data with more accurate replacement characters? For example, the string 'Mořic' comes through as 'Mo?ic' instead of 'Moric' when I use pd.read_csv('data.csv', delimiter=',', encoding='latin1').
Using this article I've managed to get:
import unicodedata

# this also works with 'niña'
example = 'Mořic'
nfd_example = unicodedata.normalize("NFD", example)
print('original:', nfd_example)
print('cleaned:', nfd_example.encode('latin1', 'ignore'))
Out:
original: Mořic
cleaned: b'Moric'
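As an aside, since encode() hands back bytes rather than a string, decoding again gives me a plain str (a small sketch of what I mean):

```python
import unicodedata

# decoding the latin1 bytes again yields 'Moric' as a str, not b'Moric'
example = 'Mořic'
nfd = unicodedata.normalize("NFD", example)
cleaned = nfd.encode('latin1', 'ignore').decode('latin1')
print(cleaned)  # Moric
```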
So I tried to apply this to my dataset using an adapted version of the code in this answer, to give:
with open('data.csv', 'r', encoding='latin1') as f, open('data-fixed.csv', 'wb') as g:
    content = unicodedata.normalize("NFD", f.read())
    g.write(content.encode('latin1', 'ignore'))

df = pd.read_csv('data-fixed.csv', delimiter=',', encoding='latin1')
This works with 'niña', which comes through as 'nina', but others, e.g. 'Mořic', still come through as 'Mo?ic'.
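I suspect the difference is where the '?' comes from: 'ř' (U+0159) has no latin1 code point at all, so if the file was ever written out with something like errors='replace', the '?' is already baked into the bytes, and no amount of normalizing afterwards can recover the original letter. A sketch of my suspicion (the first encode call is my guess at how the file might have been produced, not part of my pipeline):

```python
import unicodedata

# 'ř' is outside latin1, so encoding with errors='replace' bakes a
# literal '?' into the output; nothing downstream can undo that
print('Mořic'.encode('latin1', 'replace'))  # b'Mo?ic'

# NFD only rescues characters whose base letter survives: the combining
# caron is dropped by 'ignore', leaving the plain ASCII 'r'
nfd = unicodedata.normalize('NFD', 'Mořic')
print(nfd.encode('latin1', 'ignore'))  # b'Moric'
```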
Pandas 1.3.0 actually has a new encoding_errors parameter, but since I have an older version I'm unable to use it. Hence I'm trying to apply the encode('latin1', 'ignore') approach from above, but I'm not sure how to apply it to pd.read_csv(). Perhaps there is a better way to do this without using encode()?
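For what it's worth, here's the kind of thing I'm imagining: doing the cleanup in memory and handing the result to pd.read_csv() via io.StringIO, so no intermediate data-fixed.csv is needed. Sketched with an inline sample standing in for the real f.read() on data.csv:

```python
import io
import unicodedata

import pandas as pd

# inline sample standing in for the contents of data.csv
sample = 'name\nniña\nMořic\n'

# normalize, drop the combining marks, and get back a clean str
nfd = unicodedata.normalize('NFD', sample)
cleaned = nfd.encode('latin1', 'ignore').decode('latin1')

# feed the cleaned text straight to read_csv without a temp file
df = pd.read_csv(io.StringIO(cleaned))
print(df['name'].tolist())  # ['nina', 'Moric']
```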