Fix the file:
- Unfortunately, the file is difficult to read because each row contains a
dict, whose key-value pairs are separated by commas.
- The easiest way to resolve the issue, is change the separators outside of each
dict, from , to |.
- The following code will read the existing file
- It assumes, the first row is the header, use
.replace(',', '|')
- Remaining rows will use a regular expression to replace
, outside of {}
- Each line will be written to a new file.
Code:
Data:
Time,location,labelA,labelB
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},{"ack":123,"bar":456},{"foo":123,"bar":456}
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},{"ack":123,"bar":456},{"foo":123,"bar":456}
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},{"ack":123,"bar":456},{"foo":123,"bar":456}
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},{"ack":123,"bar":456},{"foo":123,"bar":456}
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
File repair:
import re
from pathlib import Path
p = Path.cwd() / 'test.csv'
p2 = Path.cwd() / 'test2.csv'
with p.open('r') as f:
with p2.open('w') as f2:
for cnt, line in enumerate(f):
if cnt == 0:
line = line.replace(',', '|')
else:
line = re.sub(r',(?=(((?!\}).)*\{)|[^\{\}]*$)', '|', line)
f2.write(line)
New file:
Time|location|labelA|labelB
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan
Parse the new file:
- Now the columns will be properly separated by
.read_csv
- However, the
location, labelA and labelB columns are str
- Use
ast.literal_eval to convert to dict
literal_eval won't work on nan, so replace nan with {}
for col in df.columns[1:]: loops through each of the columns and:
try-except will catch any columns that are not properly formed
- converts them from
str to dict
- separates the
keys into columns
concats the columns to the existing dataframe
drops the old column
import pandas as pd
from ast import literal_eval
df = pd.read_csv('test2.csv', sep='|')
print(df)
Time location labelA labelB
2019-09-10 {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8} {"ack":123,"bar":456} {"foo":123,"bar":456}
2019-09-10 {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8} NaN NaN
2019-09-10 {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8} {"ack":123,"bar":456} {"foo":123,"bar":456}
2019-09-10 {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8} NaN NaN
2019-09-10 {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8} {"ack":123,"bar":456} {"foo":123,"bar":456}
2019-09-10 {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8} NaN NaN
2019-09-10 {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8} {"ack":123,"bar":456} {"foo":123,"bar":456}
2019-09-10 {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8} NaN NaN
for col in df.columns[1:]:
try:
df[col].fillna('{}', inplace=True)
df[col] = df[col].apply(literal_eval)
df = pd.concat([df, df[col].apply(pd.Series)], axis=1)
df.drop(columns=[col], inplace=True)
except (SyntaxError, ValueError) as e:
print(f'{col}: {e}')
print(df)
Time lng alt time error lat ack bar foo bar
2019-09-10 12.9 413.0 2019-09-10 7.0 17.8 123.0 456.0 123.0 456.0
2019-09-10 12.9 413.0 2019-09-10 7.0 17.8 NaN NaN NaN NaN
2019-09-10 12.9 413.0 2019-09-10 7.0 17.8 123.0 456.0 123.0 456.0
2019-09-10 12.9 413.0 2019-09-10 7.0 17.8 NaN NaN NaN NaN
2019-09-10 12.9 413.0 2019-09-10 7.0 17.8 123.0 456.0 123.0 456.0
2019-09-10 12.9 413.0 2019-09-10 7.0 17.8 NaN NaN NaN NaN
2019-09-10 12.9 413.0 2019-09-10 7.0 17.8 123.0 456.0 123.0 456.0
2019-09-10 12.9 413.0 2019-09-10 7.0 17.8 NaN NaN NaN NaN
Literal Eval Notes:
- Pandas has methods for importing data in many forms, such as
dict or list.
- However,
read_csv doesn't interprete containers (e.g. dict) well, they are interpreted as a string, unless you specify the converters parameter (pd.read_csv('test3.csv', sep='|', converters={'a': literal_eval}).
literal_eval will not work on a column comprised of both containers and strings or NaN, unless the string is only numeric (e.g. '8654')
- Part of the code above, first replaced all
nan with a {} so literal_eval wouldn't have an error.
- Given the following mixed column example:
column_a
{"ack":123,"bar":456}
some string
{"ack":123,"bar":456}
some string
{"ack":123,"bar":456}
some string
literal_eval will throw ValueError: malformed node or string:
- This difference between the two solutions is the other solution fixes one column, whereas this solution was implemented in such a way as to fix all the columns and remove the necessity of reading only the first 100 rows.
- You can forgo the loop to fix all the columns and just fix the
location column, if it is all dicts. Use the following code:
df['location'] = df['location'].apply(literal_eval)
df = pd.concat([df, df['location'].apply(pd.Series)], axis=1)
Note about the actual data:
- the
location column is not formed properly
'{"lng":12.9975201,alt:413.0,"time:""2019-09-10T12:09:58Z""",error:7.0,lat:47.8258582}'
- Here is the expected form:
'{"lng":12.9975201,"alt":413.0,"time":"2019-09-10T12:09:58Z","error":7.0,"lat":47.8258582}'
Fix the location column:
- The
location column is Position in the real data
def fix_pos(x):
word_dict = {'alt': '"alt"',
'"time:"': '"time":',
'"",error:': ',"error":',
'lat': '"lat"'}
for k, v in word_dict.items():
x = x.replace(k, v)
return x
df.Position = df.Position.apply(lambda x: fix_pos(x))
- Use the following loop with the real data file.
Zeit, device, Text & Type don't need to be processed
Position is at index 4.
for col in df.columns[4:]:
try:
df[col].fillna('{}', inplace=True)
df[col] = df[col].apply(literal_eval)
df = pd.concat([df, df[col].apply(pd.Series)], axis=1)
df.drop(columns=[col], inplace=True)
except (SyntaxError, ValueError) as e:
print(f'{col}: {e}')
- The loop that applies
literal_eval to all columns has been updated with try-except
- If there's an
exception the column name and error message will be printed out.
- There are a total of 64 columns in the real data, most of them are Furchtbar.
Errors:
- These are the errors for all the columns in the supplied
csv file.
device: unexpected EOF while parsing (<unknown>, line 1)
Text: malformed node or string: <_ast.Name object at 0x00000203B8473C08>
Typ: malformed node or string: <_ast.Name object at 0x00000203BE217E08>
Data: unexpected EOF while parsing (<unknown>, line 1)
Data1: invalid syntax (<unknown>, line 1)
Data2: invalid syntax (<unknown>, line 1)
Unnamed: 8: invalid syntax (<unknown>, line 1)
Unnamed: 9: unexpected EOF while parsing (<unknown>, line 1)
Unnamed: 10: invalid syntax (<unknown>, line 1)
Unnamed: 11: unexpected EOF while parsing (<unknown>, line 1)
Unnamed: 12: invalid syntax (<unknown>, line 1)
Unnamed: 13: invalid syntax (<unknown>, line 1)
Unnamed: 14: invalid syntax (<unknown>, line 1)
Unnamed: 15: invalid syntax (<unknown>, line 1)
Unnamed: 16: invalid syntax (<unknown>, line 1)
Unnamed: 17: invalid syntax (<unknown>, line 1)
Unnamed: 18: invalid syntax (<unknown>, line 1)
Unnamed: 19: invalid syntax (<unknown>, line 1)
Unnamed: 20: invalid syntax (<unknown>, line 1)
Unnamed: 21: unexpected EOF while parsing (<unknown>, line 1)
Unnamed: 22: invalid syntax (<unknown>, line 1)
Unnamed: 23: invalid syntax (<unknown>, line 1)
Unnamed: 24: invalid syntax (<unknown>, line 1)
Unnamed: 25: invalid syntax (<unknown>, line 1)
Unnamed: 26: invalid syntax (<unknown>, line 1)
Unnamed: 27: invalid syntax (<unknown>, line 1)