Read CSV with JSON feature

Question

I am trying to read a large CSV which includes JSON features (location here). For the first, say 100 lines, the file looks like this:

Time,location,labelA,labelB
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan

I followed this question to parse the location column. The solution basically defines a helper as:

def CustomParser(data):
    import json
    j1 = json.loads(data)
    return j1

and then

df=pd.read_csv('data.csv', nrows=100,converters={'location':CustomParser},header=0)

I get the following error which is related to JSON format:

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Q1: How can I parse the feature location onto new columns?

Q2 (for general case): For nrows>100 in the data, also the last features (labelA and labelB) have JSON formats with different key and value. How can I possibly read the entire CSV with parsing every feature which includes JSON (even partially)?

Trenton McKinney · Answer 1 · 2019-09-18T20:32:24.130

Fix the file:

Unfortunately, the file is difficult to read because each row contains a dict, whose key-value pairs are separated by commas.
The easiest way to resolve the issue, is change the separators outside of each dict, from , to |.
The following code will read the existing file
- It assumes, the first row is the header, use .replace(',', '|')
- Remaining rows will use a regular expression to replace , outside of {}
- Each line will be written to a new file.

Code:

Data:

Time,location,labelA,labelB
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},{"ack":123,"bar":456},{"foo":123,"bar":456}
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},{"ack":123,"bar":456},{"foo":123,"bar":456}
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},{"ack":123,"bar":456},{"foo":123,"bar":456}
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},{"ack":123,"bar":456},{"foo":123,"bar":456}
2019-09-10,{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8},nan,nan

Path.cwd() assumes current working directory, if this is not to case:
- Path('c:/some_path_to_my_file') / 'file_name.poo' can be used
pathlib is part of the standard library
Python 3's pathlib Module: Taming the File System

File repair:

import re
from pathlib import Path

p = Path.cwd() / 'test.csv'
p2 = Path.cwd() / 'test2.csv'

with p.open('r') as f:
    with p2.open('w') as f2:
        for cnt, line in enumerate(f):
            if cnt == 0:
                line = line.replace(',', '|')
            else:
                line = re.sub(r',(?=(((?!\}).)*\{)|[^\{\}]*$)', '|', line)
            f2.write(line)

New file:

Time|location|labelA|labelB
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|{"ack":123,"bar":456}|{"foo":123,"bar":456}
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan

Parse the new file:

Now the columns will be properly separated by .read_csv
However, the location, labelA and labelB columns are str
- Use ast.literal_eval to convert to dict
- literal_eval won't work on nan, so replace nan with {}
for col in df.columns[1:]: loops through each of the columns and:
- try-except will catch any columns that are not properly formed
- converts them from str to dict
- separates the keys into columns
- concats the columns to the existing dataframe
- drops the old column

import pandas as pd
from ast import literal_eval

df = pd.read_csv('test2.csv', sep='|')
print(df)

       Time                                                             location                 labelA                 labelB
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}  {"ack":123,"bar":456}  {"foo":123,"bar":456}
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}                    NaN                    NaN
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}  {"ack":123,"bar":456}  {"foo":123,"bar":456}
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}                    NaN                    NaN
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}  {"ack":123,"bar":456}  {"foo":123,"bar":456}
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}                    NaN                    NaN
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}  {"ack":123,"bar":456}  {"foo":123,"bar":456}
 2019-09-10  {"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}                    NaN                    NaN


for col in df.columns[1:]:
    try:
        df[col].fillna('{}', inplace=True)
        df[col] = df[col].apply(literal_eval)
        df = pd.concat([df, df[col].apply(pd.Series)], axis=1)
        df.drop(columns=[col], inplace=True)
    except (SyntaxError, ValueError) as e:
        print(f'{col}: {e}')


print(df)

       Time   lng    alt        time  error   lat    ack    bar    foo    bar
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8  123.0  456.0  123.0  456.0
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8    NaN    NaN    NaN    NaN
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8  123.0  456.0  123.0  456.0
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8    NaN    NaN    NaN    NaN
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8  123.0  456.0  123.0  456.0
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8    NaN    NaN    NaN    NaN
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8  123.0  456.0  123.0  456.0
 2019-09-10  12.9  413.0  2019-09-10    7.0  17.8    NaN    NaN    NaN    NaN

Literal Eval Notes:

Pandas has methods for importing data in many forms, such as dict or list.
However, read_csv doesn't interprete containers (e.g. dict) well, they are interpreted as a string, unless you specify the converters parameter (pd.read_csv('test3.csv', sep='|', converters={'a': literal_eval}).
literal_eval will not work on a column comprised of both containers and strings or NaN, unless the string is only numeric (e.g. '8654')
Part of the code above, first replaced all nan with a {} so literal_eval wouldn't have an error.
Given the following mixed column example:

column_a
{"ack":123,"bar":456}
some string
{"ack":123,"bar":456}
some string
{"ack":123,"bar":456}
some string

literal_eval will throw ValueError: malformed node or string:
- This difference between the two solutions is the other solution fixes one column, whereas this solution was implemented in such a way as to fix all the columns and remove the necessity of reading only the first 100 rows.
- You can forgo the loop to fix all the columns and just fix the location column, if it is all dicts. Use the following code:

df['location'] = df['location'].apply(literal_eval)
df = pd.concat([df, df['location'].apply(pd.Series)], axis=1)

Note about the actual data:

the location column is not formed properly
- '{"lng":12.9975201,alt:413.0,"time:""2019-09-10T12:09:58Z""",error:7.0,lat:47.8258582}'
Here is the expected form:
- '{"lng":12.9975201,"alt":413.0,"time":"2019-09-10T12:09:58Z","error":7.0,"lat":47.8258582}'

Fix the `location` column:

The location column is Position in the real data

def fix_pos(x):
    word_dict = {'alt': '"alt"',
                 '"time:"': '"time":',
                 '"",error:': ',"error":',
                 'lat': '"lat"'}
    for k, v in word_dict.items():
        x = x.replace(k, v)
    return x

df.Position = df.Position.apply(lambda x: fix_pos(x))

Use the following loop with the real data file.
Zeit, device, Text & Type don't need to be processed
Position is at index 4.

for col in df.columns[4:]:
    try:
        df[col].fillna('{}', inplace=True)
        df[col] = df[col].apply(literal_eval)
        df = pd.concat([df, df[col].apply(pd.Series)], axis=1)
        df.drop(columns=[col], inplace=True)
    except (SyntaxError, ValueError) as e:
        print(f'{col}: {e}')

The loop that applies literal_eval to all columns has been updated with try-except
- If there's an exception the column name and error message will be printed out.
- There are a total of 64 columns in the real data, most of them are Furchtbar.

Errors:

These are the errors for all the columns in the supplied csv file.

device: unexpected EOF while parsing (<unknown>, line 1)
Text: malformed node or string: <_ast.Name object at 0x00000203B8473C08>
Typ: malformed node or string: <_ast.Name object at 0x00000203BE217E08>
Data: unexpected EOF while parsing (<unknown>, line 1)
Data1: invalid syntax (<unknown>, line 1)
Data2: invalid syntax (<unknown>, line 1)
Unnamed: 8: invalid syntax (<unknown>, line 1)
Unnamed: 9: unexpected EOF while parsing (<unknown>, line 1)
Unnamed: 10: invalid syntax (<unknown>, line 1)
Unnamed: 11: unexpected EOF while parsing (<unknown>, line 1)
Unnamed: 12: invalid syntax (<unknown>, line 1)
Unnamed: 13: invalid syntax (<unknown>, line 1)
Unnamed: 14: invalid syntax (<unknown>, line 1)
Unnamed: 15: invalid syntax (<unknown>, line 1)
Unnamed: 16: invalid syntax (<unknown>, line 1)
Unnamed: 17: invalid syntax (<unknown>, line 1)
Unnamed: 18: invalid syntax (<unknown>, line 1)
Unnamed: 19: invalid syntax (<unknown>, line 1)
Unnamed: 20: invalid syntax (<unknown>, line 1)
Unnamed: 21: unexpected EOF while parsing (<unknown>, line 1)
Unnamed: 22: invalid syntax (<unknown>, line 1)
Unnamed: 23: invalid syntax (<unknown>, line 1)
Unnamed: 24: invalid syntax (<unknown>, line 1)
Unnamed: 25: invalid syntax (<unknown>, line 1)
Unnamed: 26: invalid syntax (<unknown>, line 1)
Unnamed: 27: invalid syntax (<unknown>, line 1)

THANK YOU!! I wish I could accept both replies. I used actually your detailed reply to combine with the 2nd answer. Specially your regex, wonderful! The only place which I got to difficulty is df[col] = df[col].apply(literal_eval) which gives me an error for my real data. it is certainly due to the data, however could you please clarify this step? and possibly write an alternative for it? Thanks again. — physiker, Sep 18 '19 at 13:28
I think the main problem is with the Date because some are NaN but from 78 row is json. I will spend time more tomorrow to follow your comments. got the data? so i will delete the link — physiker, Sep 18 '19 at 19:43
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/199650/discussion-between-physiker-and-trenton-m). — physiker, Sep 18 '19 at 19:46

pault · Accepted Answer · 2019-09-18T15:29:49.623

The problem here is that the commas inside your json string are being treated as delimiters. You should modify the input data (if you don't have direct access to the file, you can always read the contents into a list of strings using open first).

Here are a few modification options that you can try:

Option 1: Quote json string with single quote

Use a single quote (or another character that doesn't otherwise appear in your data) as a quote character for your json string.

>> cat data.csv
Time,location,labelA,labelB
2019-09-10,'{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}',nan,nan

Then use quotechar="'" when you read the data:

import pandas as pd
import json

df=pd.read_csv('data.csv', converters={'location':json.loads}, header=0, quotechar="'")

Option 2: Quote json string with double quote and escape

If the single quote can't be used, you can actually use the double quote as the quotechar, as long as your escape the quotes inside the json string:

>> cat data.csv
Time,location,labelA,labelB
2019-09-10,"{""lng"":12.9,""alt"":413.0,""time"":""2019-09-10"",""error"":7.0,""lat"":17.8}",nan,nan

Notice that this now matches the format of the question you linked.

df=pd.read_csv('data.csv', converters={'location':json.loads}, header=0, quotechar='"')

Option 3: Change the delimiter

Use a different character, for example the | as the delimiter

>> cat data.csv
Time|location|labelA|labelB
2019-09-10|{"lng":12.9,"alt":413.0,"time":"2019-09-10","error":7.0,"lat":17.8}|nan|nan

Now use the sep argument to specify the new delimiter:

df=pd.read_csv('data.csv', converters={'location':json.loads}, header=0, sep="|")

Each of these methods produce the same output:

print(df)
#   Time        location                                            labelA  labelB
#0  2019-09-10  {u'lat': 17.8, u'lng': 12.9, u'error': 7.0, u'...   NaN     NaN

Once you have that, you can expand the location column using one of the methods described in Flatten JSON column in a Pandas DataFrame

new_df = df.join(pd.io.json.json_normalize(df["location"])).drop(["location"], axis=1)
print(new_df)
#   Time        labelA  labelB  alt    error  lat   lng   time
#0  2019-09-10  NaN     NaN     413.0  7.0    17.8  12.9  2019-09-10

Thank you very much for your detailed reply. very helpful specially when I combine with the other answer as well. — physiker, Sep 18 '19 at 13:24

Read CSV with JSON feature

2 Answers2

Fix the file:

Code:

Data:

File repair:

New file:

Parse the new file:

Literal Eval Notes:

Note about the actual data:

Fix the `location` column:

Errors:

Linked

Related

Read CSV with JSON feature

2 Answers2

Fix the file:

Code:

Data:

File repair:

New file:

Parse the new file:

Literal Eval Notes:

Note about the actual data:

Fix the location column:

Errors:

Linked

Related

Fix the `location` column: