
I am reading multiple JSON objects into one DataFrame. The problem is that some of the columns contain lists. The data is also very large, so the solutions I have found online are too slow and memory-inefficient for my case.

Here is what my data looks like:

df = pd.DataFrame({'A': ['x1','x2','x3', 'x4'], 'B':[['v1','v2'],['v3','v4'],['v5','v6'],['v7','v8']], 'C':[['c1','c2'],['c3','c4'],['c5','c6'],['c7','c8']],'D':[['d1','d2'],['d3','d4'],['d5','d6'],['d7','d8']], 'E':[['e1','e2'],['e3','e4'],['e5','e6'],['e7','e8']]})
    A       B          C           D           E
0   x1  [v1, v2]    [c1, c2]    [d1, d2]    [e1, e2]
1   x2  [v3, v4]    [c3, c4]    [d3, d4]    [e3, e4]
2   x3  [v5, v6]    [c5, c6]    [d5, d6]    [e5, e6]
3   x4  [v7, v8]    [c7, c8]    [d7, d8]    [e7, e8]

And this is the shape of my data: (441079, 12)

My desired output is:

    A       B          C           D           E
0   x1      v1         c1         d1          e1
0   x1      v2         c2         d2          e2
1   x2      v3         c3         d3          e3
1   x2      v4         c4         d4          e4
.....

EDIT: After this question was marked as a duplicate, I would like to stress that I was looking for an efficient method of exploding multiple columns. The accepted answer can explode an arbitrary number of columns on very large datasets efficiently, something the answers to the other question failed to do (which is why I asked this question after testing those solutions).

Moh
    Yes, if you have `object` dtype columns containing Python `list`s, then everything will be slow and memory inefficient. The problem is likely better solved by never creating such a data-frame in the first place. – juanpa.arrivillaga Aug 23 '17 at 18:16
  • @juanpa.arrivillaga Can I read the JSON files in a different way so as not to create this mess? `pd.read_csv` has the option of defining converters, but I cannot find anything similar for `pd.read_json`. – Moh Aug 23 '17 at 18:21
  • You will likely have to write something which munges your deserialized JSON data into something more manageable. – juanpa.arrivillaga Aug 23 '17 at 18:23
  • @juanpa.arrivillaga to my surprise, the answer is super efficient! – Moh Aug 23 '17 at 18:34
  • Related: [How to unnest (explode) a column in a pandas DataFrame?](https://stackoverflow.com/questions/53218931/how-to-unnest-explode-a-column-in-a-pandas-dataframe) – cs95 Dec 13 '19 at 21:29
  • https://stackoverflow.com/a/63476456/4533188 should be the accepted answer now IMHO. – Make42 Jun 21 '21 at 14:48

6 Answers


pandas >= 0.25

Assuming all columns have lists of the same length on each row, you can call `Series.explode` on each column.

df.set_index(['A']).apply(pd.Series.explode).reset_index()

    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8

The idea is to set as the index all columns that must NOT be exploded first, then reset the index after.
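As noted in the comments, this generalizes to any number of non-list columns: put every column that must survive unexploded into the index. A minimal sketch, assuming a hypothetical extra scalar column `F` alongside `A`:

```python
import pandas as pd

# Hypothetical frame with two scalar columns, A and F:
df = pd.DataFrame({'A': ['x1', 'x2'],
                   'F': [1, 2],
                   'B': [['v1', 'v2'], ['v3', 'v4']],
                   'C': [['c1', 'c2'], ['c3', 'c4']]})

# Put every non-list column in the index so only B and C are exploded.
out = df.set_index(['A', 'F']).apply(pd.Series.explode).reset_index()
```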


It's also faster.

%timeit df.set_index(['A']).apply(pd.Series.explode).reset_index()

%%timeit
(df.set_index('A')
   .apply(lambda x: x.apply(pd.Series).stack())
   .reset_index()
   .drop('level_1', axis=1))


2.22 ms ± 98.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.14 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cs95
    In my tests, this solution does not work when the size of the lists in the different columns is not the same. Otherwise it works like a charm! – bramb Mar 20 '20 at 08:34
  • Indeed, hence the disclaimer in sentence #1. – cs95 Mar 20 '20 at 14:34
  • @cs95 What if you have more than one non list column ..like in above example if i add one more column called F and its has 1,2,3,4 which non list type – Vikram Ranabhatt Mar 21 '20 at 13:12
  • @Chris_vr you can call set_index on all the non-list columns to protect them from the explode() operation. – cs95 Mar 21 '20 at 18:06
  • @cs95 I have SO https://stackoverflow.com/questions/60799274/explode-multiple-coumns-with-different-size-list-columns-in-pandas Could you please see if know the solution – Vikram Ranabhatt Mar 22 '20 at 12:28
  • @Chris_vr your question is a lot less trivial because the list items are not equally sized across columns in the same row. Ben T's answer is probably as good as it'll get. – cs95 Mar 22 '20 at 18:22
  • This answer is pretty nice. It works for same-length columns. If the lengths are not the same, I don't think there will be a standard answer anyway; it depends on how you handle it. Thanks @cs95 – Vaibhav May 28 '20 at 06:34
  • This gives me `ValueError: cannot handle a non-unique multi-index!` However, @Zero's answer below works like a charm. – Sia Mar 10 '21 at 06:20
  • The first solution did not work for me, 2nd works like a charm – Akash Mar 26 '21 at 19:07
  • @Akash Sorry, which solution from my answer didn't work? Could you provide some sample data to debug as well? Thanks. – cs95 Mar 26 '21 at 20:13
  • I believe, size of one of the lists was different in the column that's the reason first solution did not work – Akash Mar 26 '21 at 20:21
import numpy as np
import pandas as pd

def explode(df, lst_cols, fill_value=''):
    # make sure `lst_cols` is a list
    if lst_cols and not isinstance(lst_cols, list):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)

    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()

    if (lens > 0).all():
        # ALL lists in cells are non-empty
        return pd.DataFrame({
            col: np.repeat(df[col].values, lens)
            for col in idx_cols
        }).assign(**{col: np.concatenate(df[col].values) for col in lst_cols}) \
          .loc[:, df.columns]
    else:
        # at least one list in cells is empty;
        # keep those rows and fill the exploded columns with `fill_value`
        res = pd.DataFrame({
            col: np.repeat(df[col].values, lens)
            for col in idx_cols
        }).assign(**{col: np.concatenate(df[col].values) for col in lst_cols})
        return pd.concat([res, df.loc[lens == 0, idx_cols]]) \
                 .fillna(fill_value) \
                 .loc[:, df.columns]

Usage:

In [82]: explode(df, lst_cols=list('BCDE'))
Out[82]:
    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8
MaxU - stop genocide of UA
  • We have this issue: https://github.com/pandas-dev/pandas/issues/8517#issuecomment-178848365. You can post your implementation if you would like; if it's tested we would take it as a PR as well. – Jeff Aug 23 '17 at 22:23

Use `set_index` on `A`, then apply `pd.Series` to each remaining column and stack the values. All of this is condensed into a one-liner.

In [1253]: (df.set_index('A')
              .apply(lambda x: x.apply(pd.Series).stack())
              .reset_index()
              .drop('level_1', axis=1))
Out[1253]:
    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8
Zero
  • I really like this answer for its simplicity. I was trying with similar approach, but couldn't make it work – Vaishali Aug 23 '17 at 18:52
  • This is a great answer! One small suggestion is to combine the last two commands as: `.reset_index(level=1, drop=True)` – bnaul Jun 11 '18 at 18:45
  • That does not work @bnaul, you need an additional: `.reset_index(level=1, drop=True).reset_index()` – Erfan Mar 15 '20 at 01:06
  • This is a generalized form of @cs95's answer on lists of same or different sizes. – 7kemZmani Nov 10 '20 at 06:05

Building on @cs95's answer, we can use an if clause in the lambda function, instead of setting all the other columns as the index. This has the following advantages:

  • Preserves column order
  • Lets you easily specify the columns you want to modify (`x.name in [...]`) or leave unchanged (`x.name not in [...]`).
df.apply(lambda x: x.explode() if x.name in ['B', 'C', 'D', 'E'] else x)

     A   B   C   D   E
0   x1  v1  c1  d1  e1
0   x1  v2  c2  d2  e2
1   x2  v3  c3  d3  e3
1   x2  v4  c4  d4  e4
2   x3  v5  c5  d5  e5
2   x3  v6  c6  d6  e6
3   x4  v7  c7  d7  e7
3   x4  v8  c8  d8  e8
benwshul

As of pandas 1.3.0:

  • DataFrame.explode() now supports exploding multiple columns. Its column argument now also accepts a list of str or tuples for exploding on multiple columns at the same time (GH39240)

What’s new in 1.3.0 (July 2, 2021)


So now this operation is as simple as:

df.explode(['B', 'C', 'D', 'E'])
    A   B   C   D   E
0  x1  v1  c1  d1  e1
0  x1  v2  c2  d2  e2
1  x2  v3  c3  d3  e3
1  x2  v4  c4  d4  e4
2  x3  v5  c5  d5  e5
2  x3  v6  c6  d6  e6
3  x4  v7  c7  d7  e7
3  x4  v8  c8  d8  e8

Or, if you want a unique index:

df.explode(['B', 'C', 'D', 'E'], ignore_index=True)
    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8
Henry Ecker
  • This is nice but it only works when columns have matching element counts. – Cam Sep 20 '21 at 21:48
  • As the docs say "all specified columns their list-like data __on same row__ of the frame must have matching length" [emphasis mine]. However, the [accepted answer](https://stackoverflow.com/a/59330040/15497888), the [explode function](https://stackoverflow.com/a/45846861/15497888), and [apply explode](https://stackoverflow.com/a/63476456/15497888) _also_ requires all list-like data in the same row to be the same length otherwise they raise one error or another. So this is as limited as most other approaches shown here. – Henry Ecker Sep 20 '21 at 22:04
  • Agreed it has the same limitation with less syntax, which is why I like it. Just wish it was easier to work with different length lists on the same row. – Cam Sep 20 '21 at 22:09
  • This operation doesn't really make sense with lists of different lengths. – Denziloe Apr 18 '22 at 20:31
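For rows whose lists differ in length across columns, one possible workaround is to pad each row's lists to a common length first, after which the multi-column `DataFrame.explode` (pandas >= 1.3) accepts them. A sketch, where `pad_and_explode` is a hypothetical helper and padding with `None` is an assumption about how you want missing positions represented:

```python
import pandas as pd

# Hypothetical helper: pad each row's lists with None to a common length,
# then explode; requires pandas >= 1.3 for multi-column explode.
def pad_and_explode(df, cols):
    df = df.copy()
    # longest list in each row across the target columns
    max_len = df[cols].apply(lambda s: s.str.len()).max(axis=1)
    for col in cols:
        df[col] = [lst + [None] * (n - len(lst))
                   for lst, n in zip(df[col], max_len)]
    return df.explode(cols, ignore_index=True)

df = pd.DataFrame({'A': ['x1', 'x2'],
                   'B': [['v1', 'v2', 'v3'], ['v4']],
                   'C': [['c1'], ['c2', 'c3']]})
out = pad_and_explode(df, ['B', 'C'])
```

The padded positions come out as missing values in the result, which you can then fill or drop as appropriate.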

Here is my solution using the `apply` function. Main features/differences:

  1. offers the option to explode selected columns or all columns
  2. offers options to specify the values to fill in the 'missing' positions (through the parameter fill_mode = 'external', 'internal', or 'trim'; the full explanation would be long, so see the examples below and try changing the option yourself to check the results)

Note: the 'trim' option was developed for my own needs and is out of scope for this question.

import pandas as pd

def lenx(x):
    # length of a list, or 1 for a scalar
    return len(x) if isinstance(x, list) else 1

def cell_size_equalize2(row, cols='', fill_mode='internal', fill_value=''):
    jcols = [j for j, v in enumerate(row.index) if v in cols]
    if len(jcols) < 1:
        jcols = range(len(row.index))
    Ls = [lenx(x) for x in row.values]
    if not Ls[:-1] == Ls[1:]:
        vals = [v if isinstance(v, list) else [v] for v in row.values]
        if fill_mode == 'external':
            vals = [[e] + [fill_value]*(max(Ls)-1) if (j not in jcols) and isinstance(row.values[j], list)
                    else e + [fill_value]*(max(Ls)-lenx(e))
                    for j, e in enumerate(vals)]
        elif fill_mode == 'internal':
            vals = [[e] + [e]*(max(Ls)-1) if (j not in jcols) and isinstance(row.values[j], list)
                    else e + [e[-1]]*(max(Ls)-lenx(e))
                    for j, e in enumerate(vals)]
        else:
            vals = [e[0:min(Ls)] for e in vals]
        row = pd.Series(vals, index=row.index.tolist())
    return row

Examples:

df=pd.DataFrame({
    'a':[[1],2,3],
    'b':[[4,5,7],[5,4],4],
    'c':[[4,5],5,[6]]
})
print(df)
df1 = df.apply(cell_size_equalize2, cols='', fill_mode='external', fill_value = "OK", axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'external\', all columns, fill_value = \'OK\'\n', df1)
df2 = df.apply(cell_size_equalize2, cols=['a', 'b'], fill_mode='external', fill_value = "OK", axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'external\', cols = [\'a\', \'b\'], fill_value = \'OK\'\n', df2)
df3 = df.apply(cell_size_equalize2, cols=['a', 'b'], fill_mode='internal', axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'internal\', cols = [\'a\', \'b\']\n', df3)
df4 = df.apply(cell_size_equalize2, cols='', fill_mode='trim', axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'trim\', all columns\n', df4)

Output:

     a          b       c
0  [1]  [4, 5, 7]  [4, 5]
1    2     [5, 4]       5
2    3          4     [6]

fill_mode='external', all columns, fill_value = 'OK'
     a  b   c
0   1  4   4
0  OK  5   5
0  OK  7  OK
1   2  5   5
1  OK  4  OK
2   3  4   6

fill_mode='external', cols = ['a', 'b'], fill_value = 'OK'
     a  b       c
0   1  4  [4, 5]
0  OK  5      OK
0  OK  7      OK
1   2  5       5
1  OK  4      OK
2   3  4       6

fill_mode='internal', cols = ['a', 'b']
    a  b       c
0  1  4  [4, 5]
0  1  5  [4, 5]
0  1  7  [4, 5]
1  2  5       5
1  2  4       5
2  3  4       6

fill_mode='trim', all columns
    a  b  c
0  1  4  4
1  2  5  5
2  3  4  6
nphaibk