
I am reading multiple JSON objects into one DataFrame. The problem is that some of the columns contain lists. The data is also very large, so the solutions I have found online are too slow and memory-inefficient for my case.

Here is what my data looks like:

df = pd.DataFrame({'A': ['x1','x2','x3', 'x4'], 'B':[['v1','v2'],['v3','v4'],['v5','v6'],['v7','v8']], 'C':[['c1','c2'],['c3','c4'],['c5','c6'],['c7','c8']],'D':[['d1','d2'],['d3','d4'],['d5','d6'],['d7','d8']], 'E':[['e1','e2'],['e3','e4'],['e5','e6'],['e7','e8']]})
    A       B          C           D           E
0   x1  [v1, v2]    [c1, c2]    [d1, d2]    [e1, e2]
1   x2  [v3, v4]    [c3, c4]    [d3, d4]    [e3, e4]
2   x3  [v5, v6]    [c5, c6]    [d5, d6]    [e5, e6]
3   x4  [v7, v8]    [c7, c8]    [d7, d8]    [e7, e8]

And this is the shape of my data: (441079, 12)

My desired output is:

    A       B          C           D           E
0   x1      v1         c1         d1          e1
0   x1      v2         c2         d2          e2
1   x2      v3         c3         d3          e3
1   x2      v4         c4         d4          e4
.....

EDIT: After this question was marked as a duplicate, I would like to stress that I was looking for an efficient method of exploding multiple columns. The accepted answer can explode an arbitrary number of columns on very large datasets efficiently, something the answers to the other question failed to do (which is why I asked this question after testing those solutions).

Moh
    Yes, if you have `object` dtype columns containing Python `list`s, then everything will be slow and memory inefficient. The problem is likely better solved by never creating such a data-frame in the first place. – juanpa.arrivillaga Aug 23 '17 at 18:16
  • @juanpa.arrivillaga Can I read the JSON files in a different way so as not to create this mess? `pd.read_csv` has the option of defining converters, but I cannot find anything similar for `pd.read_json`. – Moh Aug 23 '17 at 18:21
  • You will likely have to write something which munges your deserialized JSON data into something more manageable. – juanpa.arrivillaga Aug 23 '17 at 18:23
  • @juanpa.arrivillaga to my surprise, the answer is super efficient! – Moh Aug 23 '17 at 18:34
  • Related: [How to unnest (explode) a column in a pandas DataFrame?](https://stackoverflow.com/questions/53218931/how-to-unnest-explode-a-column-in-a-pandas-dataframe) – cs95 Dec 13 '19 at 21:29
  • https://stackoverflow.com/a/63476456/4533188 should be the accepted answer now IMHO. – Make42 Jun 21 '21 at 14:48

6 Answers


pandas >= 0.25

Assuming all columns have lists of the same length on each row, you can call `Series.explode` on each column.

df.set_index(['A']).apply(pd.Series.explode).reset_index()

    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8

The idea is to set as the index all columns that must NOT be exploded first, then reset the index after.
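As noted in the comments, this generalizes to any number of non-list columns: put every column that must survive unexploded into the index. A minimal sketch, assuming a hypothetical extra scalar column `F` alongside `A`:

```python
import pandas as pd

# Hypothetical frame with two scalar columns, A and F:
df = pd.DataFrame({'A': ['x1', 'x2'],
                   'F': [1, 2],
                   'B': [['v1', 'v2'], ['v3', 'v4']],
                   'C': [['c1', 'c2'], ['c3', 'c4']]})

# Put every non-list column in the index so only B and C are exploded.
out = df.set_index(['A', 'F']).apply(pd.Series.explode).reset_index()
```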


It's also faster.

%timeit df.set_index(['A']).apply(pd.Series.explode).reset_index()

%%timeit
(df.set_index('A')
   .apply(lambda x: x.apply(pd.Series).stack())
   .reset_index()
   .drop('level_1', axis=1))


2.22 ms ± 98.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.14 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cs95
    In my tests, this solution does not work when the size of the lists in the different columns is not the same. Otherwise it works like a charm! – bramb Mar 20 '20 at 08:34
  • Indeed, hence the disclaimer in sentence #1. – cs95 Mar 20 '20 at 14:34
  • @cs95 What if you have more than one non list column ..like in above example if i add one more column called F and its has 1,2,3,4 which non list type – Vikram Ranabhatt Mar 21 '20 at 13:12
  • @Chris_vr you can call set_index on all the non-list columns to protect them from the explode() operation. – cs95 Mar 21 '20 at 18:06
  • @cs95 I have SO https://stackoverflow.com/questions/60799274/explode-multiple-coumns-with-different-size-list-columns-in-pandas Could you please see if know the solution – Vikram Ranabhatt Mar 22 '20 at 12:28
  • @Chris_vr your question is a lot less trivial because the list items are not equally sized across columns in the same row. Ben T's answer is probably as good as it'll get. – cs95 Mar 22 '20 at 18:22
  • This answer is pretty nice. It works for same-length columns. If the lengths are not the same, I don't think there will be a standard answer anyway; it depends on how you handle it. Thanks @cs95 – Vaibhav May 28 '20 at 06:34
  • This gives me `ValueError: cannot handle a non-unique multi-index!` However, @Zero's answer below works like a charm. – Sia Mar 10 '21 at 06:20
  • The first solution did not work for me, 2nd works like a charm – Akash Mar 26 '21 at 19:07
  • @Akash Sorry, which solution from my answer didn't work? Could you provide some sample data to debug as well? Thanks. – cs95 Mar 26 '21 at 20:13
  • I believe, size of one of the lists was different in the column that's the reason first solution did not work – Akash Mar 26 '21 at 20:21
import numpy as np
import pandas as pd

def explode(df, lst_cols, fill_value=''):
    # make sure `lst_cols` is a list
    if lst_cols and not isinstance(lst_cols, list):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)

    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()

    if (lens > 0).all():
        # ALL lists in cells are non-empty
        return pd.DataFrame({
            col: np.repeat(df[col].values, lens)
            for col in idx_cols
        }).assign(**{col: np.concatenate(df[col].values) for col in lst_cols}) \
          .loc[:, df.columns]
    else:
        # at least one list in cells is empty;
        # keep those rows and fill the exploded columns with `fill_value`
        res = pd.DataFrame({
            col: np.repeat(df[col].values, lens)
            for col in idx_cols
        }).assign(**{col: np.concatenate(df[col].values) for col in lst_cols})
        return pd.concat([res, df.loc[lens == 0, idx_cols]]) \
                 .fillna(fill_value) \
                 .loc[:, df.columns]

Usage:

In [82]: explode(df, lst_cols=list('BCDE'))
Out[82]:
    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8
MaxU - stop genocide of UA
  • We have this issue: https://github.com/pandas-dev/pandas/issues/8517#issuecomment-178848365. You can post your implementation if you would like; if it's tested we would take it as a PR as well. – Jeff Aug 23 '17 at 22:23

Use `set_index` on `A`, then apply `pd.Series` to each remaining column and stack the values. All of this is condensed into a one-liner.

In [1253]: (df.set_index('A')
              .apply(lambda x: x.apply(pd.Series).stack())
              .reset_index()
              .drop('level_1', axis=1))
Out[1253]:
    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8
Zero
  • I really like this answer for its simplicity. I was trying with similar approach, but couldn't make it work – Vaishali Aug 23 '17 at 18:52
  • This is a great answer! One small suggestion is to combine the last two commands as: `.reset_index(level=1, drop=True)` – bnaul Jun 11 '18 at 18:45
  • That does not work @bnaul, you need an additional: `.reset_index(level=1, drop=True).reset_index()` – Erfan Mar 15 '20 at 01:06
  • This is a generalized form of @cs95's answer on lists of same or different sizes. – 7kemZmani Nov 10 '20 at 06:05

Building on @cs95's answer, we can use an if clause in the lambda function, instead of setting all the other columns as the index. This has the following advantages:

  • Preserves column order
  • Lets you easily specify the columns you want to modify (`x.name in [...]`) or leave unchanged (`x.name not in [...]`).
df.apply(lambda x: x.explode() if x.name in ['B', 'C', 'D', 'E'] else x)

     A   B   C   D   E
0   x1  v1  c1  d1  e1
0   x1  v2  c2  d2  e2
1   x2  v3  c3  d3  e3
1   x2  v4  c4  d4  e4
2   x3  v5  c5  d5  e5
2   x3  v6  c6  d6  e6
3   x4  v7  c7  d7  e7
3   x4  v8  c8  d8  e8
benwshul

As of pandas 1.3.0:

  • DataFrame.explode() now supports exploding multiple columns. Its column argument now also accepts a list of str or tuples for exploding on multiple columns at the same time (GH39240)

What’s new in 1.3.0 (July 2, 2021)


So now this operation is as simple as:

df.explode(['B', 'C', 'D', 'E'])
    A   B   C   D   E
0  x1  v1  c1  d1  e1
0  x1  v2  c2  d2  e2
1  x2  v3  c3  d3  e3
1  x2  v4  c4  d4  e4
2  x3  v5  c5  d5  e5
2  x3  v6  c6  d6  e6
3  x4  v7  c7  d7  e7
3  x4  v8  c8  d8  e8

Or, if you want a unique index:

df.explode(['B', 'C', 'D', 'E'], ignore_index=True)
    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8
Henry Ecker
  • This is nice but it only works when columns have matching element counts. – Cam Sep 20 '21 at 21:48
  • As the docs say "all specified columns their list-like data __on same row__ of the frame must have matching length" [emphasis mine]. However, the [accepted answer](https://stackoverflow.com/a/59330040/15497888), the [explode function](https://stackoverflow.com/a/45846861/15497888), and [apply explode](https://stackoverflow.com/a/63476456/15497888) _also_ requires all list-like data in the same row to be the same length otherwise they raise one error or another. So this is as limited as most other approaches shown here. – Henry Ecker Sep 20 '21 at 22:04
  • Agreed it has the same limitation with less syntax, which is why I like it. Just wish it was easier to work with different length lists on the same row. – Cam Sep 20 '21 at 22:09
  • This operation doesn't really make sense with lists of different lengths. – Denziloe Apr 18 '22 at 20:31
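For rows whose lists differ in length across columns, one possible workaround is to pad each row's lists to a common length first, after which the multi-column `DataFrame.explode` (pandas >= 1.3) accepts them. A sketch, where `pad_and_explode` is a hypothetical helper and padding with `None` is an assumption about how you want missing positions represented:

```python
import pandas as pd

# Hypothetical helper: pad each row's lists with None to a common length,
# then explode; requires pandas >= 1.3 for multi-column explode.
def pad_and_explode(df, cols):
    df = df.copy()
    # longest list in each row across the target columns
    max_len = df[cols].apply(lambda s: s.str.len()).max(axis=1)
    for col in cols:
        df[col] = [lst + [None] * (n - len(lst))
                   for lst, n in zip(df[col], max_len)]
    return df.explode(cols, ignore_index=True)

df = pd.DataFrame({'A': ['x1', 'x2'],
                   'B': [['v1', 'v2', 'v3'], ['v4']],
                   'C': [['c1'], ['c2', 'c3']]})
out = pad_and_explode(df, ['B', 'C'])
```

The padded positions come out as missing values in the result, which you can then fill or drop as appropriate.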

Here is my solution using the `apply` function. Main features/differences:

  1. offers the option to explode selected columns or all columns
  2. offers options to specify the values to fill in the 'missing' positions (through the parameter fill_mode = 'external', 'internal', or 'trim'; the full explanation would be long, so see the examples below and try changing the option yourself to check the results)

Note: the 'trim' option was developed for my own needs and is out of scope for this question.

import pandas as pd

def lenx(x):
    # length of a list, or 1 for a scalar
    return len(x) if isinstance(x, list) else 1

def cell_size_equalize2(row, cols='', fill_mode='internal', fill_value=''):
    jcols = [j for j, v in enumerate(row.index) if v in cols]
    if len(jcols) < 1:
        jcols = range(len(row.index))
    Ls = [lenx(x) for x in row.values]
    if not Ls[:-1] == Ls[1:]:
        vals = [v if isinstance(v, list) else [v] for v in row.values]
        if fill_mode == 'external':
            vals = [[e] + [fill_value]*(max(Ls)-1) if (j not in jcols) and isinstance(row.values[j], list)
                    else e + [fill_value]*(max(Ls)-lenx(e))
                    for j, e in enumerate(vals)]
        elif fill_mode == 'internal':
            vals = [[e] + [e]*(max(Ls)-1) if (j not in jcols) and isinstance(row.values[j], list)
                    else e + [e[-1]]*(max(Ls)-lenx(e))
                    for j, e in enumerate(vals)]
        else:
            vals = [e[0:min(Ls)] for e in vals]
        row = pd.Series(vals, index=row.index.tolist())
    return row

Examples:

df=pd.DataFrame({
    'a':[[1],2,3],
    'b':[[4,5,7],[5,4],4],
    'c':[[4,5],5,[6]]
})
print(df)
df1 = df.apply(cell_size_equalize2, cols='', fill_mode='external', fill_value = "OK", axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'external\', all columns, fill_value = \'OK\'\n', df1)
df2 = df.apply(cell_size_equalize2, cols=['a', 'b'], fill_mode='external', fill_value = "OK", axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'external\', cols = [\'a\', \'b\'], fill_value = \'OK\'\n', df2)
df3 = df.apply(cell_size_equalize2, cols=['a', 'b'], fill_mode='internal', axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'internal\', cols = [\'a\', \'b\']\n', df3)
df4 = df.apply(cell_size_equalize2, cols='', fill_mode='trim', axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'trim\', all columns\n', df4)

Output:

     a          b       c
0  [1]  [4, 5, 7]  [4, 5]
1    2     [5, 4]       5
2    3          4     [6]

fill_mode='external', all columns, fill_value = 'OK'
     a  b   c
0   1  4   4
0  OK  5   5
0  OK  7  OK
1   2  5   5
1  OK  4  OK
2   3  4   6

fill_mode='external', cols = ['a', 'b'], fill_value = 'OK'
     a  b       c
0   1  4  [4, 5]
0  OK  5      OK
0  OK  7      OK
1   2  5       5
1  OK  4      OK
2   3  4       6

fill_mode='internal', cols = ['a', 'b']
    a  b       c
0  1  4  [4, 5]
0  1  5  [4, 5]
0  1  7  [4, 5]
1  2  5       5
1  2  4       5
2  3  4       6

fill_mode='trim', all columns
    a  b  c
0  1  4  4
1  2  5  5
2  3  4  6
nphaibk