284

I have a Pandas DataFrame with one column:

df = pd.DataFrame({"teams": [["SF", "NYG"] for _ in range(7)]})

       teams
0  [SF, NYG]
1  [SF, NYG]
2  [SF, NYG]
3  [SF, NYG]
4  [SF, NYG]
5  [SF, NYG]
6  [SF, NYG]

How can I split this column of lists into two columns?

Desired result:

  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG
bgame2498

10 Answers

404

You can use the DataFrame constructor with the list of lists created by Series.tolist:

import pandas as pd

d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
                ['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
df2 = pd.DataFrame(d1)
print(df2)
       teams
0  [SF, NYG]
1  [SF, NYG]
2  [SF, NYG]
3  [SF, NYG]
4  [SF, NYG]
5  [SF, NYG]
6  [SF, NYG]

df2[['team1','team2']] = pd.DataFrame(df2.teams.tolist(), index=df2.index)
print(df2)
       teams team1 team2
0  [SF, NYG]    SF   NYG
1  [SF, NYG]    SF   NYG
2  [SF, NYG]    SF   NYG
3  [SF, NYG]    SF   NYG
4  [SF, NYG]    SF   NYG
5  [SF, NYG]    SF   NYG
6  [SF, NYG]    SF   NYG

And for a new DataFrame:

df3 = pd.DataFrame(df2['teams'].to_list(), columns=['team1','team2'])
print(df3)
  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG

A solution with apply(pd.Series) is very slow:

#7k rows
df2 = pd.concat([df2]*1000).reset_index(drop=True)

In [121]: %timeit df2['teams'].apply(pd.Series)
1.79 s ± 52.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [122]: %timeit pd.DataFrame(df2['teams'].to_list(), columns=['team1','team2'])
1.63 ms ± 54.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
jezrael
  • 4
    Minor caveat: if you are using it on an existing dataframe, make sure to reset the index, otherwise it will not assign correctly. – user1700890 Nov 06 '17 at 15:16
  • 2
    @user1700890 - yes, or specify index in DataFrame constructor `df2[['team1','team2']] = pd.DataFrame(df2.teams.values.tolist(), index= df2.index)` – jezrael Nov 06 '17 at 15:18
  • It seems to me that `apply()` is likely to be slow for the most part. Should I avoid using the function if I have alternatives? – Catbuilts Nov 20 '18 at 11:08
  • 1
    @Catbuilts - yes, if a vectorized solution exists, it is best to avoid it. – jezrael Nov 20 '18 at 11:08
  • @jezrael: Thanks for the advice. Btw, what do you mean by *vectorized solution*? `numpy` is one way to vectorize, right? What else can be considered a vectorized solution? Thanks – Catbuilts Nov 20 '18 at 11:19
  • 1
    @Catbuilts - yes, obviously. Vectorized generally means no loops, so no apply, no for loops, no list comprehensions. But it depends on what exactly is needed. [This](https://stackoverflow.com/questions/24870953/does-iterrows-have-performance-issues/24871316#24871316) may also help. – jezrael Nov 20 '18 at 11:21
  • 2
    @Catbuilts Indeed `apply()` might be slower, but it is the go-to method when the values are not of equal length across rows of the original Series! – CheTesta Feb 11 '19 at 09:31
  • Here you know what the list length is and supply the new column names. What do you do if the list length varies? Do you find the max length of a list within the column then add enough columns for that and find a way to both generate new names and assign values for each row across as many columns as applicable? Do you know of an example of how to solve this please? +1 btw – QHarr Jun 22 '19 at 08:00
  • 1
    @QHarr - Then use `pd.DataFrame(df2['teams'].values.tolist()).add_prefix('col')` – jezrael Jun 22 '19 at 08:02
  • @jezrael Can you update this to reflect the fact that Series have their own [`to_list()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.to_list.html) method? – AMC May 01 '20 at 11:21
  • @AMC - Yop, there is also [tolist](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.tolist.html?highlight=tolist#pandas.Series.tolist) – jezrael May 01 '20 at 11:30
  • @jezrael I believe that `to_list` is currently just an alias for `tolist` anyway. – AMC May 01 '20 at 11:32
  • @AMC - hmmm, is there some reason to prioritize that one? What do you think? Btw, downvoting, if that was you, was not so nice :( – jezrael May 01 '20 at 11:34
  • @AMC - Also not [this](https://stackoverflow.com/a/61538011/2901002) ? – jezrael May 01 '20 at 11:39
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/212915/discussion-between-amc-and-jezrael). – AMC May 01 '20 at 11:40
  • 1
    Read your post regarding the performance of `df.apply(pd.Series)`, Nice post, Thanks for suggesting the post, its always good to learn stuff from you +1. – Shubham Sharma Jun 04 '20 at 13:41
  • @jezrael Using `tolist` vs `apply` seems to be converting my `np.nan` into None? It seems I can't do a simple replace df.replace(None, np.nan) – MasayoMusic Jul 16 '20 at 05:49
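A minimal sketch of the variable-length case raised in the comments above (the example lists and the 'team' prefix here are made up, not part of the answer):

# Sketch only: shorter lists are padded with missing values,
# and add_prefix names the columns team0, team1, team2, ...
import pandas as pd

df_var = pd.DataFrame({'teams': [['SF', 'NYG'], ['SF'], ['SF', 'NYG', 'GB']]})
expanded = pd.DataFrame(df_var['teams'].tolist(), index=df_var.index).add_prefix('team')
print(expanded)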
89

Much simpler solution:

pd.DataFrame(df2["teams"].to_list(), columns=['team1', 'team2'])

This yields:

  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG

If you wanted to split a column of delimited strings rather than lists, you could similarly do:

pd.DataFrame(df["teams"].str.split('<delim>', expand=True).values,
             columns=['team1', 'team2'])
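
For instance, a minimal sketch of the string case, assuming a hypothetical '-' delimiter (as a comment below notes, str.split(..., expand=True) already returns a DataFrame, so renaming its columns is an alternative to wrapping it in a new constructor call):

# Sketch with an assumed '-' delimiter; column names match the question.
import pandas as pd

s = pd.Series(['SF-NYG'] * 7, name='teams')
out = s.str.split('-', expand=True).rename(columns={0: 'team1', 1: 'team2'})
print(out)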
Joe Davison
  • 22
    what if each list has uneven number of elements? – ikel Nov 03 '19 at 17:00
  • 5
    _If you wanted to split a column of delimited strings rather than lists, you could similarly do:_ `df["teams"].str.split('', expand=True)` already returns a DataFrame, so it would probably be simpler to just rename the columns. – AMC May 01 '20 at 11:11
  • @AMC Your solution is exactly what I've been looking for. Thanks! – Huy Truong Jan 27 '22 at 18:11
57

This solution preserves the index of the df2 DataFrame, unlike any solution that uses tolist():

df3 = df2.teams.apply(pd.Series)
df3.columns = ['team1', 'team2']

Here's the result:

  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG
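
To illustrate the index claim, here is a small sketch with a made-up non-default index (the labels are only for demonstration):

# Sketch with an arbitrary string index to show it is carried over.
import pandas as pd

df2 = pd.DataFrame({'teams': [['SF', 'NYG']] * 3}, index=['a', 'b', 'c'])
df3 = df2.teams.apply(pd.Series)
df3.columns = ['team1', 'team2']
print(df3)   # rows keep the labels 'a', 'b', 'c'

Note that the accepted answer achieves the same alignment by passing index=df2.index to the DataFrame constructor, as mentioned in its comments.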
Kevin Markham
  • 9
    Also one of the slowest `apply` you can do in pandas. You should avoid this method and use the accepted answer. In the timings of the top answer, this method is approx `1400 x` slower @rajan – Erfan Oct 17 '19 at 11:43
  • 9
    @Erfan Yes, but sometimes the user doesn't care whether an operation takes 1s or 1ms, and instead they care most about writing the simplest, most readable code! I acknowledge that readability/simplicity is subjective, but my point is simply that speed is not a priority for all users at all times. – Kevin Markham Oct 17 '19 at 13:17
  • 2
    Furthermore, I found out that the `apply` method works more reliably for expanding large arrays (1000+ items) on large data sets. The `tolist()` method killed my process when the data set exceeded 500k rows. – moritz Jan 22 '20 at 15:58
  • 6
    This is a great solution because it works well with lists of different sizes. – dasilvadaniel Feb 16 '20 at 19:47
  • 2
    @KevinMarkham _they care most about writing the simplest, most readable code_ Is `pd.DataFrame(df["teams"].to_list(), columns=["team_1", "team_2"])` really so much more complicated? – AMC May 01 '20 at 11:17
  • The only solution that worked for me when my data frame had nans. – Fijoy Vadakkumpadan Oct 19 '21 at 22:11
  • The .tolist() approach failed with my dataframe. My dataframe had mixed contents in the column of integers and lists. The .tolist() method did not break the lists into new columns. – Foggy Apr 25 '22 at 17:06
20

There seems to be a syntactically simpler way, and therefore one that is easier to remember, than the proposed solutions. I'm assuming that the column is called 'meta' in a dataframe df:

df2 = pd.DataFrame(df['meta'].str.split().values.tolist())
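
If 'meta' already holds lists rather than delimited strings (the situation in the question), the str.split() step is not needed, as a comment below also notes. A minimal sketch of that variant:

# Sketch for a list-valued 'meta' column; columns are left as 0, 1, ...
import pandas as pd

df = pd.DataFrame({'meta': [['SF', 'NYG']] * 7})
df2 = pd.DataFrame(df['meta'].values.tolist())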
mikkokotila
  • 1
    I got an error but I resolved it by removing the `str.split()`. This was much simpler and has the advantage if you don't know the number of items in your list. – otteheng Jan 11 '18 at 16:29
  • _There seems to be a syntactically simpler way, and therefore easier to remember, as opposed to the proposed solutions._ Really? Because this is practically identical to the top answer, which was posted years earlier. The only difference is the part that isn't related to this specific question. – AMC May 01 '20 at 11:18
  • in many cases, you can replace `.values.tolist()` with simply `.to_numpy()`, which Pandas also recommends. While you're at it, you can slap a `.astype(int)` or whatever if you need – crypdick Oct 08 '20 at 20:45
10

The previous solutions didn't work for me since I have NaN observations in my dataframe. In my case `df2[['team1','team2']] = pd.DataFrame(df2.teams.values.tolist(), index=df2.index)` yields:

object of type 'float' has no len()

I solved this using a list comprehension. Here is a replicable example:

import pandas as pd
import numpy as np
d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
                ['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
df2 = pd.DataFrame(d1)
df2.loc[2,'teams'] = np.nan
df2.loc[4,'teams'] = np.nan
df2

Output:

       teams
0  [SF, NYG]
1  [SF, NYG]
2        NaN
3  [SF, NYG]
4        NaN
5  [SF, NYG]
6  [SF, NYG]

df2['team1'] = np.nan
df2['team2'] = np.nan

Solving with a list comprehension,

for i in [0, 1]:
    df2['team{}'.format(str(i+1))] = [k[i] if isinstance(k, list) else k for k in df2['teams']]

df2

yields:

       teams team1 team2
0  [SF, NYG]    SF   NYG
1  [SF, NYG]    SF   NYG
2        NaN   NaN   NaN
3  [SF, NYG]    SF   NYG
4        NaN   NaN   NaN
5  [SF, NYG]    SF   NYG
6  [SF, NYG]    SF   NYG
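
An alternative sketch (not from the answer): expand only the non-NaN rows and join the result back on the index, so the NaN rows simply stay NaN.

# Sketch only: expand the non-NaN rows, then align back on the index.
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'teams': [['SF', 'NYG'], np.nan, ['SF', 'NYG']]})
valid = df2['teams'].dropna()
expanded = pd.DataFrame(valid.tolist(), index=valid.index, columns=['team1', 'team2'])
print(df2.join(expanded))   # row 1 keeps NaN in team1/team2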
Lucas
8

List comprehension

A simple implementation with a list comprehension (my favorite):

df = pd.DataFrame([pd.Series(x) for x in df.teams])
df.columns = ['team_{}'.format(x+1) for x in df.columns]

Timing output:

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 2.71 ms

Output:

  team_1 team_2
0     SF    NYG
1     SF    NYG
2     SF    NYG
3     SF    NYG
4     SF    NYG
5     SF    NYG
6     SF    NYG
Talis
6

Here's another solution using Series.transform together with operator.itemgetter:

>>> from operator import itemgetter
>>> df['teams'].transform({'team1': itemgetter(0), 'team2': itemgetter(1)})

  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG

Which of course can be generalized as:

>>> indices = range(len(df['teams'][0]))

>>> df['teams'].transform({f'team{i+1}': itemgetter(i) for i in indices})

  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG

This approach has the added benefit of extracting the desired indices:

>>> df
                 teams
0  [SF, NYG, XYZ, ABC]
1  [SF, NYG, XYZ, ABC]
2  [SF, NYG, XYZ, ABC]
3  [SF, NYG, XYZ, ABC]
4  [SF, NYG, XYZ, ABC]
5  [SF, NYG, XYZ, ABC]
6  [SF, NYG, XYZ, ABC]

>>> indices = [0, 2]
>>> df['teams'].transform({f'team{i+1}': itemgetter(i) for i in indices})

  team1 team3
0    SF   XYZ
1    SF   XYZ
2    SF   XYZ
3    SF   XYZ
4    SF   XYZ
5    SF   XYZ
6    SF   XYZ
Sayandip Dutta
5

Based on the previous answers, here is another solution which returns the same result as df2.teams.apply(pd.Series) with a much faster run time:

pd.DataFrame([{x: y for x, y in enumerate(item)} for item in df2['teams'].values.tolist()], index=df2.index)

Timings:

In [1]:
import pandas as pd
d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
                ['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
df2 = pd.DataFrame(d1)
df2 = pd.concat([df2]*1000).reset_index(drop=True)

In [2]: %timeit df2['teams'].apply(pd.Series)

8.27 s ± 2.73 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit pd.DataFrame([{x: y for x, y in enumerate(item)} for item in df2['teams'].values.tolist()], index=df2.index)

35.4 ms ± 5.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
ailurid
3

I would like to recommend a more efficient and Pythonic way.

First, define the DataFrame as in the original post:

df = pd.DataFrame({"teams": [["SF", "NYG"] for _ in range(7)]})

My solution:

%%timeit
df['team1'], df['team2'] = zip(*list(df['teams'].values))
>> 761 µs ± 8.35 µs per loop

In comparison, the most upvoted solution:

%%timeit
df[['team1','team2']] = pd.DataFrame(df.teams.tolist(), index=df.index)
df = pd.DataFrame(df['teams'].to_list(), columns=['team1','team2'])
>> 1.31 ms ± 11.2 µs per loop

My solution saves about 40% of the time and is shorter. The only thing you need to remember is how to unpack and transpose a two-dimensional list using zip(*list).
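
For reference, a plain-Python sketch of what zip(*...) does to a two-dimensional list here:

# zip(*rows) transposes the rows into per-column tuples.
rows = [['SF', 'NYG'], ['SF', 'NYG'], ['SF', 'NYG']]
team1, team2 = zip(*rows)
print(team1)  # ('SF', 'SF', 'SF')
print(team2)  # ('NYG', 'NYG', 'NYG')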

W. Ding
-1

You can apply a lambda twice to create the new columns 'team1' and 'team2' in your df:

df = pd.DataFrame({"teams": [["SF", "NYG"] for _ in range(7)]})
df["team1"]=df['teams'].apply(lambda x: x[0]  )
df["team2"]=df['teams'].apply(lambda x: x[1]  )
df

(Screenshot of the result: the original teams column plus the new team1 and team2 columns.)

lu di