
I would like to take a dataframe and duplicate certain rows. One column, called name, may contain multiple comma-separated names. An example dataframe is constructed below:

import pandas as pd

data = [
    ['Joe', '17-11-2018', '2'],
    ['Karen', '17-11-2018', '4'],
    ['Bill, Avery', '17-11-2018', '6'],
    ['Sam', '18-11-2018', '4'],
    ['Alex, Frank', '18-11-2018', '6'],
    ['Chris', '18-11-2018', '8'],
]
df = pd.DataFrame(data, columns = ['name','date','number'])

This yields the following dataframe:

          name        date number
0          Joe  17-11-2018      2
1        Karen  17-11-2018      4
2  Bill, Avery  17-11-2018      6
3          Sam  18-11-2018      4
4  Alex, Frank  18-11-2018      6
5        Chris  18-11-2018      8

I would like to take all rows where there are multiple names (comma-separated) and duplicate them for each individual name. The resulting dataframe should look like this:

    name        date number
0    Joe  17-11-2018      2
1  Karen  17-11-2018      4
2   Bill  17-11-2018      6
3  Avery  17-11-2018      6
4    Sam  18-11-2018      4
5   Alex  18-11-2018      6
6  Frank  18-11-2018      6
7  Chris  18-11-2018      8
Jack Walsh

3 Answers


After str.split, this becomes an unnest problem:

# Split on ', ' so the second name doesn't keep a leading space.
df['name'] = df.name.str.split(', ')

unnesting(df,['name'])
Out[97]: 
     name        date number
0     Joe  17-11-2018      2
1   Karen  17-11-2018      4
2    Bill  17-11-2018      6
2   Avery  17-11-2018      6
3     Sam  18-11-2018      4
4    Alex  18-11-2018      6
4   Frank  18-11-2018      6
5   Chris  18-11-2018      8

import numpy as np
import pandas as pd

def unnesting(df, explode):
    # Repeat each row's index once per element in the exploded list column.
    idx = df.index.repeat(df[explode[0]].str.len())
    # Flatten every list column into one long column.
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    # drop(columns=...) replaces the positional axis argument removed in newer pandas.
    return df1.join(df.drop(columns=explode), how='left')
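As an aside, on pandas 0.25 or newer the unnest step is built in as DataFrame.explode, so a minimal sketch (assuming a recent pandas version) can skip the helper entirely:

import pandas as pd

df = pd.DataFrame({
    'name': ['Joe', 'Bill, Avery'],
    'date': ['17-11-2018', '17-11-2018'],
    'number': ['2', '6'],
})

# Split into lists, then expand each list element onto its own row.
df['name'] = df['name'].str.split(', ')
print(df.explode('name').reset_index(drop=True))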
BENY
  • I receive `ValueError: count < 0` as a response – Jack Walsh May 14 '19 at 23:00
  • @JackWalsh did you try the method with the sample data you provided? If it works there but fails with your actual data, you should tell us what the difference is between the sample data and your real data – BENY May 14 '19 at 23:03
  • I realized that the difference is that my dataset has some null values in the `name` equivalent column. I have implemented my own solution based off of yours, and for this reason will make this as the answer. – Jack Walsh May 14 '19 at 23:33
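Since the comment thread traces the `ValueError` back to null values in the name column, a small guard like the following (a sketch, not part of the original answer) keeps missing names from producing NaN lengths:

# NaN has no string length, so index.repeat would receive NaN and fail.
# Filling missing names with an empty string keeps those rows, each of length 1.
df['name'] = df['name'].fillna('').str.split(', ')
unnesting(df, ['name'])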

Jack, I don't use dataframes much, but the following code should work if run before df = pd.DataFrame(data, columns=['name','date','number']):

new_data = []
for item in data:
    if "," in item[0]:
        # Emit one row per comma-separated name, keeping date and number.
        for name in item[0].split(", "):
            new_data.append([name, item[1], item[2]])
    else:
        new_data.append(item)
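The expanded list can then be passed to the same constructor as in the question:

df = pd.DataFrame(new_data, columns=['name', 'date', 'number'])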
TechPerson
  • This would've worked well, but the dataframe I provided was an example. My real dataset is not created in the same way. I was able to use an approach similar to this using `.iterrows()` – Jack Walsh May 14 '19 at 23:34
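Since the follow-up comment mentions solving this with `.iterrows()`, a version along those lines might look like the following (a hypothetical sketch; the asker's actual code isn't shown):

import pandas as pd

rows = []
for _, row in df.iterrows():
    # Rows whose name is not a string (e.g. NaN) pass through unchanged.
    names = row['name'].split(', ') if isinstance(row['name'], str) else [row['name']]
    for name in names:
        rows.append([name, row['date'], row['number']])

df_expanded = pd.DataFrame(rows, columns=['name', 'date', 'number'])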

For a string column with a separator, you can use the following function, found in this answer:

import numpy as np

def explode_str(df, col, sep):
    s = df[col]
    # Repeat each positional index once per separator in the row, plus one.
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    # Join everything into one string, split it back apart, and reassign.
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})

explode_str(df, 'name', ',')

     name        date number
0     Joe  17-11-2018      2
1   Karen  17-11-2018      4
2    Bill  17-11-2018      6
2   Avery  17-11-2018      6
3     Sam  18-11-2018      4
4    Alex  18-11-2018      6
4   Frank  18-11-2018      6
5   Chris  18-11-2018      8
Erfan
  • I receive the following error... `ValueError: count < 0` on the line which defines `i` – Jack Walsh May 14 '19 at 22:57
  • On your original data or your example dataframe? Because i'm not getting an error @JackWalsh – Erfan May 14 '19 at 23:02
  • I realized my original dataframe had a null value that was giving me an issue. While your solution and @WeNYoBen's were very similar, his was clearer to me. – Jack Walsh May 14 '19 at 23:36
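This works by repeating each positional index once per separator and reassigning the split values in one pass; the `ValueError` in the comments comes from NaN entries, for which str.count(sep) returns NaN. A null-tolerant variant (a sketch, not from the original answer) could fill missing values first:

import numpy as np

def explode_str_safe(df, col, sep):
    # Empty strings keep the per-row counts as non-negative integers.
    s = df[col].fillna('')
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})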