11

I have a data set of movies which has 28 columns. One of them is genres. For each row in this data set, the value for column genres is of the form "Action|Animation|Comedy|Family|Fantasy". I want to encode them using pandas.get_dummies() but since the columns have multiple values, how to deal with such conditions? Additinal information on below link(question moved from stackoverflow) https://stackoverflow.com/q/40331558/4028904

aks_Nin
  • 111
  • 1
  • 1
  • 4
  • I just want to ask whether in this case if we are doing a linear regression(with these as independent variables) should we drop one of the dummy variables or not (sorry cant comment due to my reputation ) – Ankit Kumar Jun 29 '18 at 19:46

1 Answers1

12

I'm starting with the following dataset:

import pandas as pd
data = pd.DataFrame({'title': ['Avatar', 'Pirates', 'Spectre', 'Batman'],
                 'genres': ['Action|Adventure|Fantasy|Sci-Fi',
                            'Action|Adventure|Fantasy',
                            'Action|Adventure|Thriller',
                            'Action|Thriller']},
                columns=['title', 'genres'])


     title                           genres
0   Avatar  Action|Adventure|Fantasy|Sci-Fi
1  Pirates         Action|Adventure|Fantasy
2  Spectre        Action|Adventure|Thriller
3   Batman                  Action|Thriller

First, you want to have your data in a structure pairing titles with one genre at a time, multiple rows per title. You can get it in a series like this:

cleaned = data.set_index('title').genres.str.split('|', expand=True).stack()


title
Avatar   0       Action
         1    Adventure
         2      Fantasy
         3       Sci-Fi
Pirates  0       Action
         1    Adventure
         2      Fantasy
Spectre  0       Action
         1    Adventure
         2     Thriller
Batman   0       Action
         1     Thriller
dtype: object

(There's an extra index level that we don't want, but we'll get rid of it soon.) get_dummies will now work, but it only works on one row at a time, so we need to re-aggregate the titles:

pd.get_dummies(cleaned, prefix='g').groupby(level=0).sum()


         g_Action  g_Adventure  g_Fantasy  g_Sci-Fi  g_Thriller
title
Avatar        1.0          1.0        1.0       1.0         0.0
Batman        1.0          0.0        0.0       0.0         1.0
Pirates       1.0          1.0        1.0       0.0         0.0
Spectre       1.0          1.0        0.0       0.0         1.0
philh
  • 221
  • 1
  • 5
  • 3
    Here is a bit more efficient method: x = data.set_index('title').genres.str.split(r'|', expand=True).stack().reset_index(level=1, drop=True).to_frame('genre'); pd.get_dummies(x, prefix='g', columns=['genre']).groupby(level=0).sum() – MaxU - stand with Ukraine Oct 31 '16 at 22:51
  • @MaxU yes indeed, thanks. I didn't know about the expand arg. I've edited a variant on that into the post. (It seemed cleaner to keep the intermediate value as a series, not a dataframe.) – philh Nov 01 '16 at 19:05
  • Thank you for the great tip. Can you please elaborate more on the part where there are more than 10+ different values (basically dynamic field with increasing numbers with time) in the field. For Example: "Competitors" Field in the dataset of Deals made by an Organisation. In this case, each deal contains multiple competitors like ['12334', "Amazon; Microsoft; Gartner"] ['12334', "Amazon; Google"] Unique competitors will be going to increase, probably each time some new competitor will be added to the dataset. In this case above method would go for a toss as the number of fields will now n – Abhishek Jain Jul 19 '18 at 01:01
  • Would you suggest the same method if instead of 5 genres there are more than 100? – Joe Oct 02 '18 at 09:02