I have a dataset of movies with 28 columns, one of which is genres. For each row, the value of the genres column has the form "Action|Animation|Comedy|Family|Fantasy". I want to encode the genres using pandas.get_dummies(), but since the column holds multiple values per row, how should I deal with this? Additional information at the link below (question moved from Stack Overflow): https://stackoverflow.com/q/40331558/4028904
I just want to ask: in this case, if we are doing a linear regression (with these as independent variables), should we drop one of the dummy variables or not? (Sorry, can't comment due to my reputation.) – Ankit Kumar Jun 29 '18 at 19:46
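On the comment above: for a regression you typically keep k−1 of the k dummy columns, and pd.get_dummies supports this directly via drop_first=True. A minimal sketch on a toy single-label column (hypothetical data, not the movies dataset):

```python
import pandas as pd

# toy single-label column (hypothetical, not the movies dataset)
s = pd.Series(['Action', 'Comedy', 'Drama', 'Comedy'])

# drop_first=True keeps k-1 of the k dummy columns, which avoids
# perfect collinearity (the "dummy variable trap") in a regression
d = pd.get_dummies(s, drop_first=True)
```

Here the first category (Action) is dropped, so a row of all zeros means "Action".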
1 Answer
I'm starting with the following dataset:
import pandas as pd
data = pd.DataFrame({'title': ['Avatar', 'Pirates', 'Spectre', 'Batman'],
                     'genres': ['Action|Adventure|Fantasy|Sci-Fi',
                                'Action|Adventure|Fantasy',
                                'Action|Adventure|Thriller',
                                'Action|Thriller']},
                    columns=['title', 'genres'])
title genres
0 Avatar Action|Adventure|Fantasy|Sci-Fi
1 Pirates Action|Adventure|Fantasy
2 Spectre Action|Adventure|Thriller
3 Batman Action|Thriller
First, you want to have your data in a structure pairing titles with one genre at a time, multiple rows per title. You can get it in a series like this:
cleaned = data.set_index('title').genres.str.split('|', expand=True).stack()
title
Avatar 0 Action
1 Adventure
2 Fantasy
3 Sci-Fi
Pirates 0 Action
1 Adventure
2 Fantasy
Spectre 0 Action
1 Adventure
2 Thriller
Batman 0 Action
1 Thriller
dtype: object
(There's an extra index level that we don't want, but we'll get rid of it soon.) get_dummies will now work, but it only works on one row at a time, so we need to re-aggregate the titles:
pd.get_dummies(cleaned, prefix='g').groupby(level=0).sum()
g_Action g_Adventure g_Fantasy g_Sci-Fi g_Thriller
title
Avatar 1.0 1.0 1.0 1.0 0.0
Batman 1.0 0.0 0.0 0.0 1.0
Pirates 1.0 1.0 1.0 0.0 0.0
Spectre 1.0 1.0 0.0 0.0 1.0
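A more compact route to the same table is pandas' built-in Series.str.get_dummies, which splits on a separator and one-hot encodes in a single call:

```python
import pandas as pd

data = pd.DataFrame({'title': ['Avatar', 'Pirates', 'Spectre', 'Batman'],
                     'genres': ['Action|Adventure|Fantasy|Sci-Fi',
                                'Action|Adventure|Fantasy',
                                'Action|Adventure|Thriller',
                                'Action|Thriller']})

# split each string on '|' and one-hot encode in one step;
# columns come out sorted alphabetically, as 0/1 integer flags
dummies = data.set_index('title').genres.str.get_dummies(sep='|')
```

The column names lack the g_ prefix used above; chaining .add_prefix('g_') restores it if needed.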
philh
Here is a bit more efficient method:
x = data.set_index('title').genres.str.split(r'|', expand=True).stack().reset_index(level=1, drop=True).to_frame('genre')
pd.get_dummies(x, prefix='g', columns=['genre']).groupby(level=0).sum()
– MaxU - stand with Ukraine Oct 31 '16 at 22:51
@MaxU yes indeed, thanks. I didn't know about the expand arg. I've edited a variant on that into the post. (It seemed cleaner to keep the intermediate value as a series, not a dataframe.) – philh Nov 01 '16 at 19:05
Thank you for the great tip. Can you please elaborate on the case where a field has many (10+) distinct values that grow over time? For example, a "Competitors" field in a dataset of deals made by an organisation: each deal contains multiple competitors, like ['12334', "Amazon; Microsoft; Gartner"] ['12334', "Amazon; Google"]. The set of unique competitors will keep growing, as new competitors will probably be added to the dataset over time. In this case the above method would go for a toss as the number of fields will now n – Abhishek Jain Jul 19 '18 at 01:01
Would you suggest the same method if instead of 5 genres there are more than 100? – Joe Oct 02 '18 at 09:02
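On the last two comments (a vocabulary that keeps growing, or 100+ genres): the approach still works, because the dummy columns are derived from the data rather than declared in advance; each run simply yields one column per value currently present. A sketch using a hypothetical deals table modeled on the competitors example (IDs and names invented for illustration):

```python
import pandas as pd

# hypothetical deals data modeled on the comment's example
deals = pd.DataFrame({'deal_id': ['12334', '12335'],
                      'competitors': ['Amazon; Microsoft; Gartner',
                                      'Amazon; Google']})

# the column set is derived from the data, so a new competitor in a
# later batch of rows simply produces an extra column on the next run
competitors = deals.set_index('deal_id').competitors.str.get_dummies(sep='; ')
```

When batches are encoded separately, reindex(columns=shared_columns, fill_value=0) on each frame keeps them aligned to a common column list.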