There are a few answers to related questions, such as "Pandas split multiple columns of lists", but I was unable to apply them to my case.
Case: Each row is a search inquiry. Each cell in some columns of the df holds a list of filters. I want to expand every such column so that each filter becomes a separate column, set to 1 if the user applied that filter during the inquiry and 0 otherwise. It's like a dummy (one-hot) encoder, but each cell holds a list of values instead of a single value.
You can run the code below to see the desired end result. However, my current solution is slow and the result might take up too much memory.
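To make the target concrete, here is a minimal sketch of the encoding on a made-up single-column frame. The name `toy` and its index values are illustrative assumptions, not the real data; the column name `filters` and the filter ids are taken from the example further down.

```python
import pandas as pd

# Hypothetical toy frame: one list of filter ids per search inquiry.
toy = pd.DataFrame({"filters": [[], [1059, 5254], [1059]]}, index=[10, 11, 12])

# One row per (inquiry, filter) pair; an empty list becomes a single NaN row.
exploded = toy["filters"].explode()

# One indicator column per distinct filter; NaN rows get all zeros.
dummies = pd.get_dummies(exploded, prefix="filters", dtype="uint8")

# Collapse back to one row per inquiry, keeping the original index.
encoded = dummies.groupby(level=0).max()
print(encoded)
```

Inquiry 11 ends up with a 1 in both `filters_1059` and `filters_5254`, inquiry 12 only in `filters_1059`, and inquiry 10 (no filters) is all zeros.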
Here is what I did:
import pandas as pd
from sklearn.model_selection import KFold

df_to_encode = pd.DataFrame({'filters': [[], [], [], [], [], [], [1059, 5254]],
                             'autobodytype': [[], [], [], [], [], [], []],
                             'interestsparams': [[329277, 1059], [], [329273, 1059, 10208, 329295], [329308, 18], [], [], []],
                             'interestscats': [['n106', 'n114', 'h20', 'n21', 'h24'], ['h111', 'h114', 'h27', 'h28'], ['n114', 'p116', 'h24', 'h25', 'h26', 'n40', 'h42', 'n85', 'h9'], ['n10', 'n14', 'h19', 'n20', 'n21', 'n25', 'p40', 'n9'], ['e10', 'h14', 'h19', 'h20', 'h81'], ['e20', 'h25'], []]})
# cols_to_encode was not defined in the original snippet; assumed here to be all list columns.
cols_to_encode = list(df_to_encode.columns)

kf = KFold(n_splits=2)  # used only to process the rows in chunks
final_df = pd.DataFrame()
for fold, (_, idx) in enumerate(kf.split(df_to_encode)):
    sample_df = df_to_encode.iloc[idx, :]
    # One row per list element; the original index is repeated for each element.
    for c in cols_to_encode:
        sample_df = sample_df.explode(c)
    sample_df = pd.get_dummies(sample_df, columns=cols_to_encode)
    # Collapse the exploded rows back to one row per original index.
    sample_df = sample_df.groupby(by=sample_df.index, sort=False).max()
    # Columns missing from a chunk become NaN after concat, hence fillna(0).
    final_df = pd.concat([final_df, sample_df], axis=0).fillna(0).astype('uint8')
final_df
Note: I have to use KFold to process the rows in chunks; otherwise (I think) the DF gets too big during the dummy encoding.
Note: I need to keep the index, because I'm going to merge these columns with the rest of the DF that I put aside.
Note: 10% of the real data takes 300 MB after encoding and 10 minutes to encode. I can wait, but I can't have a huge DF. Any tips are much appreciated.
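Since the memory concern comes from storing mostly zeros, one possible direction is to keep the encoded result sparse. The following is only a sketch under assumptions: it uses scikit-learn's `MultiLabelBinarizer` with `sparse_output=True` and pandas' `DataFrame.sparse.from_spmatrix`, on a made-up frame `toy` with illustrative index values. It has not been measured against the real data, so whether it actually helps with size or speed there is untested.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy frame; the index values stand in for the real index.
toy = pd.DataFrame(
    {"interestsparams": [[329277, 1059], [], [1059, 18]]},
    index=[100, 101, 102],
)

# sparse_output=True returns a SciPy sparse matrix instead of a dense array.
mlb = MultiLabelBinarizer(sparse_output=True)
matrix = mlb.fit_transform(toy["interestsparams"])  # one row per inquiry

# Wrap the sparse matrix in a DataFrame with sparse columns, reusing the
# original index so the result can be merged back later.
sparse_encoded = pd.DataFrame.sparse.from_spmatrix(
    matrix,
    index=toy.index,
    columns=[f"interestsparams_{c}" for c in mlb.classes_],
)
print(sparse_encoded.sparse.to_dense())
```

Each list column would need its own binarizer, with the per-column results concatenated along `axis=1`.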
Any suggestions you guys?
Thank you in advance.