0

I have a pandas data frame as shown below,

df = pd.DataFrame({'group_1':['A','A','A', 'B', 'B','B','C','C','C','B'],
  'group_2':['a', 'b', 'a', 'b', 'a', 'b','a','c','b','c'],
  'col_1':np.random.randint(0,100,10),
  'col_2':np.random.randint(0,100,10),
   'col_3':np.random.randint(0,100,10)},
)

df
group_1 group_2 col_1   col_2   col_3
  A       a       37      79     83
  A       b       3       36      0
  A       a       4       11     55
  B       b       95      22     47
  B       a       44      66     78
  B       b       77      55     36
  C       a       28      94     35
  C       c       30      70     15
  C       b       35       5     33
  B       c       98      81     68

Now I'm adding another column whose values depend upon the values from group_2. In the original dataframe I'm working with, I'm adding this column by merging this dataframe with another dataframe. My dataframe now becomes

df
group_1 group_2 col_1   col_2   col_3   col_to_pick
  A        a      37      79      83      col_1
  A        b      3       36       0      col_2
  A        a      4       11      55      col_1
  B        b      95      22      47      col_2
  B        a      44      66      78      col_1
  B        b      77      55      36      col_2
  C        a      28      94      35      col_1
  C        c      30      70      15      col_3
  C        b      35       5      33      col_2
  B        c      98      81      68      col_3

For the purpose of reproducing above dataframe, col_to_pick can be added using

df['col_to_pick'] = np.where(df['group_2'] == 'a', 'col_1',np.where(df['group_2']=='b','col_2','col_3'))

Now I want to add a another column col_4 to my dataframe that takes numbers from col_1 or col_2 or col_3 depending upon col_to_pick value of the corresponding row. For a small dataframe like this, I can acheive this using something like

df['col_4'] = np.where(df['col_to_pick'] == 'col_1', df['col_1'],np.where(df['col_to_pick']=='col_2',df['col_2'],df['col_3']))

and my final dataframe becomes,

df
group_1 group_2 col_1   col_2   col_3   col_to_pick  col_4
  A        a      37      79      83      col_1         37     
  A        b      3       36       0      col_2         36     
  A        a      4       11      55      col_1         4
  B        b      95      22      47      col_2         22      
  B        a      44      66      78      col_1         44
  B        b      77      55      36      col_2         55
  C        a      28      94      35      col_1         28
  C        c      30      70      15      col_3         15
  C        b      35       5      33      col_2          5
  B        c      98      81      68      col_3         68

However, the dataframe I'm working with is very large and it has multiple columns like col_1, col_2 etc. So, methods like np.where are too difficult to implement. How can get col_4 values effeciently for large dataframe with large number of columns?

Edit: The reason I have added group_1 in example is, once I have got my col_4, I want to apply

df['col_4'] = df.groupby(['group_1','group_2'])['col_4'].transform('sum') 
MonkeyDLuffy
  • 456
  • 1
  • 3
  • 12

0 Answers0