I've struggled with a problem and finally found a solution that works. In fact, I found two. My question is: what is the difference between the two approaches? I followed the instructions here: Get column name based on condition in pandas

In the first solution, with .dot, I do not understand why the result gives the same value in every cell. When I removed the patient column, the results were correct. I want all values in the same cell.

I also tried the second solution, with mapping, in the hope of not having to remove any columns. It worked, but here the values are given as a list (of lists?).

What is the difference between the two formats, and what can I do with them?

Data and code:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'patient': [11, 12, 13, 14, 15],
                       'K1': [1, 0, 1, 0, 1],
                       'K2': [0, 0, 0, 1, 0],
                       'K3': [1, 0, 0, 0, 0],
                       'K4': [0, 0, 0, 0, 1],
                       'K5': [1, 1, 0, 0, 0]})
    print(df)

    # Solution 1: tried both with the 'patient' column and without it
    df2 = df.dot(df.columns + ';').str.rstrip(';')
    print(df2)

    # Solution 2: with the 'patient' column
    # Split the frame into one single-row DataFrame per index label
    df_dict = dict(list(df.groupby(df.index)))

    for k, v in df_dict.items():
        # Names of the columns whose value equals 1 in this row
        check = v.columns[(v == 1).any()]
        if len(check) > 0:
            print((k, check.to_list()))

Result:

    Solution 1

       patient  K1  K2  K3  K4  K5
    0       11   1   0   1   0   1
    1       12   0   0   0   0   1
    2       13   1   0   0   0   0
    3       14   0   1   0   0   0
    4       15   1   0   0   1   0
    0    patient;patient;patient;patient;patient;patien...
    1    patient;patient;patient;patient;patient;patien...
    2    patient;patient;patient;patient;patient;patien...
    3    patient;patient;patient;patient;patient;patien...
    4    patient;patient;patient;patient;patient;patien...
    dtype: object

    Solution 1 without patient column:

    0    K1;K3;K5
    1          K5
    2          K1
    3          K2
    4       K1;K4
    dtype: object

    Solution 2
    (0, ['K1', 'K3', 'K5'])
    (1, ['K5'])
    (2, ['K1'])
    (3, ['K2'])
    (4, ['K1', 'K4'])
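
On the difference between the two formats: the first result holds one delimited string per row, while the second holds one Python list per row (a list of strings, not a list of lists). As far as I can tell they carry the same information and can be converted into each other. The sketch below is only an illustration. The variable names are made up here, and it assumes every row has at least one match (an empty string would split into `['']`).

```python
import pandas as pd

# The ';'-joined strings produced by the .dot approach (without 'patient')
as_text = pd.Series(['K1;K3;K5', 'K5', 'K1', 'K2', 'K1;K4'])

# String -> list: each cell becomes a Python list, as in the second approach
as_lists = as_text.str.split(';')

# List -> string: joining the lists gives the first format back
back_to_text = as_lists.str.join(';')

# Lists can be exploded to one row per (original index, column name),
# which makes counting or filtering by column name straightforward
long_form = as_lists.explode()
counts = long_form.value_counts()

print(as_lists.tolist())
print(counts)
```

The string form is probably the more convenient one for display or export to CSV, whereas the list form is easier to process further, for example with `explode` as above.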
  • It's not giving you the same value in all cells. However, in the first approach you should only be taking the dot of the `K` columns. What's happening currently is there are 11, 12, 13, 14, and 15 copies of the word `'patient'` that push the values you're interested in all the way to the end. `df2 = df[df.columns[1:]].dot(df.columns[1:] + ';').str.rstrip(';')` would work fine. – Henry Ecker Oct 05 '21 at 14:02
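
A minimal sketch of what the comment describes, using the data from the question. The dot product multiplies each cell by the matching column label and concatenates the results, so a patient value of 11 contributes eleven copies of `'patient;'`. The names `k_cols` and `flags` are only illustrative, and selecting the columns with `df.columns.drop('patient')` is one option among several (the comment uses `df.columns[1:]`).

```python
import pandas as pd

df = pd.DataFrame({'patient': [11, 12, 13, 14, 15],
                   'K1': [1, 0, 1, 0, 1],
                   'K2': [0, 0, 0, 1, 0],
                   'K3': [1, 0, 0, 0, 0],
                   'K4': [0, 0, 0, 0, 1],
                   'K5': [1, 1, 0, 0, 0]})

# Multiplying a string by an integer repeats it; this is why the
# 'patient' column floods the result when it is part of the dot product
repeated = 11 * 'patient;'

# Take the dot product over the K columns only; 'patient' stays in df
k_cols = df.columns.drop('patient')
flags = df[k_cols].dot(k_cols + ';').str.rstrip(';')

# Attach the result as a new column without removing anything
df['flags'] = flags
print(df)
```

This way the `patient` column does not have to be removed from the DataFrame; it is only left out of the calculation.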