2

Suppose I have the following df:

df = pd.DataFrame({
    'col1':['x1','x2','x3'],
    'col2':['y1','y2','y3'],
    'col3':['z1','z2','z3'],
    'col4':['a1','b2','c3']
})

and a list of elements:

l = ['x1','x2','y3']

I want to search elements of l in all the columns of my df, as it stands from my list x1 and x2 appear in col1 and y3 is in col2, so I did:

df.loc[df['col1'].apply(lambda x: True if any(i in x for i in l) else False)|
       df['col2'].apply(lambda x: True if any(i in x for i in l) else False)]

which gives me

  col1 col2 col3 col4
0   x1   y1   z1   a1
1   x2   y2   z2   b2
2   x3   y3   z3   c3

as desired but the above method needs me to make a | operator for each column. So I wonder how can I do this iteration over all columns efficiently without using | for every column?

richardec
  • 14,202
  • 6
  • 23
  • 49
Wiliam
  • 846
  • 8
  • 19

1 Answers1

2

A much, much more efficient way of doing this would be to use numpy broadcasting.

row_mask = (df.to_numpy() == l[:, None, None]).sum(axis=0).any(axis=1)
filtered = df[row_mask]

Output:

>>> filtered
  col1 col2 col3 col4
0   x1   y1   z1   a1
1   x2   y2   z2   b2
2   x3   y3   z3   c3
richardec
  • 14,202
  • 6
  • 23
  • 49