1

Assuming I have the following Pandas DataFrame:

     U     A         B
0  2000    10       20
1  3000    40        0 
2  2100    20       30
3  2500     0       30 
4  2600    30       40

How can I get the index of first row that both A and B have non-zero value and (A+B)/2 is larger than 15 ?

In this example, I would like to get 2 since it is the first row that have non-zero A and B column and avg value of 25 which is more than 15

Note that this DataFrame is huge, I am looking for the fastest way to the index value.

Chris
  • 12,661
  • 3
  • 20
  • 33
ARH
  • 1,315
  • 3
  • 16
  • 32
  • does [this](https://stackoverflow.com/questions/40660088/get-first-row-of-dataframe-in-python-pandas-based-on-criteria) answer your question? – Liad Kehila Dec 28 '20 at 22:33

3 Answers3

5

Lets try:

 df[(df.A.ne(0)&df.B.ne(0))&((df.A+df.B)/2).gt(15)].first_valid_index()
jakub
  • 13,953
  • 2
  • 30
  • 52
wwnde
  • 22,093
  • 5
  • 13
  • 27
1

I find more readable explicit variables, like:

AB2 = (df['A']+df['B'])/2 
filter = (df['A'] != 0) & (df['B'] != 0) & (AB2>15)
your_index = df[filter].index[0]

Performance For this use case (ridiculous dataset)

%%timeit
df[(df.A.ne(0)&df.B.ne(0))&((df.A+df.B)/2).gt(15)].first_valid_index()
**1.21 ms** ± 35.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
AB2 = (df['A']+df['B'])/2 
filter = (df['A'].ne(0)) & (df['B'].ne(0)) & (AB2>15)
df[filter].index[0]
**1.08 ms** ± 28.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
df.query("A!=0 and B!=0 and (A+B)/2 > 15").index[0]
**2.71 ms** ± 157 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Glauco
  • 1,097
  • 2
  • 8
  • 18
  • Does it perform as fast as the other answer ? – ARH Dec 28 '20 at 23:01
  • This is a little bit faster because sum of A+B is done one time: %%timeit df[(df.A.ne(0)&df.B.ne(0))&((df.A+df.B)/2).gt(15)].first_valid_index() 1.21 ms ± 35.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) %%timeit AB2 = (df['A']+df['B'])/2 filter = (df['A'].ne(0)) & (df['B'].ne(0)) & (AB2>15) df[filter].index[0] 1.08 ms ± 28.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) – Glauco Dec 28 '20 at 23:25
0

If the dataframe is large, query might be faster:

df.query("A!=0 and B!=0 and (A+B)/2 > 15").index[0]

      2
sammywemmy
  • 22,944
  • 4
  • 14
  • 28