Deleting DataFrame row in Pandas based on column value

Question

I have the following DataFrame:

             daysago  line_race rating        rw    wrating
 line_date                                                 
 2007-03-31       62         11     56  1.000000  56.000000
 2007-03-10       83         11     67  1.000000  67.000000
 2007-02-10      111          9     66  1.000000  66.000000
 2007-01-13      139         10     83  0.880678  73.096278
 2006-12-23      160         10     88  0.793033  69.786942
 2006-11-09      204          9     52  0.636655  33.106077
 2006-10-22      222          8     66  0.581946  38.408408
 2006-09-29      245          9     70  0.518825  36.317752
 2006-09-16      258         11     68  0.486226  33.063381
 2006-08-30      275          8     72  0.446667  32.160051
 2006-02-11      475          5     65  0.164591  10.698423
 2006-01-13      504          0     70  0.142409   9.968634
 2006-01-02      515          0     64  0.134800   8.627219
 2005-12-06      542          0     70  0.117803   8.246238
 2005-11-29      549          0     70  0.113758   7.963072
 2005-11-22      556          0     -1  0.109852  -0.109852
 2005-11-01      577          0     -1  0.098919  -0.098919
 2005-10-20      589          0     -1  0.093168  -0.093168
 2005-09-27      612          0     -1  0.083063  -0.083063
 2005-09-07      632          0     -1  0.075171  -0.075171
 2005-06-12      719          0     69  0.048690   3.359623
 2005-05-29      733          0     -1  0.045404  -0.045404
 2005-05-02      760          0     -1  0.039679  -0.039679
 2005-04-02      790          0     -1  0.034160  -0.034160
 2005-03-13      810          0     -1  0.030915  -0.030915
 2004-11-09      934          0     -1  0.016647  -0.016647

I need to remove the rows where line_race is equal to 0. What's the most efficient way to do this?

Possible duplicate of [How to delete rows from a pandas DataFrame based on a conditional expression](http://stackoverflow.com/questions/13851535/how-to-delete-rows-from-a-pandas-dataframe-based-on-a-conditional-expression) — feetwet, Dec 08 '16 at 19:54

score 1368 · Accepted Answer · answered Aug 11 '13 at 14:38

If I'm understanding correctly, it should be as simple as:

df = df[df.line_race != 0]
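
A minimal sketch of the mask-based approach on a small, made-up frame (the values are illustrative, not the asker's data). It also touches on two points raised in the comments: taking an explicit `.copy()` if you plan to assign into the result, and optionally resetting the index afterwards.

```python
import pandas as pd

# Toy data standing in for the question's frame (values are made up).
df = pd.DataFrame({
    "line_race": [11, 9, 0, 10, 0],
    "rating": [56, 66, 70, 83, -1],
})

# Boolean mask: True for the rows to keep.
mask = df["line_race"] != 0

# .copy() makes the result an independent frame, so later assignments
# do not trigger the "copy of a slice" warning.
filtered = df[mask].copy()

# Optional: renumber the rows 0..n-1 instead of keeping the old labels.
filtered = filtered.reset_index(drop=True)

print(filtered)
```

The original `df` is left untouched here; rebinding the name (`df = df[mask]`) is what makes the rows appear "deleted".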

answered Aug 11 '13 at 14:38

tshauck

22

Will this cost more memory if `df` is large? Or, can I do it inplace? – Ziyuan May 08 '15 at 13:21
20

Just ran it on a `df` with 2M rows and it went pretty fast. – Dror Aug 11 '15 at 14:37
What if line_race has a space in it? Like 'line race'? – vfxGer Nov 03 '15 at 11:46
Just use the inverse of the condition you would use to select it! Thank you! – ssoto Nov 04 '15 at 16:06
72

@vfxGer if there is a space in the column, like 'line race', then you can just do `df = df[df['line race'] != 0]` – Paul Apr 27 '16 at 16:36
4

How would we modify this command if we wanted to delete the whole row if the value in question is found in any of columns in that row? – Alex Apr 27 '16 at 20:27
this worked for me - thanks. figure it doesn't hurt to add that using the == rather than != operator also does what you think it would do. (keeps rows that match, discards rows that differ). – 10mjg Sep 13 '16 at 19:27
5

Thanks! Fwiw, for me this had to be `df=df[~df['DATE'].isin(['2015-10-30.1', '2015-11-30.1', '2015-12-31.1'])]` – citynorman Dec 05 '16 at 14:47
1

If I should check not one column but 10 columns? – GML-VS May 16 '17 at 09:52
1

I want to delete if value is in list. something like `df = df[df.coll in rejectList]`? – user7867665 Apr 30 '18 at 16:43
2

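If you have multiple conditions you can do this: `df = df[(df['line race'] != 0) | (df['line race'].isin([1,2]))]`. Use `|` as `or` and `&` as `and`, and wrap each condition in parentheses because these operators bind more tightly than comparisons. You can use `isin(Iterable)` like Python's `in`. – Kubaba Sep 12 '18 at 08:00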
Would anyone know what this syntax or operator is called? – Zorayr Aug 07 '20 at 17:50
do you mean not equal? – JPWilson Aug 31 '20 at 00:46
1

Does anyone know answer to the @user7867665 question. Even I'm stuck with that situation. Delete a row of dataframe "df" if value of column "A" is in list "lst"? – fellowCoder Oct 27 '20 at 21:45
how would you do this if you wanted to delete a column based on the value from a specific row (in my case its the last row with the sums for each col)? – A Bedoya Jan 11 '21 at 17:40
If you need to set values in the resulting dataframe, this solution will raise warnings of "trying to set values on a copy of a slice". You should add a `.copy()` statement in that case. – Kevin Liu May 19 '21 at 18:57

score 270 · Answer 2 · edited Dec 23 '15 at 10:45

For any future passers-by, it is worth mentioning that the pattern df = df[df.line_race != 0] doesn't do anything when you try to use it to filter out None/missing values.

Does work:

df = df[df.line_race != 0]

Doesn't do anything:

df = df[df.line_race != None]

Does work:

df = df[df.line_race.notnull()]
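
To make the difference concrete, here is a small sketch with a made-up column containing a missing value. It assumes a float column where the gap is stored as NaN; the element-wise comparison against `None` comes out True for every row, so nothing is filtered.

```python
import numpy as np
import pandas as pd

# Made-up data: one row has a missing line_race.
df = pd.DataFrame({"line_race": [11.0, np.nan, 0.0]})

# Element-wise `!= None` is True for every row, so all rows survive.
unchanged = df[df["line_race"] != None]  # noqa: E711 (deliberate, to show the pitfall)

# notnull() (an alias of notna()) flags the non-missing rows.
cleaned = df[df["line_race"].notnull()]

print(len(unchanged), len(cleaned))
```

`df.dropna(subset=["line_race"])` is another way to express the last step.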

edited Dec 23 '15 at 10:45

jezrael

answered Jun 30 '14 at 11:56

wonderkid2

5

how to do that if we don't know the column name? – Piyush S. Wanare Jul 03 '18 at 13:20
1

Could do `df = df[df[df.columns[2]].notnull()]`, but one way or another you need to be able to index the column somehow. – erekalper Nov 09 '18 at 20:35
3

`df = df[df.line_race != 0]` drops the rows but also does not reset the index. So when you add another row in the df it may not add at the end. I'd recommend resetting the index after that operation (`df = df.reset_index(drop=True)`) – thenewjames Jul 17 '19 at 14:53
1

You should never compare to None with the `==` operator to start. https://stackoverflow.com/questions/3257919/what-is-the-difference-between-is-none-and-none – Bram Vanroy Apr 13 '20 at 08:25
1

For `None` values, note that `is` and `is not` test the identity of the whole Series object rather than each element, so `df = df[df.line_race is not None]` will not filter rows; use `df = df[df.line_race.notnull()]` instead – Pradyut Jun 18 '21 at 05:11

score 107 · Answer 3 · answered Sep 28 '18 at 02:29

Just to add another solution, particularly useful if you are using the new pandas accessors: the other solutions replace the original DataFrame object and lose the accessors.

df.drop(df.loc[df['line_race']==0].index, inplace=True)
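
For readers wondering what `.index` and `inplace=True` do here, a rough sketch on made-up data: the mask selects the offending rows, `.index` extracts their labels, and `drop(..., inplace=True)` removes those labels from the existing object instead of returning a new one. `drop` works by label, so this sketch assumes the index labels are unique.

```python
import pandas as pd

df = pd.DataFrame({"line_race": [11, 0, 9, 0], "rating": [56, 70, 66, -1]})
same_object = df  # second name bound to the same DataFrame

# Labels of the rows where line_race is 0.
labels_to_drop = df.loc[df["line_race"] == 0].index

# Remove those labels from the existing object; the call returns None.
df.drop(labels_to_drop, inplace=True)

print(df)
print(same_object is df)
```

The surviving rows keep their original labels; add `df.reset_index(drop=True, inplace=True)` if a fresh 0..n-1 index is wanted.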

answered Sep 28 '18 at 02:29

desmond

5

what is the purpose of writing index and inplace. Can anyone explain please ? – heman123 Nov 09 '18 at 06:05
4

[Read the docs!](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html#pandas.DataFrame.drop) – Federico Corazza Apr 15 '19 at 09:52
3

I think we'd need to `.reset_index()` as well if someone ends up using index accessors – Ayush Jun 08 '20 at 21:14
1

This is indeed the correct answer for searching data and dropping rows. Adding more explanation here: `df.loc[df['line_race']==0].index` finds the row index of every row whose 'line_race' column has the value 0. `inplace=True` modifies the original dataframe df. If you do not want to modify the original dataframe, remove it (the default is False) and store the return value in another dataframe. – AndroDev Nov 22 '21 at 14:50

score 56 · Answer 4 · answered Jul 23 '19 at 08:00

If you want to delete rows based on multiple values of the column, you could use:

df[(df.line_race != 0) & (df.line_race != 10)]

To drop all rows with values 0 and 10 for line_race.

answered Jul 23 '19 at 08:00

Robvh

4

Is there a more efficient way to do this if you had multiple values you wanted to drop i.e., `drop = [0, 10]` and then something like `df[(df.line_race != drop)]` – mikey Jun 10 '20 at 19:32
2

Good suggestion. `df[(df.line_race != drop)]` does not work, but I guess there is a way to do it more efficiently. I do not have a solution right now, but if someone has one, please let us know. – Robvh Jun 15 '20 at 11:39
8

df[~(df["line_race"].isin([0,10]))] https://stackoverflow.com/questions/38944673/how-to-drop-all-the-rows-based-on-multiple-values-found-in-the-fruit-column – Charlotte Deng Jul 07 '20 at 22:57

Phillip Cloud · Answer 5 · 2014-02-18T01:00:33.757

The best way to do this is with boolean masking:

In [56]: df
Out[56]:
     line_date  daysago  line_race  rating    raw  wrating
0   2007-03-31       62         11      56  1.000   56.000
1   2007-03-10       83         11      67  1.000   67.000
2   2007-02-10      111          9      66  1.000   66.000
3   2007-01-13      139         10      83  0.881   73.096
4   2006-12-23      160         10      88  0.793   69.787
5   2006-11-09      204          9      52  0.637   33.106
6   2006-10-22      222          8      66  0.582   38.408
7   2006-09-29      245          9      70  0.519   36.318
8   2006-09-16      258         11      68  0.486   33.063
9   2006-08-30      275          8      72  0.447   32.160
10  2006-02-11      475          5      65  0.165   10.698
11  2006-01-13      504          0      70  0.142    9.969
12  2006-01-02      515          0      64  0.135    8.627
13  2005-12-06      542          0      70  0.118    8.246
14  2005-11-29      549          0      70  0.114    7.963
15  2005-11-22      556          0      -1  0.110   -0.110
16  2005-11-01      577          0      -1  0.099   -0.099
17  2005-10-20      589          0      -1  0.093   -0.093
18  2005-09-27      612          0      -1  0.083   -0.083
19  2005-09-07      632          0      -1  0.075   -0.075
20  2005-06-12      719          0      69  0.049    3.360
21  2005-05-29      733          0      -1  0.045   -0.045
22  2005-05-02      760          0      -1  0.040   -0.040
23  2005-04-02      790          0      -1  0.034   -0.034
24  2005-03-13      810          0      -1  0.031   -0.031
25  2004-11-09      934          0      -1  0.017   -0.017

In [57]: df[df.line_race != 0]
Out[57]:
     line_date  daysago  line_race  rating    raw  wrating
0   2007-03-31       62         11      56  1.000   56.000
1   2007-03-10       83         11      67  1.000   67.000
2   2007-02-10      111          9      66  1.000   66.000
3   2007-01-13      139         10      83  0.881   73.096
4   2006-12-23      160         10      88  0.793   69.787
5   2006-11-09      204          9      52  0.637   33.106
6   2006-10-22      222          8      66  0.582   38.408
7   2006-09-29      245          9      70  0.519   36.318
8   2006-09-16      258         11      68  0.486   33.063
9   2006-08-30      275          8      72  0.447   32.160
10  2006-02-11      475          5      65  0.165   10.698

UPDATE: Now that pandas 0.13 is out, another way to do this is df.query('line_race != 0').

df.query looks very useful! Thanks! http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.DataFrame.query.html — fantabolous, Apr 06 '14 at 14:43
Good update for `query`. It allows for richer selection criteria (e.g. set-like operations like `df.query('variable in @var_list')`, where `var_list` is a local list of desired values, referenced with the `@` prefix) — philE, Sep 30 '14 at 20:32
how would this be achieved if the column name has a space in the name? — iNoob, Oct 05 '15 at 13:56
`query` is not very useful if the column name has a space in it. — Phillip Cloud, Oct 07 '15 at 19:12
I would avoid having spaces in the headers with something like this `df = df.rename(columns=lambda x: x.strip().replace(' ','_'))` — Scientist1642, Nov 28 '16 at 18:27
@Scientist1642 Same, but more concise: `df.columns = df.columns.str.replace(' ', '_')`. — RolfBly, Aug 13 '18 at 13:01
Spaces in columns are ok since Pandas release 0.25.0. Spaces are handled by surrounding the column name with backticks — labroid, Oct 17 '21 at 17:03
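
A short sketch pulling the `query` comments together, on made-up data. It assumes pandas 0.25.0 or later for the backtick syntax; the local list `drop_values` is a hypothetical name, referenced inside the expression with the `@` prefix.

```python
import pandas as pd

df = pd.DataFrame({
    "line_race": [11, 0, 9, 10],
    "line race": [11, 0, 9, 10],  # same values under a name containing a space
})

# Plain column name.
no_zero = df.query("line_race != 0")

# Column name with a space, quoted with backticks (pandas >= 0.25.0).
no_zero_spaced = df.query("`line race` != 0")

# Local Python variable, referenced with @.
drop_values = [0, 10]
kept = df.query("line_race not in @drop_values")

print(kept)
```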

score 45 · Answer 6 · edited Sep 29 '21 at 06:32

In case of multiple values and str dtype

I used the following to filter out given values in a col:

import pandas as pd

def filter_rows_by_values(df, col, values):
    # Keep only the rows whose value in `col` is not in `values`.
    return df[~df[col].isin(values)]

Example:

In a DataFrame I want to remove rows which have values "b" and "c" in column "str"

df = pd.DataFrame({"str": ["a","a","a","a","b","b","c"], "other": [1,2,3,4,5,6,7]})
df
   str  other
0   a   1
1   a   2
2   a   3
3   a   4
4   b   5
5   b   6
6   c   7

filter_rows_by_values(df, "str", ["b","c"])

   str  other
0   a   1
1   a   2
2   a   3
3   a   4

edited Sep 29 '21 at 06:32

zr0gravity7

answered Jan 14 '21 at 16:23

Mo_Offical

2

This is a very useful little function. Thanks. – Erich Jan 27 '21 at 19:55
1

I also liked this. Might be totally obsolete, but added a small parameter that helps me decide whether select or delete it. Handy if you want to split a df in two: `def filter_rows_by_values(df, col, values, true_or_false = False): return df[df[col].isin(values) == true_or_false]` – Charles Apr 09 '21 at 09:44
You can replace `df[df[col].isin(values) == False]` by another negating condition using the tilde `~` invert operator `df[~df[col].isin(values)]`. See [How can I obtain the element-wise logical NOT of a pandas Series?](https://stackoverflow.com/questions/15998188/how-can-i-obtain-the-element-wise-logical-not-of-a-pandas-series) – Paul Rougieux Sep 22 '21 at 07:14
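
The select-or-delete variant from the comments can also be written by computing the mask once and using it in both directions. A sketch, with `split_rows_by_values` as a hypothetical helper name and made-up data:

```python
import pandas as pd

def split_rows_by_values(df, col, values):
    """Return (matching, non_matching) rows for df[col] in values."""
    mask = df[col].isin(values)
    return df[mask], df[~mask]

df = pd.DataFrame({"str": ["a", "a", "b", "c"], "other": [1, 2, 3, 4]})
removed, kept = split_rows_by_values(df, "str", ["b", "c"])

print(kept)
print(removed)
```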

score 44 · Answer 7 · edited Nov 18 '21 at 20:32

The previous answers are close to what I am going to do, but indexing df.index directly does not require another indexing method such as .loc. It can be done in a similar but more concise manner:

df.drop(df.index[df['line_race'] == 0], inplace = True)

edited Nov 18 '21 at 20:32

micstr

answered Jan 12 '19 at 09:32

Loochie

2

In place solution better for large datasets or memory constrained. +1 – davmor Oct 24 '19 at 11:45

score 17 · Answer 8 · answered Mar 06 '17 at 12:50

The given answer is correct. Nonetheless, as someone above said, you can use df.query('line_race != 0'), which, depending on your problem, can be much faster. Highly recommended.

answered Mar 06 '17 at 12:50

h3h325

Especially helpful if you have long `DataFrame` variable names like me (and, I'd venture to guess, everyone as compared to the `df` used for examples), because you only have to write it once. – ijoseph Apr 26 '18 at 17:52
Why would that be faster? You're taking a string and evaluating it as opposed to a normal expression. – Joshua Snider Jul 07 '21 at 03:24
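
Whether `query` is actually faster depends on the data size, the expression, and whether the optional `numexpr` package is installed, so it is worth measuring on your own data. A rough timing sketch on synthetic data (the frame size and repeat count are arbitrary choices):

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic frame; size chosen arbitrarily for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({"line_race": rng.integers(0, 12, size=200_000)})

mask_result = df[df["line_race"] != 0]
query_result = df.query("line_race != 0")

# Both approaches should select exactly the same rows.
assert mask_result.equals(query_result)

t_mask = timeit.timeit(lambda: df[df["line_race"] != 0], number=20)
t_query = timeit.timeit(lambda: df.query("line_race != 0"), number=20)
print(f"boolean mask: {t_mask:.4f}s, query: {t_query:.4f}s")
```

No expected timings are given here because they vary by machine and pandas version.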

score 8 · Answer 9 · answered Mar 06 '21 at 18:47

One efficient and idiomatic pandas way is to use the eq() method:

df[~df.line_race.eq(0)]

answered Mar 06 '21 at 18:47

ashkangh

4

Why not `df[df.line_race.ne(0)]`? – BSalita Jul 09 '21 at 07:54

score 7 · Answer 10 · answered Oct 10 '18 at 14:24

Another way of doing it. It may not be the most efficient, as the code looks a bit more complex than the code in other answers, but it is still an alternative way of doing the same thing.

df = df.drop(df[df['line_race']==0].index)

answered Oct 10 '18 at 14:24

Amruth Lakkavaram

score 7 · Answer 11 · answered Mar 06 '21 at 17:57

Adding one more way to do this.

df = df.query("line_race != 0")

answered Mar 06 '21 at 17:57

Tufail Waris

score 6 · Answer 12 · edited Jun 11 '20 at 22:22

I ran this code and it works; you can try it yourself.

import pandas as pd

data = pd.read_excel('file.xlsx')

If a column name contains a special character or a space, put it in quotes inside square brackets, as in the code below:

data = data[data['expire/t'].notnull()]
print(data)

If the column name is a single string without spaces or special characters, you can access it directly as an attribute:

data = data[data.expire != 0]
print(data)

score 3 · Answer 13 · edited Feb 06 '20 at 03:33

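Just adding another way, applying the filter across all columns of the DataFrame:

for column in df.columns:
    df = df[df[column] != 0]

Example:

import numpy as np

def z_score(data, count):
    # Drop rows holding a value more than `threshold` standard deviations
    # from its column's mean; `count` is incremented for each such value.
    threshold = 3
    for column in data.columns:
        mean = np.mean(data[column])
        std = np.std(data[column])
        for i in data[column]:
            zscore = (i - mean) / std
            if np.abs(zscore) > threshold:
                count = count + 1
                data = data[data[column] != i]
    return data, count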
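
A vectorized sketch of the same idea, assuming every column is numeric. It is not an exact equivalent of the loop above: here all z-scores are computed once from the original data, whereas the loop recomputes each column's mean and standard deviation after earlier columns have already removed rows. `drop_outliers` is a hypothetical name, and it counts dropped rows rather than offending values.

```python
import pandas as pd

def drop_outliers(data, threshold=3):
    """Keep rows whose z-score is within `threshold` in every column."""
    # ddof=0 gives the population standard deviation, as np.std does by default.
    z = (data - data.mean()) / data.std(ddof=0)
    keep = (z.abs() <= threshold).all(axis=1)
    return data[keep], int((~keep).sum())

data = pd.DataFrame({
    "a": [1.0] * 19 + [100.0],           # last value is an obvious outlier
    "b": [float(i) for i in range(20)],  # no outliers
})
filtered, n_dropped = drop_outliers(data)
print(len(filtered), n_dropped)
```

A constant column has zero standard deviation, which makes its z-scores NaN, so it would need special handling before using this sketch.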

score 3 · Answer 14 · answered Sep 30 '21 at 17:23

Use this in case you need to delete a row when the value can appear in different columns. In my case I was working with percentages, so I wanted to delete the rows that have a value of 1 in any column, since that means 100%.

for x in df:
    df.drop(df.loc[df[x]==1].index, inplace=True)

This is not optimal if your df has many columns.
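
A loop-free sketch of the same filter, which may scale better with many columns: compare the whole frame at once and drop the rows where any column matches. The data is made up.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [0.2, 1.0, 0.5, 0.3],
    "y": [0.8, 0.0, 0.5, 1.0],
})

# (df == 1) is a frame of booleans; any(axis=1) is True for rows
# where at least one column equals 1.
has_one = (df == 1).any(axis=1)
df = df[~has_one]

print(df)
```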
