How to drop rows from pandas data frame that contains a particular string in a particular column?

Question

I have a very large data frame in python and I want to drop all rows that have a particular string inside a particular column.

For example, I want to drop all rows which have the string "XYZ" as a substring in the column C of the data frame.

Can this be implemented in an efficient way using .drop() method?

score 272 · Accepted Answer · edited Oct 30 '18 at 12:11

272

pandas has vectorized string operations, so you can just filter out the rows that contain the string you don't want:

In [91]: df = pd.DataFrame(dict(A=[5,3,5,6], C=["foo","bar","fooXYZbar", "bat"]))

In [92]: df
Out[92]:
   A          C
0  5        foo
1  3        bar
2  5  fooXYZbar
3  6        bat

In [93]: df[~df.C.str.contains("XYZ")]
Out[93]:
   A    C
0  5  foo
1  3  bar
3  6  bat

edited Oct 30 '18 at 12:11

jezrael

729,927
78
1,141
1,090

answered Feb 23 '15 at 17:49

Brian from QuantRocket

4,805
1
18
18

42

Although what you wrote is correct and more readable, a shorter method would be :`df[~df.C.str.contains("XYZ")]` – EdChum Feb 23 '15 at 17:58
1

When I do this it works perfectly, however it also does not show any rows in which the value was NaN. Is there a way to get those back so that the resulting data frame contains rows that do not contain the desired string and also NaN? – BenRichi_ Mar 12 '18 at 16:18
2

how would you do this if instead of "XYZ" you wanted to see if it contained anything inside of a large list of maybe a 1000 different things to look for. – 0004 Sep 14 '18 at 03:00
let's assume my column name is saved in a list in first position, i.e. columns_names[0] . How do I change the command? I tryed df[~df.columns_names[0].str.contains("XYZ")] but it does not work – Tms91 Nov 23 '19 at 18:28
What if the name of the column has a space in it? – A Bedoya Apr 06 '20 at 20:32
5

I get an error: `TypeError: bad operand type for unary ~: 'float'`, any ideas regarding this issue? – ah bon Jul 08 '20 at 02:34
2

The problem solved by adding `na=False` – ah bon Jul 08 '20 at 02:37
When I use '.' as symbol to drop, it drops the entire dataframe. How would you do it if you want to drop rows where a column includes a point? – Luca R Sep 12 '20 at 23:55
To remove rows which contains partial string, you can use str.contains(r'XYZ(?!$)') – DOT Nov 04 '21 at 14:14

score 137 · Answer 2 · answered Apr 13 '17 at 19:03

137

If your string constraint is not just one string you can drop those corresponding rows with:

df = df[~df['your column'].isin(['list of strings'])]

The above will drop all rows containing elements of your list

answered Apr 13 '17 at 19:03

Kenan

11,783
8
39
49

1

How would you do the inverse of this? I want to check if the column value contains any of the strings. pseudo: `for string in list_of_strings, check if column contains it` – radtek Apr 23 '17 at 05:39
7

Just remove the "~" df = df[df['your column'].isin(['list of strings'])] – Kenan Jun 26 '17 at 18:28
Already got that going, but thanks – radtek Jun 27 '17 at 17:54
3

What if we don't know the column? – Piyush S. Wanare Jul 03 '18 at 13:22
@PiyushS.Wanare if you know the column index you can always pass in df[df.columns[i]] where I is the column index – Kenan Jul 10 '18 at 13:25
2

How would you drop from multiple columns instead of just one? – Ali P May 23 '19 at 01:15
That's very tricky because the rows won't be the same for all columns, you can try `df[['list of cols']].isin([...])` – Kenan May 23 '19 at 17:46
2

This worked for me but I realized that 'list of strings' should be the exact string you want to remove – eafloresf Aug 05 '19 at 03:17
1

I don't know the reason, but does not work. No error message, gives back df without changes. – vesszabo Aug 25 '19 at 12:22

score 48 · Answer 3 · edited Feb 25 '19 at 07:28

48

This will only work if you want to compare exact strings. It will not work in case you want to check if the column string contains any of the strings in the list.

The right way to compare with a list would be :

searchfor = ['john', 'doe']
df = df[~df.col.str.contains('|'.join(searchfor))]

edited Feb 25 '19 at 07:28

Nihal

5,106
7
23
38

answered Aug 14 '17 at 18:58

Rupert Schiessl

749
5
10

score 25 · Answer 4 · answered Dec 07 '18 at 12:14

25

Slight modification to the code. Having na=False will skip empty values. Otherwise you can get an error TypeError: bad operand type for unary ~: float

df[~df.C.str.contains("XYZ", na=False)]

Source: TypeError: bad operand type for unary ~: float

answered Dec 07 '18 at 12:14

Devarshi Mandal

603
8
16

score 14 · Answer 5 · edited Apr 04 '18 at 09:00

14

new_df = df[df.C != 'XYZ']

Reference: https://chrisalbon.com/python/data_wrangling/pandas_dropping_column_and_rows/

edited Apr 04 '18 at 09:00

4b0

20,627
30
92
137

answered Apr 04 '18 at 08:58

Amy Annine

339
3
4

score 10 · Answer 6 · answered Aug 02 '18 at 17:36

10

The below code will give you list of all the rows:-

df[df['C'] != 'XYZ']

To store the values from the above code into a dataframe :-

newdf = df[df['C'] != 'XYZ']

answered Aug 02 '18 at 17:36

ak3191

553
5
14

1

This Syntax is way more easier to remember. Thanks – Saurabh Bade Jul 25 '20 at 17:14

score 3 · Answer 7 · answered Jun 12 '18 at 13:26

3

if you do not want to delete all NaN, use

df[~df.C.str.contains("XYZ") == True]

answered Jun 12 '18 at 13:26

Zhou Ruohua

31
1

How to drop rows from pandas data frame that contains a particular string in a particular column?

7 Answers7

Linked

Related