39

I'm wondering if there is a more efficient way to use the str.contains() function in Pandas, to search for two partial strings at once. I want to search a given column in a dataframe for data that contains either "nt" or "nv". Right now, my code looks like this:

    df[df['Behavior'].str.contains("nt", na=False)]
    df[df['Behavior'].str.contains("nv", na=False)]

And then I append one result to another. What I'd like to do is use a single line of code to search for any data that includes "nt" OR "nv" OR "nf." I've played around with some ways that I thought should work, including just sticking a pipe between terms, but all of these result in errors. I've checked the documentation, but I don't see this as an option. I get errors like this:

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-113-1d11e906812c> in <module>()
    3 
    4 
    ----> 5 soctol = f_recs[f_recs['Behavior'].str.contains("nt"|"nv", na=False)]
    6 soctol

    TypeError: unsupported operand type(s) for |: 'str' and 'str'

Is there a fast way to do this? Thanks for any help, I am a beginner but am LOVING pandas for data wrangling.

smci
M.A.Kline
  • *Note*: There is a solution [described by @unutbu](https://stackoverflow.com/a/48600345/9209546) which is more efficient than using `pd.Series.str.contains`. If performance is an issue, then this may be worth investigating. – jpp May 06 '18 at 22:19
  • Highly recommend checking out [this answer](https://stackoverflow.com/a/55335207) for more info on partial string search with multiple keywords/regexes. – cs95 Apr 07 '19 at 21:07
  • This is a simple typo: you just needed `..str.contains("nt|nv")`. The '|' bar goes inside the regex, not between two strings. – smci Feb 22 '20 at 01:27

2 Answers

70

The alternatives should be combined into one regular expression, passed as a single string:

    "nt|nv"  # rather than "nt" | "nv"
    f_recs[f_recs['Behavior'].str.contains("nt|nv", na=False)]

Python doesn't let you use the `|` (bitwise or) operator between two strings:

    In [1]: "nt" | "nv"
    TypeError: unsupported operand type(s) for |: 'str' and 'str'
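
As a rough illustration of the above, here is a minimal sketch on made-up data. Only the column name `Behavior` and the search terms come from the question; the sample values and the names `either`, `terms`, `pattern`, and `any_of` are assumptions for the example. It also shows one possible way to build the pattern from a list of terms, using `re.escape` in case a term contains regex metacharacters.

```python
import re

import pandas as pd

# Hypothetical sample data; only the column name 'Behavior' is from the question.
df = pd.DataFrame({"Behavior": ["nt-1", "nv-2", "nf-3", "other", None]})

# One regex string with alternation: rows containing "nt" OR "nv".
# na=False treats missing values as non-matches.
either = df[df["Behavior"].str.contains("nt|nv", na=False)]

# Build the pattern from a list of terms; re.escape keeps each term literal.
terms = ["nt", "nv", "nf"]
pattern = "|".join(re.escape(t) for t in terms)
any_of = df[df["Behavior"].str.contains(pattern, na=False)]

print(either["Behavior"].tolist())
print(any_of["Behavior"].tolist())
```

This replaces the two separate filters plus an append with a single boolean mask, so each row appears at most once even if it contains several of the terms.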
smci
Andy Hayden
  • Thanks, such a beauty! Caution though: there must be no space between the pipe and the search terms. – kabrapankaj32 Apr 28 '17 at 08:00
  • @jaknap32: If you use the `(?x)` modifier, you may add spaces wherever you want, e.g. `"(?x)nt | nv"` (but if you have meaningful spaces in the pattern, you will need to escape them, as well as the `#` char). See [Python `re.X` docs](https://docs.python.org/2/library/re.html#re.VERBOSE). Anyway, `n[tv]` is a better regex than `nt|nv`. – Wiktor Stribiżew Aug 27 '17 at 08:23
  • +1 for the "na=False" expression. My data has gaps in it and my string contains function won't work without it. – Arthur D. Howland Sep 07 '17 at 14:09
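
The points raised in the comments above (literal spaces around the pipe, the `(?x)` modifier, and the character class `n[tv]`) can be checked with a small sketch. The sample Series and the variable names are made up for illustration, and `dtype=object` is an assumption chosen so that matching goes through Python's `re` module.

```python
import pandas as pd

# Hypothetical data; object dtype so that Python's re engine does the matching.
s = pd.Series(["nt", "nv", "xx", None], dtype=object)

# Plain alternation, no spaces around the pipe.
plain = s.str.contains("nt|nv", na=False)

# Spaces are part of the pattern: this searches for "nt " or " nv".
spaced = s.str.contains("nt | nv", na=False)

# With (?x) (verbose mode), unescaped whitespace in the pattern is ignored.
verbose = s.str.contains("(?x)nt | nv", na=False)

# Character class: "n" followed by "t" or "v".
char_class = s.str.contains("n[tv]", na=False)

print(pd.DataFrame({"plain": plain, "spaced": spaced,
                    "verbose": verbose, "char_class": char_class}))
```

On this data the `spaced` column is the only one that differs from `plain`, because none of the values contain a space.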
-1

I tried this and it works:

    df[df['Behavior'].str.contains('nt|nv', na=False)]