check element-wise for existence of string

Question

I'm looking for a way to check whether one string can be found in another string. str.contains only takes a fixed string pattern as its argument, but I'd like an element-wise comparison between two string columns.

import pandas as pd

df = pd.DataFrame({'long': ['sometext', 'someothertext', 'evenmoretext'],
               'short': ['some', 'other', 'stuff']})


# This fails:
df['short_in_long'] = df['long'].str.contains(df['short'])

Expected Output:

[True, True, False]

You've accepted an answer that is a near carbon copy of another. Not sure that kind of thing should be encouraged. Just FYI. — cs95, Mar 15 '19 at 13:51
I accepted the other answer only because your (excellent one) did not work in my actual case. So the other seems to be more general. — E. Sommer, Mar 15 '19 at 13:53
Not sure which version of the answer you're referring to. The other answer will also fail from a TypeError. And like I replied to your other comment, the initial edit with `str.contains` is wrong, because the check is *not* element wise. For example, using the contains solution, you will search for "some" across all rows, when it should have checked just the first. General, but completely wrong. — cs95, Mar 15 '19 at 13:57
You don't have to accept my answer, but you should check/verify the correctness of the solutions that you decide to accept, general or not... That's all. Have a nice day :) — cs95, Mar 15 '19 at 13:58
I did check them obviously, and I accepted the final version, not the initial (wrong) one. You have a nice day too :) — E. Sommer, Mar 15 '19 at 14:00

jezrael · Accepted Answer · 2019-03-15T13:27:22.367

5

Use list comprehension with zip:

df['short_in_long'] = [b in a for a, b in zip(df['long'], df['short'])]

print (df)
            long  short  short_in_long
0       sometext   some           True
1  someothertext  other           True
2   evenmoretext  stuff          False

edited Mar 15 '19 at 13:27

answered Mar 15 '19 at 13:23


score 3 · Answer 2 · answered Mar 15 '19 at 13:25

This is a prime use case for a list comprehension:

# df['short_in_long'] = [y in x for x, y in df[['long', 'short']].values.tolist()]
df['short_in_long'] = [y in x for x, y in df[['long', 'short']].values]
df

            long  short  short_in_long
0       sometext   some           True
1  someothertext  other           True
2   evenmoretext  stuff          False

List comprehensions are usually faster than pandas string methods because they carry less overhead. See "For loops with pandas - When should I care?".

If your data contains NaNs, you can call a function with error handling:

def try_check(haystack, needle):
    try:
        return needle in haystack
    except TypeError:
        return False

df['short_in_long'] = [try_check(x, y) for x, y in df[['long', 'short']].values]
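As a small illustration of the error-handling approach, here is a sketch with missing values in both columns. The frame `df_nan` and its contents are made up for this example; `try_check` is the function defined above, repeated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

def try_check(haystack, needle):
    # `in` raises TypeError when either side is NaN (a float), so treat that as "no match"
    try:
        return needle in haystack
    except TypeError:
        return False

# Hypothetical data: NaN in 'long' (row 1) and in 'short' (row 2)
df_nan = pd.DataFrame({'long': ['sometext', np.nan, 'evenmoretext'],
                       'short': ['some', 'other', np.nan]})

result = [try_check(x, y) for x, y in zip(df_nan['long'], df_nan['short'])]
print(result)  # [True, False, False]
```

Both kinds of missing value end up as False here; if you would rather keep them as missing, return `np.nan` from the `except` branch instead.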

score 3 · Answer 3 · answered Mar 15 '19 at 13:56

Check with numpy; it works row-wise :-)

# np.char is the public alias; np.core.char is private and deprecated in newer NumPy
np.char.find(df['long'].values.astype(str), df['short'].values.astype(str)) != -1
Out[302]: array([ True,  True, False])
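One caveat with the `astype(str)` conversion, sketched here on made-up data (the frame `df_np` is hypothetical): a missing value becomes the literal string 'nan', which can then match by accident.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the last 'short' value is missing
df_np = pd.DataFrame({'long': ['sometext', 'someothertext', 'banana'],
                      'short': ['some', 'other', np.nan]})

long_arr = df_np['long'].to_numpy().astype(str)
short_arr = df_np['short'].to_numpy().astype(str)   # NaN becomes the string 'nan'

# np.char.find returns the lowest match index per element, or -1 if not found
mask = np.char.find(long_arr, short_arr) != -1
print(short_arr[2], mask[2])  # 'nan' is found inside 'banana'
```

If the columns can contain NaN, it is probably safer to fill or drop them before converting, or to use the list comprehension with error handling.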

answered Mar 15 '19 at 13:56

BENY


I used `numpy` as well. I was trying `df['short_in_long'] = np.where(df['short'].str.contains(df['long']), True, False)`. Why does this not work row-wise? – Erfan Mar 15 '19 at 14:22
@Erfan isin will not check for a partial match :-) – BENY Mar 15 '19 at 14:23
Sorry, that was my second try. I edited my comment. – Erfan Mar 15 '19 at 14:24
@Erfan two parts: first, `str.contains` does not accept a `Series`; second, if you join the strings with `sep = '|'`, then when the 3rd row contains any of the partial strings, like `some` or `other`, the 3rd row will come back as True under `str.contains` – BENY Mar 15 '19 at 14:29
Thank you, so when using `sep = '|'` it would check each row against the whole column. – Erfan Mar 15 '19 at 14:33
@Erfan yes, that is why we cannot use str.contains for a row-wise (1-1) check – BENY Mar 15 '19 at 14:34
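The difference discussed in these comments can be sketched on a small made-up frame (the name `demo` and its third 'long' value are assumptions, chosen so the two approaches disagree):

```python
import pandas as pd

# Hypothetical data: row 2's 'long' contains 'some', which is row 0's 'short'
demo = pd.DataFrame({'long': ['sometext', 'someothertext', 'evenmoresome'],
                     'short': ['some', 'other', 'stuff']})

# 1-n check: every row is tested against the pattern 'some|other|stuff'
pattern = '|'.join(demo['short'])
one_to_many = demo['long'].str.contains(pattern).tolist()

# 1-1 check: each row is tested only against its own 'short' value
one_to_one = [s in l for l, s in zip(demo['long'], demo['short'])]

print(one_to_many)  # [True, True, True]
print(one_to_one)   # [True, True, False]
```

Row 2 is flagged by the joined pattern only because it contains another row's value. The joined values are also treated as a regular expression, so special characters in them would need escaping.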

Loochie · Answer 4 · 2019-03-15T16:23:33.823

1

Also,

df['short_in_long'] = df['long'].str.contains('|'.join(df['short'].values))

Update : I misinterpreted the problem. Here is the corrected version:

# apply over rows of the DataFrame (not the 'long' Series) so both columns are available
df['short_in_long'] = df.apply(lambda row: row['short'] in row['long'], axis=1)

edited Mar 15 '19 at 16:23

answered Mar 15 '19 at 13:49


That was my first, wrong answer. OP needs to check values row by row. – jezrael Mar 15 '19 at 13:51
As @jezrael mentioned, OP wants a row-wise (1-1) check, not a 1-n check – BENY Mar 15 '19 at 14:01
Thanks. I have corrected my code for one-to-one row-wise checking. – Loochie Mar 15 '19 at 16:24
