Comparison of strings not working in pandas dataframe?

Question

I'm building a boolean mask for my DataFrame (imported from a CSV file) based on string comparisons. The mask works with .str.contains, but not with ==.

This mask using .contains:

mask = (y_train['SEPSISPATOS'].str.contains('Yes')) | (y_train['SEPSHOCKPATOS'].str.contains('Yes')) | (y_train['OTHSYSEP'].str.contains('Sepsis')) | (y_train['OTHSESHOCK'].str.contains('Septic Shock'))

returns this output (note last line):

SEPSISPATOS   SEPSHOCKPATOS   OTHSYSEP   OTHSESHOCK           SEPSISPATOS
b'No'         b'No'  b'No Complication'  b'No Complication'   0
b'No'         b'No'  b'No Complication'  b'No Complication'   0
b'No'         b'No'  b'No Complication'  b'No Complication'   0
b'No'         b'No'  b'No Complication'  b'Septic Shock'      1

while this other mask using direct comparison

mask = (y_train['SEPSISPATOS']=='Yes') | (y_train['SEPSHOCKPATOS']=='Yes') | (y_train['OTHSYSEP']=='Sepsis') | (y_train['OTHSESHOCK']=='Septic Shock')

returns:

SEPSISPATOS   SEPSHOCKPATOS   OTHSYSEP   OTHSESHOCK           SEPSISPATOS
b'No'         b'No'  b'No Complication'  b'No Complication'   0
b'No'         b'No'  b'No Complication'  b'No Complication'   0
b'No'         b'No'  b'No Complication'  b'No Complication'   0
b'No'         b'No'  b'No Complication'  b'Septic Shock'      0

Suspecting that I have bytes rather than Python 3 Unicode strings, I tried decoding (below). I also tried .str.strip(). Neither worked. I need a fix that lets me use direct comparisons between strings for any column containing text.

Edit re: utf-8 decoding

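import pandas as pd

NSQIPdf_train = pd.read_csv("acs_nsqip_puf13_2.csv")
# Select the text (object-dtype) columns; the np.object alias no longer exists in current NumPy
str_df = NSQIPdf_train.select_dtypes([object])
# Attempt to decode every text cell as UTF-8
str_df = str_df.stack().str.decode('utf-8').unstack()
for col in str_df:
    NSQIPdf_train[col] = str_df[col]
y_train = NSQIPdf_train.loc[:, ['SEPSISPATOS', 'SEPSHOCKPATOS', 'OTHSYSEP', 'OTHSESHOCK']]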

This made things worse: every value in the output became NaN.

SEPSISPATOS   SEPSHOCKPATOS   OTHSYSEP   OTHSESHOCK        SEPSISPATOS
NaN            NaN            NaN        NaN               0
NaN            NaN            NaN        NaN               0
NaN            NaN            NaN        NaN               0
NaN            NaN            NaN        NaN               0

`y_train['SEPSHOCKPATOS'].str.==('Yes')` is not valid Python syntax — IanS, Jul 16 '19 at 15:32
finally, `.str` is an accessor, so `y_train['SEPSHOCKPATOS'].str == 'Yes'` is not doing what you think it does (try printing `y_train['SEPSHOCKPATOS'].str`) — IanS, Jul 16 '19 at 15:36
thank you, i've made those changes and updated the code (third block) in the original post to reflect it but y_train['SEPSISPATOS']=='Yes' doesn't seem to work either. same output as before. — michellemabelle, Jul 16 '19 at 15:45

score 0 · Answer 1 · answered Jul 16 '19 at 16:02

Use .str.decode('utf-8') to convert your byte values to strings before doing the comparison (see this question):

y_train['SEPSHOCKPATOS'].str.decode('utf-8') == 'Yes'

Note: I guess that .str.contains does a conversion under the hood.
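A minimal sketch of the decode-then-compare approach, assuming the column really holds Python bytes objects. The Series below is made-up illustration data, not the asker's file, and the names col, decoded and mask are hypothetical.

```python
import pandas as pd

# Hypothetical column of genuine bytes objects (an assumption for illustration).
col = pd.Series([b"No", b"Yes", b"No"], dtype=object)

# Decode each bytes value to a str, then compare directly.
decoded = col.str.decode("utf-8")
mask = decoded == "Yes"

print(mask.tolist())
```

This only helps when the elements are bytes. If they are already str, there is nothing to decode, so it is worth checking type(col.iloc[0]) first.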

IanS

I tried y_train['SEPSHOCKPATOS'].str.decode('utf-8') == 'Yes', which gave the same output as before. Based on the link in your suggestion, I also tried decoding the entire df beforehand, which caused more problems. Printing y_train['SEPSHOCKPATOS'].str.decode('utf-8') gives me NaNs. – michellemabelle Jul 16 '19 at 16:08
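One possible reading of the NaN result, offered as an assumption rather than a confirmed diagnosis: in the pandas versions I am aware of, .str.decode returns NaN for elements that are not bytes, so all-NaN output would be consistent with the CSV cells being ordinary str values whose text literally includes the b'...' wrapper. Under that assumption, the sketch below checks the element type and strips the wrapper before comparing. The sample data and the names raw, cleaned and mask are hypothetical.

```python
import pandas as pd

# Hypothetical data: str values whose text includes a b'...' wrapper
# (assumed for illustration; not confirmed by the thread).
raw = pd.Series(["b'No'", "b'Yes'", "b'Septic Shock'"], dtype=object)

# Inspect the Python type of the elements: here they are str, not bytes.
print(raw.map(type).unique())

# Strip the literal b'...' wrapper with a regex, keeping the inner text.
cleaned = raw.str.replace(r"^b'(.*)'$", r"\1", regex=True)

# Direct comparison against the cleaned text.
mask = cleaned == "Yes"
print(mask.tolist())
```

If the type check reports bytes instead, the wrapper is only how bytes are displayed, and decoding is the appropriate step rather than the regex.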

score 0 · Answer 2 · answered Nov 10 '20 at 17:17

I'm a newbie to Pandas, but maybe str.fullmatch helps - a stricter version of str.contains that matches the whole string, thus,

y_train['SEPSHOCKPATOS'].str.fullmatch('Yes')

although note that this is actually checking against a regular expression, so take care if the string you're using contains any special characters.
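To illustrate the caveat about special characters, here is a small sketch comparing str.contains with str.fullmatch, and showing re.escape as one way to treat the search text literally. It assumes pandas 1.1 or later (where str.fullmatch is available); the sample values and variable names are made up.

```python
import re

import pandas as pd

s = pd.Series(["Yes", "Yes, resolved", "No"], dtype=object)

# contains matches anywhere in the string; fullmatch must match the whole string.
contains_mask = s.str.contains("Yes")
full_mask = s.str.fullmatch("Yes")
print(contains_mask.tolist(), full_mask.tolist())

codes = pd.Series(["A+B", "AAB"], dtype=object)

# Unescaped, "A+B" is the regex "one or more A, then B", so it matches "AAB"
# and does not match the literal text "A+B".
unescaped = codes.str.fullmatch("A+B")

# re.escape makes the + literal, so only the exact text "A+B" matches.
escaped = codes.str.fullmatch(re.escape("A+B"))
print(unescaped.tolist(), escaped.tolist())
```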
