
I'm building a boolean mask for my DataFrame (imported from a CSV file) based on string comparisons. The mask works when I use .str.contains, but not when I use ==.

This mask using .contains:

mask = (y_train['SEPSISPATOS'].str.contains('Yes')) | (y_train['SEPSHOCKPATOS'].str.contains('Yes')) | (y_train['OTHSYSEP'].str.contains('Sepsis')) | (y_train['OTHSESHOCK'].str.contains('Septic Shock'))

returns this output (note last line):

SEPSISPATOS   SEPSHOCKPATOS   OTHSYSEP   OTHSESHOCK           SEPSISPATOS
b'No'         b'No'  b'No Complication'  b'No Complication'   0
b'No'         b'No'  b'No Complication'  b'No Complication'   0
b'No'         b'No'  b'No Complication'  b'No Complication'   0
b'No'         b'No'  b'No Complication'  b'Septic Shock'      1            

while this other mask using direct comparison

mask = (y_train['SEPSISPATOS']=='Yes') | (y_train['SEPSHOCKPATOS']=='Yes') | (y_train['OTHSYSEP']=='Sepsis') | (y_train['OTHSESHOCK']=='Septic Shock')

returns:

SEPSISPATOS   SEPSHOCKPATOS   OTHSYSEP   OTHSESHOCK           SEPSISPATOS
b'No'         b'No'  b'No Complication'  b'No Complication'   0
b'No'         b'No'  b'No Complication'  b'No Complication'   0
b'No'         b'No'  b'No Complication'  b'No Complication'   0
b'No'         b'No'  b'No Complication'  b'Septic Shock'      0            

Wondering whether I have bytes objects rather than Python 3 Unicode strings, I tried decoding (below). I also tried .str.strip(). Neither worked. I need a fix that lets me use direct string comparisons on any column containing text.

Edit re: utf-8 decoding

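import pandas as pd

NSQIPdf_train = pd.read_csv("acs_nsqip_puf13_2.csv")
# select the text (object-dtype) columns; the np.object alias was removed from NumPy, the builtin object is equivalent
str_df = NSQIPdf_train.select_dtypes([object])
# decode every cell, then restore the original column layout
str_df = str_df.stack().str.decode('utf-8').unstack()
for col in str_df:
    NSQIPdf_train[col] = str_df[col]
y_train = NSQIPdf_train.loc[:, ['SEPSISPATOS', 'SEPSHOCKPATOS', 'OTHSYSEP', 'OTHSESHOCK']]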

This made the problem worse: the output became:

SEPSISPATOS   SEPSHOCKPATOS   OTHSYSEP   OTHSESHOCK        SEPSISPATOS
NaN            NaN            NaN        NaN               0
NaN            NaN            NaN        NaN               0
NaN            NaN            NaN        NaN               0
NaN            NaN            NaN        NaN               0          
  • `y_train['SEPSHOCKPATOS'].str.==('Yes')` is not valid Python syntax – IanS Jul 16 '19 at 15:32
  • besides, the parentheses around `'Yes'` are a distraction – IanS Jul 16 '19 at 15:35
  • finally, `.str` is an accessor, so `y_train['SEPSHOCKPATOS'].str == 'Yes'` is not doing what you think it does (try printing `y_train['SEPSHOCKPATOS'].str`) – IanS Jul 16 '19 at 15:36
  • Thank you, I've made those changes and updated the code (third block) in the original post to reflect them, but y_train['SEPSISPATOS']=='Yes' doesn't seem to work either. Same output as before. – michellemabelle Jul 16 '19 at 15:45
  • `.str.decode('utf-9')` should be utf-8 – IanS Jul 17 '19 at 05:48

2 Answers


Use .str.decode('utf-8') to convert your byte values to strings before doing the comparison (see this question):

y_train['SEPSHOCKPATOS'].str.decode('utf-8') == 'Yes'

Note: I guess that .str.contains does a conversion under the hood.
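As a minimal sketch of the suggestion above, assuming the column really holds Python bytes objects (the sample Series below is made up for illustration and is not the asker's data):

```python
import pandas as pd

# Hypothetical sample: a column whose cells are genuine bytes objects
s = pd.Series([b'No', b'Yes', b'No'])

# Comparing bytes to a str never matches in Python 3: b'Yes' != 'Yes'
print((s == 'Yes').tolist())  # [False, False, False]

# Decode to text first, then compare
decoded = s.str.decode('utf-8')
mask = decoded == 'Yes'
print(mask.tolist())  # [False, True, False]
```

If the cells are already str rather than bytes, there is nothing to decode, so it is worth checking type(s.iloc[0]) before applying this.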

IanS
  • I tried y_train['SEPSHOCKPATOS'].str.decode('utf-8') == 'Yes', which gave the same output as before. Based on the link in your suggestion, I also tried decoding the entire df beforehand, which actually caused more problems. Printing y_train['SEPSHOCKPATOS'].str.decode('utf-8') gives me NaNs. – michellemabelle Jul 16 '19 at 16:08
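
The NaNs reported in the comment above would be consistent with the column holding ordinary text that merely looks like a bytes literal (the five characters b'No' stored as a str). That would also explain why .str.contains('Yes') finds a match while == 'Yes' does not. This is an assumption; the thread does not confirm it. Under that assumption, a rough sketch of checking the cell type and unwrapping the text could look like this (the sample data and the helper name unwrap are hypothetical):

```python
import ast
import pandas as pd

# Hypothetical sample: text that only looks like bytes literals
s = pd.Series(["b'No'", "b'Septic Shock'"])
print(type(s.iloc[0]))  # <class 'str'>, so there is nothing to decode

def unwrap(value):
    """Turn the text "b'No'" into 'No'; leave any other value untouched."""
    if isinstance(value, str) and value.startswith(("b'", 'b"')):
        # literal_eval parses the bytes literal safely; decode gives text
        return ast.literal_eval(value).decode('utf-8')
    return value

clean = s.map(unwrap)
mask = clean == 'Septic Shock'
print(mask.tolist())  # [False, True]
```

After this cleanup, direct == comparisons behave as expected on the unwrapped column.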

I'm a newbie to Pandas, but maybe str.fullmatch helps - a stricter version of str.contains that matches the whole string, thus,

y_train['SEPSHOCKPATOS'].str.fullmatch('Yes')

although note that this is actually checking against a regular expression, so take care if the string you're using contains any special characters.
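
To illustrate the caveat about special characters, here is a small sketch on made-up data. It assumes pandas 1.1 or later, where Series.str.fullmatch is available, and uses re.escape so the target is matched literally; the sample values and the name target are hypothetical.

```python
import re
import pandas as pd

# Hypothetical sample values, not taken from the asker's file
s = pd.Series(['Yes', 'Yes, maybe', 'No', 'Septic Shock (suspected)'])

# contains matches substrings; fullmatch requires the whole string to match
print(s.str.contains('Yes').tolist())   # [True, True, False, False]
print(s.str.fullmatch('Yes').tolist())  # [True, False, False, False]

# Parentheses are regex metacharacters, so escape the target first
target = 'Septic Shock (suspected)'
mask = s.str.fullmatch(re.escape(target))
print(mask.tolist())  # [False, False, False, True]
```

Without re.escape, the parentheses would be read as a regex group and the last row would not match.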

Andrew Richards