re.sub erroring with "Expected string or bytes-like object"

Question

I have read multiple posts regarding this error, but I still can't figure it out. When I try to loop through my function:

def fix_Plan(location):
    letters_only = re.sub("[^a-zA-Z]",  # Search for all non-letters
                          " ",          # Replace all non-letters with spaces
                          location)     # Column and row to search

    words = letters_only.lower().split()
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if w not in stops]
    return " ".join(meaningful_words)

col_Plan = fix_Plan(train["Plan"][0])
num_responses = train["Plan"].size
clean_Plan_responses = []

for i in range(num_responses):
    clean_Plan_responses.append(fix_Plan(train["Plan"][i]))

Here is the error:

Traceback (most recent call last):
  File "C:/Users/xxxxx/PycharmProjects/tronc/tronc2.py", line 48, in <module>
    clean_Plan_responses.append(fix_Plan(train["Plan"][i]))
  File "C:/Users/xxxxx/PycharmProjects/tronc/tronc2.py", line 22, in fix_Plan
    location)  # Column and row to search
  File "C:\Users\xxxxx\AppData\Local\Programs\Python\Python36\lib\re.py", line 191, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object

If you are getting an error, *always post the full error including the stack trace*. — juanpa.arrivillaga, May 01 '17 at 22:48
Please `print(train["Plan"][i])` and see what it is. Do it before the call to `fix_Plan()` in the for loop. I don't think `train["Plan"][i]` is what you expected to be. — Taku, May 01 '17 at 22:50
It is a string from an excel document formatted like this: Video editing: Further develop video production skills using tools such as Wochit, Videolicious and iMovie. Develop a production plan specific to sports that matches effort to potential audience/impact. Expand HTML/CSS skills and identify one to two projects in Sports that could benefit from being presented in an HTML story then implement. — imanexcelnoob, May 01 '17 at 22:55
Are you *sure* it's a string? Try printing `type(train['Plan'][i])` — juanpa.arrivillaga, May 01 '17 at 22:57
C:\Users\xxxxx\AppData\Local\Programs\Python\Python36\lib\re.py looks like it is not string and re is complaining — oshaiken, May 01 '17 at 23:02
Ok, so apparently some of them are floats. How can I make them strings? — imanexcelnoob, May 01 '17 at 23:04
Well the majority of the types are strings, but there are a few that are floats. — imanexcelnoob, May 01 '17 at 23:07

score 149 · Accepted Answer · answered May 01 '17 at 23:08

As you stated in the comments, some of the values are floats, not strings. You need to convert them to strings before passing them to re.sub. The simplest way is to replace location with str(location) in the re.sub call. It doesn't hurt to do this even when the value is already a str.

letters_only = re.sub("[^a-zA-Z]",     # Search for all non-letters
                      " ",             # Replace all non-letters with spaces
                      str(location))   # Convert to str before searching
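
To see the effect end to end, here is a minimal sketch of the patched function. It assumes a small hard-coded stopword set (STOPS is a hypothetical stand-in for stopwords.words("english"), so nothing has to be downloaded) and lowercases the function name to fix_plan; neither is from the original code.

```python
import re

# Hypothetical stand-in for nltk's stopwords.words("english"), so the sketch
# runs without downloading any corpus.
STOPS = {"a", "an", "and", "the", "to", "of", "such", "as"}

def fix_plan(location):
    # str() leaves strings unchanged and turns floats (for example the NaN
    # that an empty spreadsheet cell usually becomes) into text for re.sub.
    letters_only = re.sub("[^a-zA-Z]", " ", str(location))
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in STOPS)

cleaned_text = fix_plan("Video editing: develop skills, such as iMovie.")
cleaned_nan = fix_plan(float("nan"))  # no TypeError any more
cleaned_num = fix_plan(3.5)           # digits and the dot are all stripped
```

Note that a NaN comes out as the literal word "nan" this way, which may or may not be what you want in the cleaned text.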

Taku

I wrote two notebooks, one in Jupyter and one in Kaggle Kernels. The Jupyter one works fine and produces correct output. The Kaggle notebook gave me an error; I followed your solution and the error went away, but now the sentiment prediction result is wrong. – Zaira Zafar Apr 30 '18 at 13:43

score 22 · Answer 2 · edited Jul 27 '20 at 16:17

The simplest solution is to apply Python's str function to the column you are looping through.

If you are using pandas, this can be implemented as:

dataframe['column_name'] = dataframe['column_name'].apply(str)
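
A rough sketch of what this does to a mixed column, assuming pandas and NumPy are installed. The three-row train frame is made-up data standing in for the asker's spreadsheet, with NaN playing the role of an empty cell. It also shows the fillna('') variant, which keeps missing values from turning into the text 'nan'.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for train["Plan"]; NaN mimics an empty cell.
train = pd.DataFrame({"Plan": ["Video editing", np.nan, "Expand HTML/CSS skills"]})

# Diagnose first: which Python types are actually in the column?
type_names = train["Plan"].map(lambda v: type(v).__name__).tolist()

# Plain conversion: every value becomes a str, NaN becomes the text "nan".
as_text = train["Plan"].apply(str).tolist()

# Fill missing values first if an empty string is preferable to "nan".
filled = train["Plan"].fillna("").apply(str).tolist()
```

Which of the two conversions is right depends on whether a missing plan should contribute a word to the cleaned text or nothing at all.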

edited Jul 27 '20 at 16:17

mario

answered Nov 01 '19 at 07:30

msaif

I would suggest filling NaN values with '' first: `dataframe['column_name'] = dataframe['column_name'].fillna('').apply(str)`, because in most use cases people will not want NaN to become the literal string 'nan'. – lowzhao Apr 06 '20 at 02:41

score 2 · Answer 3 · edited Sep 06 '21 at 18:19

I had the same problem. Nothing I tried solved it until I realized that the string contained two special characters.

In my case, the text contained these two characters:

&lrm; (Left-to-Right Mark) and &zwnj; (Zero-Width Non-Joiner)

Deleting these two characters solved the problem for me.

import re
mystring = "&lrm;Some Time W&zwnj;e"
mystring = re.sub(r"&lrm;", "", mystring)   # Remove the left-to-right mark
mystring = re.sub(r"&zwnj;", "", mystring)  # Remove the zero-width non-joiner

I hope this has helped someone who has a problem like me.
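
The snippet above removes the entity text "&lrm;" and "&zwnj;". If the string instead holds the actual invisible characters (U+200E and U+200C), which is an assumption about the data rather than something stated here, a sketch along these lines may be closer to what is needed. The variable names are only illustrative.

```python
import html
import re

# The same sample text, once as HTML entity text and once as real characters.
entity_text = "&lrm;Some Time W&zwnj;e"
raw = "\u200eSome Time W\u200ce"

# html.unescape turns the named entities into the characters they stand for.
decoded = html.unescape(entity_text)

# Remove both invisible characters in one pass with a character class.
cleaned = re.sub("[\u200e\u200c]", "", raw)
```

Decoding first and then stripping by code point handles both forms with the same cleanup step.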

score 0 · Answer 4 · answered Oct 27 '19 at 12:46

I think it would be better to use the re.match() function. Here is an example that may help you.

import re
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # Tokenizer data needed by word_tokenize
sentences = word_tokenize("I love to learn NLP \n 'a :(")
# Keep only the tokens that start with a letter, lowercased
sentences = [word.lower() for word in sentences if re.match('^[a-zA-Z]+', word)]
sentences
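
One thing to keep in mind with this approach: re.match only anchors at the start of the token, so a token like "NLP2" still passes, and re.match needs a string argument just as re.sub does. The sketch below contrasts it with re.fullmatch. It assumes plain str.split() instead of word_tokenize so it runs without NLTK, and the sample sentence is made up.

```python
import re

# str.split() is a simplification standing in for nltk's word_tokenize.
tokens = "I love to learn NLP2 'a :( 42".split()

# re.match checks only the beginning of each token.
starts_with_letter = [t.lower() for t in tokens if re.match(r"[a-zA-Z]+", t)]

# re.fullmatch requires the whole token to consist of letters.
letters_only = [t.lower() for t in tokens if re.fullmatch(r"[a-zA-Z]+", t)]
```

Use whichever matches the intent: tokens that begin with a letter, or tokens made of letters only.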
