155

How can I remove all special characters from a string in R and replace them with spaces?

Some special characters to remove are: ~!@#$%^&*(){}_+:"<>?,./;'[]-=

I've tried a regex with the [:punct:] pattern, but it removes only punctuation marks.

Question 2: How can I also remove characters from foreign languages, such as â í ü Â á ą ę ś ć?

Answer: Use [^[:alnum:]] to remove ~!@#$%^&*(){}_+:"<>?,./;'[]-= and use [^a-zA-Z0-9] to also remove â í ü Â á ą ę ś ć in regular-expression functions such as gsub.

Solution in base R :

x <- "a1~!@#$%^&*(){}_+:\"<>?,./;'[]-=" 
gsub("[[:punct:]]", "", x)  # no libraries needed
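
To make the difference between the two patterns from the answer above concrete, here is a minimal base R sketch. The sample string is made up for illustration, and the behaviour of `[^[:alnum:]]` on accented letters is an assumption that depends on your locale, so treat the first result as something to check on your own machine.

```r
# Sample input mixing ASCII letters, digits, punctuation and accented letters.
# The \u escapes stand for "â", "í" and "ü" and keep this script ASCII-only.
x <- "a1~!@#\u00e2\u00ed\u00fc z9"

# POSIX class: what counts as alphanumeric depends on the locale, so accented
# letters are usually kept in a UTF-8 locale.
posix_clean <- gsub("[^[:alnum:]]", " ", x)

# Explicit ASCII ranges: anything outside a-z, A-Z, 0-9 becomes a space,
# including the accented letters.
ascii_clean <- gsub("[^a-zA-Z0-9]", " ", x)

print(posix_clean)
print(ascii_clean)
```

Each replaced character becomes exactly one space, so the output has the same number of characters as the input.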
Qbik

3 Answers

247

You need to use regular expressions to identify the unwanted characters. For the most readable code, use str_replace_all from the stringr package, though gsub from base R works just as well.

The exact regular expression depends upon what you are trying to do. You could just remove those specific characters that you gave in the question, but it's much easier to remove all punctuation characters.

library(stringr)

x <- "a1~!@#$%^&*(){}_+:\"<>?,./;'[]-=" # or whatever
str_replace_all(x, "[[:punct:]]", " ")

(The base R equivalent is gsub("[[:punct:]]", " ", x).)

An alternative is to swap out all non-alphanumeric characters.

str_replace_all(x, "[^[:alnum:]]", " ")

Note that the definition of what constitutes a letter, a number, or a punctuation mark varies slightly depending upon your locale, so you may need to experiment a little to get exactly what you want.
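
If you want the replacement and an optional clean-up of the leftover spaces in one place, something like the following base R sketch may help. `replace_special()` and its `squish` argument are hypothetical names invented for this example; they are not part of base R or stringr.

```r
# Hypothetical helper: swap every non-alphanumeric character for `replacement`.
replace_special <- function(x, replacement = " ", squish = FALSE) {
  out <- gsub("[^[:alnum:]]", replacement, x)
  if (squish) {
    # Collapse runs of whitespace into one space and trim both ends.
    out <- trimws(gsub("[[:space:]]+", " ", out))
  }
  out
}

replace_special("foo--bar_baz!")                 # "foo  bar baz "
replace_special("foo--bar_baz!", squish = TRUE)  # "foo bar baz"
```

Without `squish`, every special character maps to exactly one space, which is what the question asked for; with it, you avoid the runs of blanks that result.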

Richie Cotton
  • 14
    nice answers +1 You may want to replace the `" "` with `""` otherwise you have empty white space in the string. – Tyler Rinker Apr 24 '12 at 10:50
  • 8
    @TylerRinker: True, though QBik did specifically ask for spaces. – Richie Cotton Apr 24 '12 at 13:04
  • 9
    How can I remove those crazy characters: `â í ü Â á`? – Qbik Apr 24 '12 at 15:51
  • 1
    You need to read up on regular expressions. Start with the link in my answer, and then read `?regex` and `?regexpr`. – Richie Cotton Apr 24 '12 at 16:10
  • 3
    Try replacing `[^[:alnum:]]` with `[^a-zA-Z0-9]` or possibly `\\W`. – Richie Cotton Apr 24 '12 at 17:33
  • 1
    The `[[:punct:]]` option sometimes does not work, e.g. `str_replace_all("jef|bezos", "[[:punct:]]", " ")`, but `[^[:alnum:]]` works. It is safer for strict situations. – Suat Atan PhD Mar 25 '20 at 19:49
  • I have the problem that I can't even save x as stated in your answer @RichieCotton. If x contains any special character my R Studio does nothing else than telling me that I used an unknown escape sequence (none). – maxgotstuck Mar 14 '22 at 10:42
  • @maxgotstuck You need to double your backslashes to escape them. For example, instead of including `\r`, include `\\r`. – Richie Cotton Mar 29 '22 at 01:13
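
A short base R sketch of two points raised in the comments above: how `\\W` differs from `[^[:alnum:]]`, and how a doubled backslash looks in a string literal. The sample string is made up for illustration.

```r
# "\\" in R source is ONE backslash in the actual string.
x <- "snake_case|pipe\\slash"

# \W treats the underscore as a word character, so it is kept.
w_clean <- gsub("\\W", " ", x)

# [^[:alnum:]] replaces the underscore as well.
alnum_clean <- gsub("[^[:alnum:]]", " ", x)

print(w_clean)      # "snake_case pipe slash"
print(alnum_clean)  # "snake case pipe slash"
```

So `\\W` is only a drop-in replacement if you are happy to keep underscores.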
57

Instead of using regex to remove those "crazy" characters, just convert them to ASCII, which will remove accents, but will keep the letters.

astr <- "Ábcdêãçoàúü"
iconv(astr, from = 'UTF-8', to = 'ASCII//TRANSLIT')

which results in

[1] "Abcdeacoauu"
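
Transliteration can also be combined with the regex approach, so that accented letters survive as plain letters and everything else becomes a space. The sketch below is one way to do that, not a definitive recipe: `to_ascii_words()` is a hypothetical helper name, and the output of `ASCII//TRANSLIT` is platform-dependent (some systems return `NA` or substitute characters), which is why the sketch includes a fallback that simply drops what cannot be converted.

```r
# Hypothetical helper: transliterate to ASCII, then blank out what is left.
to_ascii_words <- function(s) {
  s <- enc2utf8(s)  # make sure the input really is UTF-8
  out <- iconv(s, from = "UTF-8", to = "ASCII//TRANSLIT")
  # iconv() returns NA when a string cannot be converted; for those,
  # fall back to dropping the non-ASCII bytes.
  failed <- is.na(out) & !is.na(s)
  out[failed] <- iconv(s[failed], from = "UTF-8", to = "ASCII", sub = "")
  # Whatever is still not an ASCII letter or digit becomes a space.
  gsub("[^a-zA-Z0-9]", " ", out)
}

# \u00c1 is "Á" and \u00ea is "ê".
to_ascii_words(c("\u00c1bcd\u00ea!", "abc-123"))
```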
Felipe Alvarenga
  • 2
    I had to add `iconv(astr, from="UTF-8", to="ASCII//TRANSLIT")`, otherwise with French characters like `ç` it goes a bit funny. – Duccio A Aug 21 '18 at 08:42
  • 1
    No workee in USA on Windoze, I get an 'NA' ... `Sys.getlocale()` ... ```LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252``` – mshaffer Apr 18 '21 at 18:53
  • 1
    This worked: ```str = "Ábcdêãçoàúü"; str = iconv(str, from = '', to = 'ASCII//TRANSLIT'); str;``` – mshaffer Apr 18 '21 at 19:01
  • It also works on Windows with `readLines(data, encoding="UTF-8")` – alittleloopy Aug 22 '21 at 00:48
13

Convert the special characters to apostrophes:

Data <- gsub("[^0-9A-Za-z///' ]", "'", Data, ignore.case = TRUE)

The code below removes the extra apostrophes:

Data <- gsub("''", "", Data, ignore.case = TRUE)

Use the gsub() function to replace each special character with an apostrophe.

pirho
UMESH NITNAWARE
  • What about pre-existing code or comments that *already* contains `''` two consecutive apostrophes? Won't those be affected? Don't those need to be protected/escaped first? – jubilatious1 Sep 18 '21 at 15:13
  • Also '/' gets through, when it shouldn't. – Ruben Apr 07 '22 at 15:40
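
One possible way around both issues raised in these comments is to skip the apostrophe placeholder and replace directly with spaces, using a character class that does not list `/`. This is a sketch with a made-up `Data` vector, not the answer author's method; it keeps letters, digits, apostrophes and spaces, which is an assumption about what you want to preserve.

```r
# Made-up sample data: one string with slashes and dashes, one that already
# contains two consecutive apostrophes.
Data <- c("it's 50/50 -- ok?", "a''b")

# No "/" inside the class, so slashes are replaced too. Existing apostrophes
# are left alone because no placeholder step touches them.
cleaned <- gsub("[^0-9A-Za-z' ]", " ", Data)

print(cleaned)
```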