403

I need to remove all characters from a string which aren't in a-z A-Z 0-9 set or are not spaces.

Does anyone have a function to do this?

kenorb
zuk1

7 Answers

787

Sounds like you almost knew what you wanted to do already; you basically defined it as a regex.

preg_replace("/[^A-Za-z0-9 ]/", '', $string);
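
For readers who want the pattern unpacked, here is a minimal sketch. The helper name `keepAlnumAndSpaces` is hypothetical (it is not part of PHP or of this answer); it only wraps the call above and annotates each piece of the regex.

```php
<?php
// Hypothetical helper (the name is illustrative): strips everything except
// ASCII letters, digits, and the space character.
function keepAlnumAndSpaces(string $input): string
{
    // /        opening delimiter
    // [^ ... ] negated character class: match any character NOT listed
    // A-Za-z   ASCII letters, then 0-9 digits, then a literal space
    // /        closing delimiter (no flags)
    // preg_replace() returns null on error, so fall back to an empty string.
    return preg_replace('/[^A-Za-z0-9 ]/', '', $input) ?? '';
}

echo keepAlnumAndSpaces("Hello, World! It's 2024."), "\n"; // Hello World Its 2024
```

Note that `preg_replace()` returns the cleaned string rather than modifying its argument, so the result has to be assigned or used.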
Louis
Chad Birch
  • 10
    zuk1: regexbuddy is a great help with that – relipse May 12 '14 at 17:13
  • 3
    Here's an example if you want to include the hyphen as an allowed character. I needed this because I needed to strip out disallowed characters from a Moodle username, based on email addresses: preg_replace("/[^a-z0-9_.@\-]/", '', $string); – Evan Donovan May 22 '14 at 15:17
  • 2
    Would this work exactly the same with apostrophes (single-quotes) around the regular expression, instead of quotation marks (double-quotes)? E.g: `preg_replace('/[^A-Za-z0-9 ]/', '', $string);` – 2540625 Mar 20 '15 at 17:46
  • 4
    We want explanation about this :) . People come here to see Why it is the way it is. Please consider Regex explanation too! Thanks – Pratik Dec 06 '15 at 10:44
  • A much better answer is below. – i-g Mar 04 '16 at 11:03
  • 4
    What if we want to keep accented characters? – wonzbak Jun 23 '16 at 09:00
  • Does it matter single or double quote? – Ömer An May 15 '20 at 19:39
  • as noted by @wonzbak this does not keep accent chars – albanx Apr 13 '22 at 14:48
188

For Unicode characters, use:

preg_replace("/[^[:alnum:][:space:]]/u", '', $string);
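
A small sketch of how this behaves, assuming the input is valid UTF-8 and PHP's PCRE build has Unicode support (the usual case). The function name `keepUnicodeAlnum` is made up for illustration.

```php
<?php
// Hypothetical helper: keeps letters, digits, and whitespace.
function keepUnicodeAlnum(string $input): string
{
    // [:alnum:] and [:space:] are POSIX classes; they are only valid inside [ ].
    // The u flag makes PCRE treat the pattern and the subject as UTF-8.
    // Invalid UTF-8 makes preg_replace() return null, hence the fallback.
    return preg_replace('/[^[:alnum:][:space:]]/u', '', $input) ?? '';
}

echo keepUnicodeAlnum("caf\u{00e9} #1!"), "\n"; // café 1
echo keepUnicodeAlnum("tab\there, ok?"), "\n";  // the tab and the space are kept
```

`[:space:]` keeps tabs and newlines too, not just the space character; a literal space in the class is the stricter choice if only spaces should survive.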
voondo
  • hi voondo , what's with the /ui thing.. what do you call it ? can anyone please shed me some light. Thank you. – Kevin Florenz Daus Feb 28 '14 at 07:39
  • 5
    For clarification, they're called flags. They're put after the closing delimiter (in this case it's "/", but it could be "~" or "@" or whatever character you want to use as long as the opening and closing delimiters are the same) and change the behavior of the expression. – Doktor J Apr 13 '14 at 22:04
  • 1
    Btw, `\w` includes `\d` and so the `\d` is unnecessary. Also, this is wrong because it will also leave underscores in the resulting string (which is also included in `\w`). – smathy Aug 16 '14 at 20:42
  • 3
    There's still an error in this, the character classes need to be terminated with ':]' so the correct line would be: preg_replace("/[^[:alnum:][:space:]]/ui", '', $string); – h00ligan Nov 17 '14 at 14:03
  • 5
    Is the `i` flag really necessary here since `[:alnum:]` already covers both cases? – Eaten by a Grue Sep 25 '15 at 12:28
  • This solution worked until i migrated to php 7.3, replaced with ```preg_replace("/[^a-z\d\s]/iu", '', $str);``` – pgee70 Nov 17 '19 at 23:31
  • this solution works fine with php 8 as well. I think is the best – albanx Apr 13 '22 at 14:52
56

Regular expression is your answer.

$str = preg_replace('/[^a-z\d ]/i', '', $str);
  • The i stands for case insensitive.
  • ^ at the start of a character class negates it: the class matches any character that is not listed.
  • \d matches any digit.
  • a-z matches all characters between a and z. Because of the i flag you don't have to specify both a-z and A-Z.
  • After \d there is a space, so spaces are allowed in this regex.
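
To see the flag and the class from the list above in action, here is a short sketch; the sample string is invented for illustration.

```php
<?php
$pattern = '/[^a-z\d ]/i'; // negated class; the i flag makes a-z also cover A-Z

// Uppercase letters survive only because of the i flag.
echo preg_replace($pattern, '', 'PHP 8.3 Rocks!'), "\n";      // "PHP 83 Rocks"

// Without the i flag, the same class removes uppercase letters too.
echo preg_replace('/[^a-z\d ]/', '', 'PHP 8.3 Rocks!'), "\n"; // " 83 ocks"
```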
topher
raspi
  • 4
    We want explanation about this :) . People come here to see Why it is the way it is. Please consider Regex explanation too! Not everyone is advanced enough to know what you wrote there without explanation. Thanks – Pratik Dec 06 '15 at 10:48
  • @PratikCJoshi The i stands for case insensitive. ^ at the start of a character class negates it, so the class matches any character that is not listed. \d matches any digit. a-z matches all characters between a and z. Because of the i flag you don't have to specify both a-z and A-Z. After \d there is a space, so spaces are allowed in this regex. – bart Feb 10 '16 at 04:21
  • 1
    People **don't** read comments as answer. Please update answer! – Pratik Feb 10 '16 at 08:54
31

If you need to support other languages, instead of the typical A-Z, you can use the following:

preg_replace('/[^\p{L}\p{N} ]+/u', '', $string);
  • [^\p{L}\p{N} ] defines a negated character class (it matches any character that is not listed) of:
    • \p{L}: a letter from any language.
    • \p{N}: a numeric character in any script.
    • (a literal space): a space character.
  • + greedily matches the character class between 1 and unlimited times.
  • u treats the pattern and the subject as UTF-8, so multibyte characters are matched as whole characters rather than as separate bytes.

This will preserve letters and numbers from other languages and scripts as well as A-Z:

preg_replace('/[^\p{L}\p{N} ]+/u', '', 'hello-world'); // helloworld
preg_replace('/[^\p{L}\p{N} ]+/u', '', 'abc@~#123-+=öäå'); // abc123öäå
preg_replace('/[^\p{L}\p{N} ]+/u', '', '你好世界!@£$%^&*()'); // 你好世界

Note: This is a very old, but still relevant question. I am answering purely to provide supplementary information that may be useful to future visitors.

Jonathon
17

Here's a really simple regex for that:

\W|_

Use it as needed, wrapped in forward-slash (/) delimiters:

preg_replace("/\W|_/", '', $string);
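
One caveat worth checking against the question: `\W` also matches the space character, so this pattern removes spaces as well. A quick sketch with an invented sample string:

```php
<?php
$input = 'Hello, big_world 42!';

// \W matches any non-word character (spaces included); _ is listed
// separately because \w treats the underscore as a word character.
echo preg_replace('/\W|_/', '', $input), "\n";          // Hellobigworld42

// Keeping spaces, as the question asks, needs an explicit class instead.
echo preg_replace('/[^A-Za-z0-9 ]/', '', $input), "\n"; // Hello bigworld 42
```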

Test it here with this great tool that explains what the regex is doing:

http://www.regexr.com/

scrollup
Alex Stephens
  • 1
    You still need the `/u` flag otherwise non-ascii letters are also removed. – Xeoncross Dec 30 '14 at 19:52
  • Neat [but would also match spaces](https://www.regex101.com/r/afwxAB/1) and if this is wanted, probably could double the performance by use of a *character class* and additional *quantifier* for *one or more* [`[\W_]+`](https://www.regex101.com/r/afwxAB/2) – bobble bubble Dec 31 '16 at 02:00
13
[\W_]+

 

$string = preg_replace("/[\W_]+/u", '', $string);

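It selects every character that is not A-Z, a-z, or 0-9 (spaces and underscores included) and deletes it. With the u flag, letters and digits from other scripts are kept as well.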
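
The u flag changes what `\W` matches. A hedged sketch, assuming UTF-8 input and a PCRE build with Unicode support; the sample string is made up for illustration.

```php
<?php
$input = "na\u{00ef}ve_caf\u{00e9} 42!"; // "naïve_café 42!"

// With u, \w also covers Unicode letters and digits, so ï and é are kept.
// Spaces, the underscore, and punctuation are still removed.
echo preg_replace('/[\W_]+/u', '', $input), "\n"; // naïvecafé42

// Without u, the subject is processed byte by byte, and the bytes that
// encode ï and é count as non-word characters, so they are typically removed.
echo preg_replace('/[\W_]+/', '', $input), "\n";
```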

See example here: https://regexr.com/3h1rj

Intacto
  • 1
    what does this regex /[\W_]+/u means ? – Ângelo Rigo Dec 04 '17 at 17:38
  • 1
    `\W` is the inverse of `\w` which are characters `A-Za-z0-9_`. So `\W` will match any character that is not `A-Za-z0-9_` and remove them. The `[]` is a [character set boundary](https://www.regular-expressions.info/charclass.html). The`+` is redundant on a character set boundary but normally means 1 or more character. The `u` flag expands the expression to include unicode character support, meaning it will not remove characters beyond character code 255 such as `ª²³µ` . Example of various usages https://3v4l.org/hSVV5 with unicode and ascii characters. – Will B. Apr 25 '19 at 14:33
3
preg_replace("/\W+/", '', $string);

You can test it here : http://regexr.com/

PASTAGA