How to exclude a character from a regex group?

Question

I want to strip all non-alphanumeric characters EXCEPT the hyphen from a string in Python. How can I change this regular expression so that it matches any non-alphanumeric character except the hyphen?

re.compile(r'[\W_]')

Thanks.

eldarerathis · Accepted Answer · 2010-11-05T18:14:59.297

41

You could just use a negated character class instead:

re.compile(r"[^a-zA-Z0-9-]")

This will match anything that is not in the alphanumeric ranges or a hyphen. It also matches the underscore, as per your current regex.

>>> import re
>>> r = re.compile(r"[^a-zA-Z0-9-]")
>>> s = "some#%te_xt&with--##%--5 hy-phens  *#"
>>> r.sub("", s)
'sometextwith----5hy-phens'

Notice that this also removes spaces (which may well be what you want).
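If the spaces should be kept, one option (a sketch, not part of the original answer) is to add a space to the negated class. The hyphen stays last so that it is read as a literal rather than as a range operator. The names `keep_spaces` and `cleaned` are illustrative only.

```python
import re

s = "some#%te_xt&with--##%--5 hy-phens  *#"

# Same negated class as above, with a literal space added before the hyphen.
keep_spaces = re.compile(r"[^a-zA-Z0-9 -]")

cleaned = keep_spaces.sub("", s)
print(repr(cleaned))
```

Note that trailing spaces survive too, so a final `.strip()` may be wanted depending on the input.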

Edit: SilentGhost has suggested that it is likely cheaper for the engine to process this with a quantifier, in which case you can simply use:

re.compile(r"[^a-zA-Z0-9-]+")

The + causes each run of consecutive matching characters to be matched (and replaced) as a single unit.
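As a rough way to see the effect of the quantifier, the sketch below uses `re.subn`, which returns the number of substitutions alongside the new string. The variable names are illustrative, and the substitution count is only a proxy for cost; actual timings would need to be measured on real input.

```python
import re

s = "some#%te_xt&with--##%--5 hy-phens  *#"

single = re.compile(r"[^a-zA-Z0-9-]")  # one match per unwanted character
runs = re.compile(r"[^a-zA-Z0-9-]+")   # one match per run of unwanted characters

result_single, count_single = single.subn("", s)
result_runs, count_runs = runs.subn("", s)

# Both patterns produce the same cleaned string...
print(result_single == result_runs)
# ...but the quantified pattern does so with fewer substitutions.
print(count_single, count_runs)
```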

edited Nov 05 '10 at 18:14

answered Nov 05 '10 at 17:54

eldarerathis


+1 You are right, removed my answer as yours covers I think what he wants...to match any character not a number, letter, nor hyphen. – 逆さま Nov 05 '10 at 17:57
quantifier would make this cheaper. – SilentGhost Nov 05 '10 at 18:05
@SilentGhost: Is it cheaper for the engine? I'm assuming you mean in terms of performance time. – eldarerathis Nov 05 '10 at 18:10
@eldarerathis: is it cheaper to replace 10 items or 3? the actual number of items will depend on the subject of course. – SilentGhost Nov 05 '10 at 18:11
@SilentGhost: Cheaper to replace but more expensive to match (most likely, since the engine will be matching then backtracking at the end of each run). Although the replace seems like it would be the more expensive of the two operations, so you're probably right in the overall sense. – eldarerathis Nov 05 '10 at 18:17

score 9 · Answer 2 · answered Nov 05 '10 at 17:57

\w matches alphanumerics (and the underscore); add in the hyphen, then negate the entire set: r"[^\w-]"
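To make the underscore behaviour concrete, here is a small sketch comparing this pattern with the explicit-range one on an input containing an underscore. The third pattern, which adds `_` back as an alternative, is just one possible way to reproduce the original `[\W_]` behaviour while sparing the hyphen; it is an illustration and not part of this answer. Also note that in Python 3, `\w` matches Unicode word characters for `str` patterns unless `re.ASCII` is passed.

```python
import re

s = "te_xt-with#hy-phens"

explicit = re.sub(r"[^a-zA-Z0-9-]", "", s)     # removes "_" and "#"
word_class = re.sub(r"[^\w-]", "", s)          # removes "#" only; "_" survives
with_underscore = re.sub(r"[^\w-]|_", "", s)   # removes "_" and "#" again

print(explicit)
print(word_class)
print(with_underscore)
```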

answered Nov 05 '10 at 17:57

Ned Batchelder


I'd assume underscore is considered non-alphanumeric ;) – SilentGhost Nov 05 '10 at 17:59
This won't match/replace the underscore character, which the OP's current regex does. – eldarerathis Nov 05 '10 at 18:00
