
I want to strip all non-alphanumeric characters except the hyphen from a string in Python. How can I change this regular expression so that it matches any non-alphanumeric character except the hyphen?

re.compile(r'[\W_]')

Thanks.

atp

2 Answers


You could just use a negated character class instead:

re.compile(r"[^a-zA-Z0-9-]")

This will match anything that is not in the alphanumeric ranges or a hyphen. It also matches the underscore, as per your current regex.

>>> import re
>>> r = re.compile(r"[^a-zA-Z0-9-]")
>>> s = "some#%te_xt&with--##%--5 hy-phens  *#"
>>> r.sub("",s)
'sometextwith----5hy-phens'

Notice that this also removes spaces (which may well be what you want).
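If spaces should survive, one option is to add a space to the negated class. This is only a sketch: the sample string and the names `keep_spaces` and `text` are made up for illustration, and the hyphen stays last in the class so that it is read literally.

```python
import re

# Hypothetical variant: a space inside the negated class means
# spaces are no longer matched, so they are kept.
keep_spaces = re.compile(r"[^a-zA-Z0-9 -]")

text = "some text-with #hyphens"
result = keep_spaces.sub("", text)
print(result)  # some text-with hyphens
```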


Edit: SilentGhost has suggested that it is likely cheaper for the engine to process the string with a quantifier, in which case you can simply use:

re.compile(r"[^a-zA-Z0-9-]+")

The + causes each run of consecutive matching characters to be matched (and replaced) as a single unit.
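As a rough check of that claim, `re.subn` returns the number of substitutions along with the new string, so the two patterns can be compared directly. The sample string and variable names here are illustrative, and this says nothing about actual timing.

```python
import re

sample = "some#%te_xt&with--5 hy-phens*#"

single = re.compile(r"[^a-zA-Z0-9-]")   # one match per unwanted character
runs = re.compile(r"[^a-zA-Z0-9-]+")    # one match per run of unwanted characters

# subn returns a (new_string, number_of_substitutions) tuple.
cleaned_single, count_single = single.subn("", sample)
cleaned_runs, count_runs = runs.subn("", sample)

print(cleaned_single == cleaned_runs)  # True: same output either way
print(count_single, count_runs)        # 7 5
```

The result is identical; only the number of replacements differs, and how much that matters depends on the input.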

eldarerathis
  • +1 You are right. I removed my answer, since yours covers what I think he wants: to match any character that is not a number, letter, or hyphen. – 逆さま Nov 05 '10 at 17:57
  • A quantifier would make this cheaper. – SilentGhost Nov 05 '10 at 18:05
  • @SilentGhost: Is it cheaper for the engine? I'm assuming you mean in terms of performance time. – eldarerathis Nov 05 '10 at 18:10
  • @eldarerathis: Is it cheaper to replace 10 items or 3? The actual number of items will depend on the subject, of course. – SilentGhost Nov 05 '10 at 18:11
  • @SilentGhost: Cheaper to replace but more expensive to match (most likely, since the engine will be matching then backtracking at the end of each run). Although the replace seems like it would be the more expensive of the two operations, so you're probably right in the overall sense. – eldarerathis Nov 05 '10 at 18:17

\w matches alphanumerics (and the underscore); add the hyphen, then negate the entire set: r"[^\w-]"
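A small sketch of how this pattern behaves, assuming Python 3, where `\w` on `str` patterns is Unicode-aware unless `re.ASCII` is given. The sample string and variable names are illustrative. Because `\w` includes the underscore, the underscore has to be added back as an alternative if it should be stripped as well, as in the question's original regex.

```python
import re

sample = "na\u00efve_text-1!"  # "naïve_text-1!"

# Unicode-aware by default: keeps the accented letter and the underscore.
unicode_kept = re.sub(r"[^\w-]", "", sample)

# re.ASCII restricts \w to [a-zA-Z0-9_], so the accented letter is removed.
ascii_kept = re.sub(r"[^\w-]", "", sample, flags=re.ASCII)

# Adding "_" as an alternative strips the underscore too.
no_underscore = re.sub(r"[^\w-]|_", "", sample, flags=re.ASCII)

print(unicode_kept)   # naïve_text-1
print(ascii_kept)     # nave_text-1
print(no_underscore)  # navetext-1
```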

Ned Batchelder