40

Is there a concise way to express:

\w but without _

That is, "all characters included in \w, except _"

I'm asking this because I'm looking for the most concise way to express domain name validation. A domain name may include lowercase and uppercase letters, numbers, period signs and dashes, but no underscores. \w includes all of the above, plus an underscore. So, is there any way to "remove" an underscore from \w via regex syntax?

Edited: I'm asking about regex as used in PHP.

Thanks in advance!

Joseph Silber
Dimitri Vorontzov
  • 4
    Depends on the regex flavour. Which language are you using? The easiest way though would be to just use `[A-Za-z0-9]`. `\w` does (normally) **not** include dashes or periods. – Felix Kling Feb 13 '13 at 16:37
  • 1
    Depending on the flavor, `\w` may support Unicode characters. Unless you are completely sure what `\w` represents, it is best to use a character class `[]` and list the characters explicitly. – nhahtdh Feb 13 '13 at 16:38

7 Answers

52

Use the following character class (in Perl):

[^\W_]

\W is the same as [^\w], so [^\W_] matches any character that is neither a non-word character nor an underscore.
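A quick check of this double-negation class, sketched in Python's `re` module rather than Perl. The `re.ASCII` flag and the sample string are assumptions for illustration; the flag limits `\w` to `[A-Za-z0-9_]`.

```python
import re

# [^\W_] reads as "not a non-word character and not an underscore".
# re.ASCII is an assumption here: it restricts \w to [A-Za-z0-9_].
WORD_NO_UNDERSCORE = re.compile(r"[^\W_]", re.ASCII)

sample = "ab_C-9.z"  # hypothetical input
matches = WORD_NO_UNDERSCORE.findall(sample)
print(matches)
```

Only the letters and digits come back; the underscore, dash and period are all skipped.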

protist
13

You could use a negative lookahead: (?!_)\w

However, I think writing [a-zA-Z0-9.-] is more readable.
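A small comparison of the two spellings, sketched in Python's `re` module as a stand-in for PHP's PCRE. The variable names and sample text are made up, and `re.ASCII` is assumed so that `\w` means `[A-Za-z0-9_]`.

```python
import re

# Lookahead form: a word character that is not an underscore.
lookahead = re.compile(r"(?:(?!_)\w)+", re.ASCII)
# Explicit class with the same meaning in ASCII mode.
explicit = re.compile(r"[a-zA-Z0-9]+")
# The class from the answer, which additionally admits '.' and '-'.
domain_chars = re.compile(r"[a-zA-Z0-9.-]+")

text = "foo_bar baz-42"  # hypothetical input
lookahead_hits = lookahead.findall(text)
explicit_hits = explicit.findall(text)
domain_hits = domain_chars.findall(text)
print(lookahead_hits, explicit_hits, domain_hits)
```

The first two patterns agree on this input, while the third keeps `baz-42` together because the dash is part of the class.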

Felix Kling
Bergi
  • 1
    That would be `(?!_)\w`, no? – Zero Piraeus Feb 13 '13 at 16:42
  • Look-around is slower than normal matching. May not matter here, though – nhahtdh Feb 13 '13 at 16:47
  • Thanks a lot, @Bergi - I have a question: wouldn't it be proper to write [a-zA-Z0-9\.\-], escaping the period and dash, or is it wrong/unnecessary to escape them in this case? (I'm new to regex, and this may be a silly question...) – Dimitri Vorontzov Feb 13 '13 at 17:14
  • 1
    Not necessary: http://www.regular-expressions.info/charclass.html. Only characters that have a special meaning in a character class (`]\^-`) need to be escaped, and not when unambiguous. – Bergi Feb 13 '13 at 17:20
  • Thank you very much, @Bergi! So, looking through the entire body of answers to my question, these solutions would all work: (?!_)\w --- [^\W_] --- or [A-Za-z0-9.-] --- am I right? – Dimitri Vorontzov Feb 13 '13 at 17:23
  • 1
    @Dimitri: Yes, provided that `\w` means `[a-zA-Z0-9_]` in your regex flavour. – Bergi Feb 13 '13 at 17:27
3

To be on the safe side, we usually use an explicit character class:

[a-zA-Z0-9.-]

The regex "fragment" above matches the letters of the English alphabet and digits, plus the period . and the dash -. It should work even with the most basic regex support.

Shorter may be better, but only if you know exactly what it represents.

I don't know what language you are using. In a lot of engines, \w is equivalent to [a-zA-Z0-9_] (some require an "ASCII mode" for this). However, some engines have Unicode support for regex, and may extend \w to match Unicode characters.
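To illustrate how much the meaning of `\w` can shift, here is a sketch using Python's `re` module, where `\w` is Unicode-aware by default for string patterns and `re.ASCII` switches it back. Other engines, including PHP's, have their own defaults and flags, so treat this only as an example of the effect.

```python
import re

text = "caf\u00e9_1"  # "café_1", written with an escape to keep the source ASCII

# Default for str patterns in Python: \w also matches non-ASCII letters.
unicode_hits = re.findall(r"\w", text)
# With re.ASCII, \w is limited to [a-zA-Z0-9_].
ascii_hits = re.findall(r"\w", text, re.ASCII)
# The explicit class means the same thing regardless of mode.
explicit_hits = re.findall(r"[a-zA-Z0-9.-]", text)

print(unicode_hits, ascii_hits, explicit_hits)
```

The explicit class is the only one of the three whose result does not depend on an engine setting.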

nhahtdh
3

If my understanding is right, \w means [A-Za-z0-9_]; periods and dashes are not included.

info: http://en.wikipedia.org/wiki/Regular_expression#POSIX_character_classes

So I guess what you want is [a-zA-Z0-9.-]
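As a rough sketch of how that class might be used to check a whole string, here is a Python version. The helper name `has_only_domain_chars` is hypothetical, and the check covers only the character set, not the other rules for domain names such as label length or where a dash may appear.

```python
import re

# One or more characters drawn from the suggested class.
DOMAIN_CHARS = re.compile(r"[a-zA-Z0-9.-]+")


def has_only_domain_chars(name: str) -> bool:
    """Return True if every character of name is a letter, digit, '.' or '-'."""
    # fullmatch anchors the pattern at both ends of the string.
    return DOMAIN_CHARS.fullmatch(name) is not None


print(has_only_domain_chars("example.com"))
print(has_only_domain_chars("my_site.com"))
```

Using `fullmatch` rather than `search` matters here: an unanchored search would accept any string that merely contains one allowed character.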

Kent
1

Some regex flavours have a negative lookbehind syntax you might use:

\w(?<!_)
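A short sketch of the lookbehind form in Python's `re` module, which supports fixed-width lookbehind. `re.ASCII` and the sample string are assumptions for illustration.

```python
import re

# Match a word character, then assert that the character just consumed
# was not an underscore.
lookbehind = re.compile(r"\w(?<!_)", re.ASCII)

hits = lookbehind.findall("a_b-1")  # hypothetical input
print(hits)
```

The underscore is matched by `\w` first and then rejected by the lookbehind, so it never appears in the result.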
Zero Piraeus
  • 2
    Negative lookaheads are more widely supported than negative lookbehinds. – Joseph Silber Feb 13 '13 at 16:42
  • 1
    @JosephSilber True. Conceptually, I find "give me a word character ... but not an underscore" slightly easier than "the next thing I want shouldn't be an underscore ... otherwise, give me a word character" to follow, if negative lookbehinds *are* available, though. – Zero Piraeus Feb 13 '13 at 16:49
0

I would start with [^_], and then think about which other characters I need to deny. If you need to filter keyboard input, it's quite simple to enumerate all the unwanted characters.

Zoltán Tamási
  • 2
    This is a very poor approach. Domain names have a defined set of allowed characters, so whitelisting can be done. When you blacklist, you also need to work out which Unicode characters to deny. – nhahtdh Feb 13 '13 at 16:50
  • @nhahtdh, I've taken into account that domain names CAN have Unicode characters (for example accented vowels). So I think it's quite hard to precisely form an ultimate correct whitelist solution. – Zoltán Tamási Feb 13 '13 at 17:25
  • There are specs for that - it is troublesome, but defined. People tend to forget/overlook things when blacklisting. – nhahtdh Feb 13 '13 at 17:28
  • I agree, that's why I mentioned if the case is a keyboard input, because that can simplify things IMHO. – Zoltán Tamási Feb 13 '13 at 17:34
0

You can write something like this:

/([^\w]|_)/u

If you use preg_filter with this pattern and an empty replacement, every character that is not in \w, plus the underscore, is removed, so only the characters in \w other than _ remain.
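The same filtering idea can be sketched in Python, with `re.sub` standing in for PHP's preg functions. The pattern body is copied from the answer, while the sample input is made up and the PHP delimiters and `u` modifier have no direct counterpart here.

```python
import re

# Matches anything that is either not a word character or is an underscore.
non_word_or_underscore = re.compile(r"([^\w]|_)")

# Replacing every match with the empty string keeps only \w minus '_'.
cleaned = non_word_or_underscore.sub("", "my_domain-name.com")
print(cleaned)
```

Note that this strips the period and dash as well, so it suits cleaning a single word more than validating a full domain name.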

MrD