Regular expression cannot match special symbols in Python

Question

I have a string, s = "we are \xaf\x06OK\x03family, good", and I want to replace \xaf, \x06 and \x03 with ''. My regular expression is pat = re.compile(r'\\[xX][0-9a-fA-F]+'), but it does not match anything. The code is below:

import re

pat = re.compile(r'\\[xX][0-9a-fA-F]+')
s = "we are \xaf\x06OK\x03family, good"
print(s)
print(re.sub(pat, '', s))

The result is

we are ¯OKfamily, good
we are ¯OKfamily, good

But how can I get we are OK family, good?

Look at [your string](https://ideone.com/xL4LHd), it has no *literal* `\xaf`, etc. [Your regex](https://regex101.com/r/atDC2s/1) does not find any match in it. — Wiktor Stribiżew, Jan 02 '19 at 08:12
If you want to remove all control characters, see https://stackoverflow.com/questions/4324790/removing-control-characters-from-a-string-in-python — Wiktor Stribiżew, Jan 02 '19 at 08:20

score 2 · Answer 1 · answered Jan 02 '19 at 08:20


You have to treat your input string s as a raw string for this to work; see the example below:

import re

# Match a literal backslash, x or X, and exactly two hex digits
pat = re.compile(r'\\[xX][0-9a-fA-F]{2}')
s = r"we are \xaf\x06OK\x03family, good"
print(s)
print(re.sub(pat, '', s))

answered Jan 02 '19 at 08:20

Amit Nanaware


tripleee · Accepted Answer · 2019-01-02T09:45:53.133

You are making the basic but common mistake of confusing the representation of a string in Python source code with its actual value.

There are a number of escape codes in Python which do not represent themselves verbatim in regular strings in source code. For example, "\n" represents a single newline character, even though the Python notation occupies two characters. The backslash is used to introduce this notation. There are a number of dedicated escape codes like \r, \a, etc, and a generalized notation \x01 which allows you to write any character code in hex notation (\n is equivalent to \x0a, \r is equivalent to \x0d, etc). To represent a literal backslash character, you need to escape it with another backslash: "\\".
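The difference between notation and value can be checked directly in an interpreter. The following is a minimal sketch; the variable names are only illustrative and do not come from the question.

```python
# Each literal is written with an escape sequence in source code,
# but the resulting string holds a single character.
newline = "\n"
hex_newline = "\x0a"
backslash = "\\"
macron = "\xaf"   # the first "special symbol" in the question's string

print(len(newline))            # 1
print(newline == hex_newline)  # True
print(len(backslash))          # 1
print(len(macron))             # 1
```

So the string in the question never contains a backslash followed by x, which is why a pattern looking for a literal backslash finds nothing.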

In a "raw string", no backslash escapes are supported; so r"\n" represents a string containing two characters, a literal backslash \ and a literal lowercase n. You could equivalently write "\\n" using non-raw string notation. The r prefix is not part of the string, it just tells Python how to interpret the string between the following quotes (i.e. no interpretation at all; every character represents itself verbatim).
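A short sketch of the same point for raw strings, again with illustrative variable names:

```python
raw = r"\n"        # raw string: backslash and n are kept as written
escaped = "\\n"    # regular string with an escaped backslash

print(len(raw))          # 2
print(raw == escaped)    # True
print(list(raw))         # ['\\', 'n']

# The same four source characters give very different values:
print(len("\xaf"))       # 1
print(len(r"\xaf"))      # 4
```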

It is not clear from your question which of these interpretations you actually need, so I will present solutions for both.

Here is a literal string containing actual backslashes:

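import re

# A \x escape always has exactly two hex digits; a greedy + would also
# consume the "fa" of "family", because f and a are valid hex digits.
pat = re.compile(r'\\[xX][0-9a-fA-F]{2}')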
s = r"we are \xaf\x06OK\x03family, good"
print(s)
print(re.sub(pat, '', s))
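One detail worth checking when the string contains literal backslashes is the quantifier on the hex digits. The comparison below is a sketch rather than part of the original answer; the names `greedy` and `exact` are made up for illustration.

```python
import re

s = r"we are \xaf\x06OK\x03family, good"

greedy = re.compile(r'\\[xX][0-9a-fA-F]+')    # one or more hex digits
exact = re.compile(r'\\[xX][0-9a-fA-F]{2}')   # exactly two hex digits

greedy_result = greedy.sub('', s)
exact_result = exact.sub('', s)

print(greedy_result)  # we are OKmily, good
print(exact_result)   # we are OKfamily, good
```

The greedy version swallows the "fa" of "family" because both letters are valid hex digits, so limiting the match to two digits is the safer choice here.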

Here is a string containing control characters and non-ASCII characters, and a regex substitution to remove them:

import re

# Match runs of control characters (0x00-0x1f) and characters in 0x80-0xff
pat = re.compile(r'[\x00-\x1f\x80-\xff]+')
s = "we are \xaf\x06OK\x03family, good"
print(s)
print(re.sub(pat, '', s))

An additional complication is that the regex engine has its own internal uses for backslashes; we generally prefer to use raw strings for regexes in order to not have Python and the regex engine both interpreting backslashes (sometimes in incompatible ways).
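A sketch of how the two layers of backslash interpretation can clash, using `\b` as the example: Python reads it as a backspace character in a regular string, while the regex engine reads it as a word boundary. The sample text is invented for illustration.

```python
import re

text = "a cat sat"

# Regular string: Python turns \b into a backspace character (0x08)
# before the regex engine sees it, so nothing in the text matches.
no_match = re.search('\bcat\b', text)

# Raw string: the regex engine receives \b and treats it as a word boundary.
raw_match = re.search(r'\bcat\b', text)

# Equivalent without a raw string, at the cost of doubled backslashes.
doubled_match = re.search('\\bcat\\b', text)

print(no_match)               # None
print(raw_match.group())      # cat
print(doubled_match.group())  # cat
```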

Sorry, stupid copy/paste mistake, code updated, try now. — tripleee, Jan 02 '19 at 09:46

Tiw · Answer 3 · 2019-01-02T08:54:20.710

0

Another approach:

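import re

# Split on runs of characters that are not word characters, whitespace or commas
# (\w already covers digits, so \d is not needed)
pat = re.compile(r'[^\w\s,]+')
s = "we are \xaf\x06OK\x03family, good"
print(' '.join(part.strip() for part in pat.split(s)))
#=> we are OK family, good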

This uses a negated character class: the string is split on any run of characters that you do not want, and the remaining pieces are joined back together with single spaces.

edited Jan 02 '19 at 08:54

answered Jan 02 '19 at 08:28

Tiw

