1

I have a string: s = "we are \xaf\x06OK\x03family, good", and I want to replace \xaf, \x06 and \x03 with ''. The regular expression is pat = re.compile(r'\\[xX][0-9a-fA-F]+'), but it does not match anything. The code is below:

import re

pat = re.compile(r'\\[xX][0-9a-fA-F]+')
s = "we are \xaf\x06OK\x03family, good"
print(s)
print(re.sub(pat, '', s))

The result is

we are ¯OKfamily, good
we are ¯OKfamily, good

But how can I get "we are OK family, good"?

littlely

3 Answers

2

You have to treat your input string s as a raw string; then this works. See the example below:

import re

# Match a backslash, x or X, and exactly two hex digits.
pat = re.compile(r'\\[xX][0-9a-fA-F]{2}')
s = r"we are \xaf\x06OK\x03family, good"
print(s)
print(re.sub(pat, '', s))
Amit Nanaware
2

You are making the basic but common mistake of confusing the representation of a string in Python source code with its actual value.

There are a number of escape codes in Python which do not represent themselves verbatim in regular strings in source code. For example, "\n" represents a single newline character, even though the Python notation occupies two characters. The backslash is used to introduce this notation. There are a number of dedicated escape codes like \r, \a, etc, and a generalized notation \x01 which allows you to write any character code in hex notation (\n is equivalent to \x0a, \r is equivalent to \x0d, etc). To represent a literal backslash character, you need to escape it with another backslash: "\\".
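As a quick check of the point above, the following sketch compares a few escaped literals with the characters they actually produce. The variable names are only illustrative.

```python
# Each escape sequence below denotes a single character in the resulting string.
newline = "\n"
hex_newline = "\x0a"
backslash = "\\"

print(len(newline))            # 1
print(newline == hex_newline)  # True
print(len(backslash))          # 1

# The question's string holds three single characters written as \xNN escapes;
# it contains no backslash at all, so a pattern looking for one cannot match.
s = "we are \xaf\x06OK\x03family, good"
print("\\" in s)               # False
```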

In a "raw string", no backslash escapes are supported; so r"\n" represents a string containing two characters, a literal backslash \ and a literal lowercase n. You could equivalently write "\\n" using non-raw string notation. The r prefix is not part of the string, it just tells Python how to interpret the string between the following quotes (i.e. no interpretation at all; every character represents itself verbatim).
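The same kind of check for raw strings, as a minimal sketch with illustrative variable names: the r prefix changes how the literal is read, not what kind of object results.

```python
raw = r"\n"
escaped = "\\n"

print(len(raw))        # 2
print(raw == escaped)  # True
print(list(raw))       # ['\\', 'n']

# Written as a raw literal, the question's string really does contain backslashes.
s_raw = r"we are \xaf\x06OK\x03family, good"
print(s_raw.count("\\"))  # 3
```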

It is not clear from your question which of these interpretations you actually need, so I will present solutions for both.

Here is a literal string containing actual backslashes:

import re

# Exactly two hex digits: with '+', the 'fa' of 'family' would also be consumed.
pat = re.compile(r'\\[xX][0-9a-fA-F]{2}')
s = r"we are \xaf\x06OK\x03family, good"
print(s)
print(re.sub(pat, '', s))

Here is a string containing control characters and non-ASCII characters, and a regex substitution to remove them:

import re

# Matches control characters U+0000 to U+001F and code points U+0080 to U+00FF.
pat = re.compile(r'[\x00-\x1f\x80-\xff]+')
s = "we are \xaf\x06OK\x03family, good"
print(s)
print(re.sub(pat, '', s))

An additional complication is that the regex engine has its own internal uses for backslashes; we generally prefer to use raw strings for regexes in order to not have Python and the regex engine both interpreting backslashes (sometimes in incompatible ways).
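One case where the two layers disagree is \b, sketched below with a made-up sample string. In a regular Python literal it is a backspace character, while the regex engine reads it as a word boundary, so the raw and non-raw patterns behave differently.

```python
import re

text = "cat concat"

# Raw string: the regex engine receives \b and treats it as a word boundary.
with_raw = re.findall(r'\bcat\b', text)

# Regular string: Python turns \b into a backspace character (\x08) first,
# so the pattern looks for backspace + 'cat' + backspace.
without_raw = re.findall('\bcat\b', text)

print(with_raw)     # ['cat']
print(without_raw)  # []
```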

tripleee
0

Another approach:

import re

# \d is already covered by \w, so the class keeps word characters, whitespace and commas.
pat = re.compile(r'[^\w\s,]+')
s = "we are \xaf\x06OK\x03family, good"
print(' '.join(part.strip() for part in pat.split(s)))
#=> we are OK family, good

This uses a negated character class: it splits the string on any run of characters other than the ones you want to keep, then joins the pieces with a single space.

Tiw