I'm trying to use this regexp in R:

\?(?=([^'\\]*(\\.|'([^'\\]*\\.)*[^'\\]*'))*[^']*$)

I'm escaping like so:

\\?(?=([^'\\\\]*(\\\\.|'([^'\\\\]*\\\\.)*[^'\\\\]*'))*[^']*$)

I get an invalid regexp error.

Regexpal has no problem with the regex, and I've checked that the interpreted regex in the R error message is exactly the same as the one I'm using in Regexpal, so I'm at a loss. I don't think the escaping is the problem.

Code:

output <- sub("\\?(?=([^'\\\\]*(\\\\.|'([^'\\\\]*\\\\.)*[^'\\\\]*'))*[^']*$)", "!", "This is a test string?")
John Chrysostom

1 Answer

By default, R uses POSIX-style (Portable Operating System Interface) extended regular expressions, implemented by the TRE library (see these SO posts [1,2] and ?regex [caveat emptor: machete-level density ahead]).

Look-ahead ((?=...)), look-behind ((?<=...)), and their negations ((?!...) and (?<!...)) are probably the most salient examples of PCRE-specific (Perl-Compatible Regular Expressions) constructs, which POSIX does not support. That is why your pattern is rejected as an invalid regexp.

R can be made to understand your regex by setting the perl argument to TRUE; this argument is available in the base pattern-matching functions (sub, gsub, grepl, regexpr, gregexpr, etc.):

output <- sub(
  "\\?(?=([^'\\\\]*(\\\\.|'([^'\\\\]*\\\\.)*[^'\\\\]*'))*[^']*$)",
  "!",
  "This is a test string?",
  perl = TRUE
)
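
To see the difference between the two engines in isolation, here is a minimal sketch using a toy look-ahead pattern (not the one from the question; the variable names are just for illustration). It assumes nothing beyond base R:

```r
# Toy pattern (hypothetical, for illustration): a "?" only when followed by "x".
pattern <- "\\?(?=x)"

# Default engine: the look-ahead is not valid syntax, so sub() raises an error.
# suppressWarnings() hides the accompanying compilation warning.
default_result <- tryCatch(
  suppressWarnings(sub(pattern, "!", "a?x")),
  error = function(e) e
)
inherits(default_result, "error")  # TRUE

# PCRE engine: the same pattern compiles and matches.
perl_result <- sub(pattern, "!", "a?x", perl = TRUE)
perl_result  # "a!x"
```

The look-ahead is zero-width, so only the "?" is replaced and the "x" stays in place.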

In R >= 4.0, which supports raw strings, the original pattern looks much less intimidating because the backslashes no longer need doubling:

output <- sub(
  R"(\?(?=([^'\\]*(\\.|'([^'\\]*\\.)*[^'\\]*'))*[^']*$))",
  "!",
  "This is a test string?",
  perl = TRUE
)
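
As a quick sanity check, the sketch below (assuming R >= 4.0 for the raw string; the variable names and the second test string are made up for illustration) confirms that both spellings denote the same pattern, and shows the pattern on an input with a quoted question mark. My reading is that the look-ahead only lets a "?" match when the single quotes after it are balanced, i.e. when it sits outside a quoted string:

```r
# The escaped string and the raw string should spell the same regex.
escaped <- "\\?(?=([^'\\\\]*(\\\\.|'([^'\\\\]*\\\\.)*[^'\\\\]*'))*[^']*$)"
raw     <- R"(\?(?=([^'\\]*(\\.|'([^'\\]*\\.)*[^'\\]*'))*[^']*$))"
same_pattern <- identical(escaped, raw)
same_pattern  # TRUE

# The example from the question: the trailing "?" is replaced.
plain <- sub(raw, "!", "This is a test string?", perl = TRUE)
plain  # "This is a test string!"

# Hypothetical input: the "?" inside single quotes is skipped,
# and the unquoted one at the end is replaced.
quoted <- sub(raw, "!", "Is 'a?' ok?", perl = TRUE)
quoted  # "Is 'a?' ok!"
```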
MichaelChirico