I'm trying to take the approach from "Regex substring one mismatch in any location of string" and scale it to a big-data situation where I can:

Match all instances of a long substring such as SSQPSPSQSSQPSS (allowing at most one mismatch within the substring) in a much larger string such as SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQSPSSQSSQPSS.

In reality, my substrings and the strings that I match them against are hundreds and sometimes even thousands of letters long, and I want to allow for mismatches.

How can I scale the regex approach from "Regex substring one mismatch in any location of string" to solve my big data problems? Is there an efficient way to go about this?

warship

1 Answer

You may try this:

>>> import re
>>> s = "SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQSPSSQSSQPSS"
>>> re.findall(r'(?=(SSQPSPSQSSQPSS|[A-Z]SQPSPSQSSQPSS|S[A-Z]QPSPSQSSQPSS|SS[A-Z]PSPSQSSQPSS))', s)
['SSQPSPSQSSQPSS', 'SSQPSPSQSSQPSS', 'SSQPSPSQSSQPSS', 'SSQPSPSQSSQPSS']

Likewise, add one alternative for each remaining position, replacing the character at that position with [A-Z].
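Writing those alternatives by hand is not practical for substrings with hundreds of letters, so here is a minimal sketch of two ways to automate it. The function names `one_mismatch_pattern` and `find_with_mismatches` are made up for illustration, and `[A-Z]` as the wildcard assumes the text contains only uppercase letters. The first function generates the same kind of alternation as above for every position; the second skips regex entirely and counts substitutions in a sliding window, which avoids building a pattern whose size grows roughly with the square of the substring length.

```python
import re


def one_mismatch_pattern(sub):
    """Build a lookahead regex that matches `sub` with at most one substituted character."""
    alternatives = [re.escape(sub)]  # the exact match
    for i in range(len(sub)):
        # Replace position i with a wildcard class (assumes uppercase-only text).
        alternatives.append(re.escape(sub[:i]) + "[A-Z]" + re.escape(sub[i + 1:]))
    # The lookahead makes overlapping matches visible, as in the pattern above.
    return re.compile("(?=(" + "|".join(alternatives) + "))")


def find_with_mismatches(sub, text, max_mismatches=1):
    """Return (start, window) pairs where the window differs from `sub`
    in at most `max_mismatches` positions (substitutions only)."""
    hits = []
    n = len(sub)
    for start in range(len(text) - n + 1):
        window = text[start:start + n]
        mismatches = 0
        for a, b in zip(sub, window):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # stop comparing this window early
        if mismatches <= max_mismatches:
            hits.append((start, window))
    return hits


s = "SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQSPSSQSSQPSS"
sub = "SSQPSPSQSSQPSS"

regex_hits = [m.group(1) for m in one_mismatch_pattern(sub).finditer(s)]
scan_hits = [window for _, window in find_with_mismatches(sub, s)]
```

Both lists should contain the same windows in the same order for uppercase-only input. The sliding-window version also lets you raise `max_mismatches` without changing the pattern, though it only handles substitutions, not insertions or deletions.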

Avinash Raj