5

How do I make a regular expression to evaluate the following string?

TGATGCCGTCCCCTCAACTTGAGTGCTCCTAATGCGTTGC

and extract the pattern CTCCT.

The pattern must be 3 C's and 2 T's in any order.

I tried /[C | T]{5}/ but it matches CCCCT and TCCCC

Thanks in Advance.

le_m
  • 17,771
  • 9
  • 60
  • 73

2 Answers2

3

This isn't the type of problem that is easily solved using Regular Expressions. It can be solved fairly straighforwardly with a simple function, however

 function c3t2(str) {
  var lowerCaseStr = str.toLowerCase();
  for (index = 0; index + 5 <= str.length; index++) {
    var substring = lowerCaseStr.substring(index, index + 5);
    var chars = substring.split("");
    if (chars.sort().join("") === "ccctt") {
      return index;
    }
  }

  return false;
}
Andrew Rueckert
  • 4,433
  • 1
  • 30
  • 40
  • I'm happy to see this answer because regex is overused. I can't upvote yet though, because it should be `index + 4 < str.length` – Millie Smith Jun 15 '16 at 01:24
  • Ah, you're right! Used `<=` instead, to emphasize the relationship with the `index + 5` below. – Andrew Rueckert Jun 15 '16 at 01:33
  • True. it's probably best to do a case insensitive comparison or just tolowercase the whole thing. – Millie Smith Jun 15 '16 at 01:37
  • 1
    Agreed. To-Lower-Cased it up front **for efficiency**. :P – Andrew Rueckert Jun 15 '16 at 01:43
  • 1
    See https://jsfiddle.net/24qguege/ for efficiency :P - perhaps you could speed this up by splitting the string beforehand and iterating over an array or just performing a simple string.search() for each unique permutation? – le_m Jun 15 '16 at 01:44
  • 1
    Haha, yeah. If I'm gunning for efficiency (https://jsfiddle.net/24qguege/3/), I can beat the regex, using some of the same tricks that the regex engine does. Readability kind of goes out the window, though, and I wanted to highlight how simple the solution could be. – Andrew Rueckert Jun 15 '16 at 02:02
  • @AndrewRueckert Looking at your fiddle https://jsfiddle.net/24qguege/3/ it seems we had the same idea^^ https://jsfiddle.net/24qguege/4/ - can you make it even faster than regex? – le_m Jun 15 '16 at 02:28
  • @AndrewRueckert Your fiddle is incomplete, right? It only matches patterns starting at indices which are multiples of 5. – le_m Jun 15 '16 at 02:47
2

Compute all permutations of "CTCCT" and concatenate them to a regex:

CCCTT|CCTCT|CCTTC|CTCCT|CTCTC|CTTCC|TCCCT|TCCTC|TCTCC|TTCCC

This pattern can be optimized:

C(?:C(?:T(?:CT|TC)|CTT)|T(?:C(?:CT|TC)|TCC))|T(?:C(?:C(?:CT|TC)|TCC)|TCCC)

var regex = new RegExp(/C(?:C(?:T(?:CT|TC)|CTT)|T(?:C(?:CT|TC)|TCC))|T(?:C(?:C(?:CT|TC)|TCC)|TCCC)/g);

var string = "TGATGCCGTCCCCTCAACTTGAGTGCTCCTAATGCGTTGC";

console.log(regex.exec(string));

This pattern doesn't find overlapping matches, e. g. there would only be one match in CCCTTCCC.

To find overlapping matches, use lookahead:

C(?=C(?=T(?=CT|TC)|CTT)|T(?=C(?=CT|TC)|TCC))|T(?=C(?=C(?=CT|TC)|TCC)|TCCC)

var regex = new RegExp(/C(?=C(?=T(?=CT|TC)|CTT)|T(?=C(?=CT|TC)|TCC))|T(?=C(?=C(?=CT|TC)|TCC)|TCCC)/g);

var string = "CCCTTCCC";

while ((match = regex.exec(string)) != null) {
    console.log(match.index, string.substring(match.index, match.index + 5));
}

Regex can only deal with a fairly limited number of permutations. If you want to match segments of possibly arbitrary size, use a non-regex solution:

function c3t2_optimized(str) {
  var c = 0, t = 0;
  for (var i = 0; i < str.length; ++i) {
    var last = str.charAt(i);
    if (last == 'C') ++c;
    else if (last == 'T') ++t;
    if (i > 4) {
      var first = str.charAt(i - 5);
      if (first == 'C') --c;
      else if (first == 'T') --t;
    }
    if (c == 3 && t == 2) return i - 4;
  }
  return -1;
}

var string = "TGATGCCGTCCCCTCAACTTGAGTGCTCCTAATGCGTTGC";
      
console.log(c3t2_optimized(string));

Or the same as above, just as a generator stepping through all possibly overlapping matches:

function* c3t2_optimized(str) {
  var c = 0, t = 0;
  for (var i = 0; i < str.length; ++i) {
    var last = str.charAt(i);
    if (last == 'C') ++c;
    else if (last == 'T') ++t;
    if (i > 4) {
      var first = str.charAt(i - 5);
      if (first == 'C') --c;
      else if (first == 'T') --t;
    }
    if (c == 3 && t == 2) yield i - 4;
  }
}

var string = "CCCTTCCC";

for (i of c3t2_optimized(string)) {
  console.log(i, string.substring(i, i + 5));
}

Performance comparison: https://jsfiddle.net/24qguege/7/

Firefox 47:

Community
  • 1
  • 1
le_m
  • 17,771
  • 9
  • 60
  • 73
  • I'll upvote your answer since you put effort into it and compared it against the other one. I just want to note a few things for anybody who drops in. The non-regex answer is far better if you aren't calling this thousands of times. My tests were half your times (34 and 4955), so it took 0.05 milliseconds for the non-regex answer to run each time (100k iterations). If efficiency is truly needed, js isn't efficient but maybe you need it, and the optimized regex is far better because at its core it is just a deterministic finite state machine (it should be. haven't looked up the ?: characters). – Millie Smith Jun 15 '16 at 01:53
  • @MillieSmith Thanks. I added a non-regex answer which can handle arbitrarily large 'patterns'. – le_m Jun 15 '16 at 02:23
  • @le_m This is really hot man. Thanks. Do you want to see the whole problem? I have to search a string for all the possible combinations of CCCTT. – Ryan Pace Sloan Jun 15 '16 at 04:11
  • @le_m what would an algorithm to compute all the permutations look like? – Ryan Pace Sloan Jun 15 '16 at 04:13
  • @RyanPaceSloan Compute permutations: http://stackoverflow.com/a/37580979/1647737 - Keep only unique elements: http://stackoverflow.com/a/37821550/1647737 - For special cases such as CCCTT-permutations, there might even be more efficient algorithms – le_m Jun 15 '16 at 04:15
  • @RyanPaceSloan "Do you want to see the whole problem?" yes. Is it related to https://www.hackerrank.com/contests/hourrank-6/challenges/bear-and-steady-gene ? – le_m Jun 15 '16 at 04:46
  • I permutated all the variations of the string "CCCTT" and formed a regular expression with them. – Ryan Pace Sloan Jun 15 '16 at 05:24
  • http://stackoverflow.com/questions/17006808/find-all-permutations-of-a-string-in-c – Ryan Pace Sloan Jun 15 '16 at 05:44