-1

I have a case where I'm using regex to check a string for matches, after a start index and before an end one.

To set the start point, it is possible to create a new Regexp() and set it's index before running exec().

I can not find a way, though, to limit the depth of its search. One "obvious" solution might be to use substring() to get a string that can't be searched passed a point, but it would have a massive negative impact on performance. Any solution to setting a limit on regex search depth that includes substring() will not do, and is IMO embarrassingly inefficient, so please do not post them.

The three ways I can imagine fixing this are:

  1. if there is some way to set a limit, such as myRegex.exec(string, stopIndex) or myString.length = temporaryFakeLimit; //do regex, reset length. Neither of these work or exist.

  2. if there was some way to check for a regex match at an index in a string ie myRegex.testAt(myString, indexToCheck), I could iterate through the string myself, checking for matches.

  3. if there was a non-native implementation of regex, which is capable of doing either of the above.

So far I have not been able to find a good solution.

How can I check if a Regex has a match in a string before some index, without it searching the entire string?


Edit (Not by OP)

To add further:

If it doesn't find a match, and I'm only looking 100 chars deep on a 100000 char text, that's an issue.

How can we search a portion of a given string using RegEx. This portion of string should be determined by parameters indicating a start index and an end index.

Example of Parameters

var str = `Check enhancement bonus fear effect fly initiative check panicked points of damage rounding stunned touch attack unarmed strike. Aquatic subtype attack of opportunity catching on fire charm conjuration deafened evasion evil domain fast healing favored class fire domain gaseous form healing subschool incorporeal melee weapon multiplying skill points stunned summoning subschool take 10 turn. Adjacent class concentrate on a spell energy drained infection intelligence invisible law domain mundane nonlethal damage small.`;

var rgx = /\d\d/;

var start = 300;

var end = 500;

The result should be:

10

zer00ne
  • 36,692
  • 5
  • 39
  • 58
Seph Reed
  • 6,511
  • 8
  • 42
  • 89
  • What is the input string and expected result? – guest271314 Nov 03 '17 at 00:04
  • There is no input string or expected result. This is not that kind of RegExp question. If you must have one, the string is `"test string"`, the limit of depth is 4, and the regex is `/i/`. It should return no match, because "string" is after index 4. – Seph Reed Nov 03 '17 at 00:04
  • _"There is no input string or expected result. This is not that kind of RegExp question."_ ? Not sure what you mean? An input string is necessary to match a portion of the string, else no problem statement exists at the Question relevant to `RegExp`. – guest271314 Nov 03 '17 at 00:05
  • This is a question about a function of RegExp that doesn't appear to exist. There is no way to create code using a function that doesn't exist. – Seph Reed Nov 03 '17 at 00:07
  • What do mean mean by "the limit of depth"? Do you mean match `"test"` only is `"test"` if after index `4` within `"test string"`? – guest271314 Nov 03 '17 at 00:08
  • The index in the string at which searching should stop. A limit, or range, for cases where the entire string need not be searched. – Seph Reed Nov 03 '17 at 00:09
  • Just check the `RegExp.lastIndex` value after each match and if it is bigger than your limit, stop. – Wiktor Stribiżew Nov 03 '17 at 00:09
  • "Without searching the entire string" is in the question title. It's also really inefficient in this case, the difference between searching 100 chars and 10,000. – Seph Reed Nov 03 '17 at 00:10
  • One approach would be to use a loop and `break` – guest271314 Nov 03 '17 at 00:10
  • a `for` loop will only work if there is a way to check a regex for matching an index. See #2 in my question, or the title of this question. – Seph Reed Nov 03 '17 at 00:11
  • @SephReed That is not accurate. You can begin and break the loop at any index of the string that you select – guest271314 Nov 03 '17 at 00:12
  • Not while using regex. It just blasts along until it gets a match. It's for loop is not exposed. If it was, this would be really easy. – Seph Reed Nov 03 '17 at 00:12
  • @SephReed You can use the `RegExp` within the `for` loop and specify the range of indexes which should be iterated, or simply utilize `===` operator and omit use of `RegExp` altogether. Not certain why you are attempting to complicate the procedure? – guest271314 Nov 03 '17 at 00:14
  • one of us must be misunderstanding the other. Would you mind showing me what you mean, because as far as I know, you're talking about matching strings, not using RegExp – Seph Reed Nov 03 '17 at 00:18
  • 2
    The purpose of `RegExp` is to match or not match strings. What is the definition of `RegExp` that you are using at the present Question? – guest271314 Nov 03 '17 at 00:18
  • @SephReed Sorry for not explaining the downvote, but I kept on making a list of criticisms and I don't think it would be that productive. So in short, the questioner should be considerate and update the question to reflect one's comments. Comments are ephemeral this not a chat room. Btw your answer is overcomplicated see my new answer. It can actually go through 100,000 characters very well because this is the 21st century and browsers are much more powerful than they were back in the 90s. – zer00ne Nov 04 '17 at 14:23

4 Answers4

1

I'm not sure if I understand correctly your requirements but I think you can use something like this (^.{0,4})(the actual regex)(.*) and adjust the length by changing 4 with [max size] - [actual regex length].

I don't know how exactly the regex engine is implemented in JavaScript but I think that by using something like this the engine won't process the entire string.

The example above will start the search at the start of the string. For starting at an offset, you can use something like this:

(^.{5}.{0,4})(the actual regex)(.*) where 5 is the offset.

Titus
  • 21,281
  • 1
  • 21
  • 33
  • first off, thank you for showing an understanding of what I'm trying to do. I don't think it's that complex of a notion. I like the idea you've shown, but it seems like it might be very wasteful with the way it collects it's matches. Trying over and over for each index would create lots of returned match strings, each bigger than the last. Still, thank you very much. This is very clever. – Seph Reed Nov 03 '17 at 00:24
  • as a clever person, you might like the solution that was found to this. See my answer above, if you're interested. It's the most I can do to thank you. – Seph Reed Nov 03 '17 at 00:59
0

You can use a for loop and break

let str = "test string";

let match = "string";

let limit = 4;

let res = "";

for (let i = 0; i < str.length; i++) {
  console.log(i, i < limit);
  if (i >= limit - 1) {
    if (!res) {
      res = null;
    }
    break;
  };
  // alternatively, using `RegExp`
  // new RegExp(str[i], "i").test(match[i])
  if (str[i] === match[i]) { 
    res += str[i]
  }
}

console.log(res);
guest271314
  • 1
  • 12
  • 91
  • 170
  • This is a non-native implementation of indexOf(). I'm trying to match a RegExp. change match to `/[i-k]/`, and you'll see what I mean. – Seph Reed Nov 03 '17 at 00:26
  • @SephReed What are you talking about? `.indexOf()` is not used at Answer. Would suggest providing an actual example of what you are trying to achieve instead of fragments of pseudo code that have no meaning when viewed at discreet comments; see https://stackoverflow.com/help/mcve – guest271314 Nov 03 '17 at 00:27
  • Are you trying to create a `RegExp` similar to the code at Question at [Most efficient method to check for range of numbers within number without duplicates](https://stackoverflow.com/questions/34487374/most-efficient-method-to-check-for-range-of-numbers-within-number-without-duplic)? – guest271314 Nov 03 '17 at 00:29
  • You just wrote what's essentially an indexOf function. It does not have any RegExp functionality, for instance it can't match `/.+a?[1-5]/` – Seph Reed Nov 03 '17 at 00:30
  • You can match whatever you want to match. Though you need to provide a concrete example of exactly what you are trying to achieve. As it stands your Question is not clear and becomes even less clear at each new comment which does not refer to the actual code used being included at original Question. Consider taking the time to create a minimal, complete verifiable example of input, expected result and the parameters required to be adhered to to get expected result; see https://stackoverflow.com/help/how-to-ask, https://stackoverflow.com/help/mcve – guest271314 Nov 03 '17 at 00:31
  • I answered my own question. I understand MVCE questions, this is not a case in which one is possible because I'm asking for a function that does not yet exist. I'm going to put this bluntly, I think you just lack the facilities to grasp the question. – Seph Reed Nov 03 '17 at 00:54
  • If your solution meets your own requirement, consider marking your own Answer as accepted. It may be that am lacking the facilities to grasp the question; that is certainly possible. Half of the catch is the throw. – guest271314 Nov 03 '17 at 00:55
  • I'm content with the throw I made. It's unique at worst. – Seph Reed Nov 03 '17 at 00:58
  • @SephReed Have traveled that path before. See also https://stackoverflow.com/help/self-answer – guest271314 Nov 03 '17 at 00:59
  • 2 days from now, it will be possible. Thank you for your help. Sorry if my temper was hard to keep. Getting down voted for a real, non-trivial question, is always harder than I bargain for. – Seph Reed Nov 03 '17 at 01:01
  • 1
    @SephReed No worries. Have traveled that path before as well. "downvote" without accompanying reason is repugnant to this user, though encouraged by (meta) SO culture. Have not cast a "downvote", here that sustained; cast one once, then retracted it. All questions have answers, that is the basis of the universe; where effort is put forth to ask a question. Stated what meant here at the time, that the Question was not clear, and cast vote for Question closure, not "downvote" of the Question itself. Will re-read your Answer to determine what the code does that a `for` loop cannot achieve. – guest271314 Nov 03 '17 at 01:04
  • @SephReed fwiw, spent over a year doing the math by hand and considering this Question [How to improve efficiency of algorithm which generates next lexicographic permutation?](https://stackoverflow.com/questions/43290467/how-to-improve-efficiency-of-algorithm-which-generates-next-lexicographic-permut) before actually posting it; and still have the sense that there is a solution to the inquiry which has yet to be thoroughly manifested. There are half as many "downvote"s cast for the Question as "upvote"s. – guest271314 Nov 03 '17 at 01:10
  • on reviewing my code, it is apparent that a for loop could possibly be used. The major thing introduced by my code, though, is not the loop so much as the "hack" of making the regex match any character if it doesn't match the desired regexp first. Without this concept, the for or while loop wouldn't be able to keep the .exec from going to the end. – Seph Reed Nov 03 '17 at 01:14
  • @SephReed The complete description of the requirement at OP could possibly be improved, yes? _"is not the loop so much as the "hack" of making the regex match any character if it doesn't match the desired regexp first. Without this concept, the for or while loop wouldn't be able to keep the .exec from going to the end."_ does not appear at text of original Question. If that is meaningless now, you can simply accept your own Answer. If your goal is to both convey your pursuit in its full distinction for posterity and the solution, perhaps consider providing further details at both your Q&A? – guest271314 Nov 03 '17 at 01:19
0

EDIT: ignore the following, the true answer to this question is that substr does not work in js the way I thought it did. In native browser code, strings are immutable, so when substr() is called, it can just reference the original string with a new start and length property. It's super efficient.


OLD ANSWER:

I've managed to figure out a solution for this, although it's ugly. The idea is to use a regex that matches the thing you are looking for, or anything. Because it will match anything if it doesn't match what you want, it moves one char at a time. Using capturing groups, you can check if it's matching the thing you want, or the dummy matches.

The code for it is like so:

var stringToSearch = "hey hi ho here we go, woop de do an2 for you, lalala lelele lolol goop da woop ba bee";

var whatIWant = "a.+[1-3]";
var anyChar = ".|\n"

var myRegex = new RegExp("("+whatIWant+")|"+anyChar, 'g');


function boundedRegexSearch(searchMe, regex, start, end) {
  regex.lastIndex = start; //start search index
  
  var keepSearching = true;
  while(keepSearching) {
    var result = regex.exec(searchMe);
    var whatImSearchingFor = result[1];
  
    if(whatImSearchingFor !== undefined)
      return whatImSearchingFor;
   
    keepSearching = regex.lastIndex < end;
  }
}


//can not match anything from index 0 to 8, returns undefined
console.log(boundedRegexSearch(stringToSearch, myRegex, 0, 8));

//can match "an2" at index 36
console.log(boundedRegexSearch(stringToSearch, myRegex, 20, 40));
Seph Reed
  • 6,511
  • 8
  • 42
  • 89
  • Good use of `lastIndex`, the `.|\n` does the trick, it increases `lastIndex` by one at each iteration. – Titus Nov 03 '17 at 01:14
  • while also dead-ending anything that doesn't match the desired regex. thank heavens there was a way. I'm building a cpp parser in node js, and my next best bet was writing RegExp from scratch. – Seph Reed Nov 03 '17 at 01:16
  • Why did you take the time to edit your Answer, but not edit the original Question to provide greater clarity as to what the requirement is; for the Answer to match the Question 1:1? – guest271314 Nov 03 '17 at 01:34
  • I think the question and answer are matched 1:1 pretty well atm. I don't know how much clearer the question could be, but you are welcome to make edits. Your perspective on what might be confusing is far greater than mine. I will be accepting this as the answer after 2 days (min time allowed). – Seph Reed Nov 03 '17 at 02:04
  • Never use `(.|\n)` to match any char in a JS regex. Either use `[^]` or `[\s\S]` (if you need to port it later to other NFA regex engines) – Wiktor Stribiżew Nov 03 '17 at 07:22
  • In my new answer, [my simple function is 81% faster than boundedRegexSearch() when searching a string of over 100,000 character.](https://jsperf.com/so47086482) boundedRegexSearch() fails on many levels, but it should be obvious that using .|\n to drive incrementing a search is unnecessary and the g flag shouldn't even be used if you want to stop after the first search. There's no point in crawling over each char. the methods provided to the String and Regex Objects have optimal algorithms baked right in. And yes, even `substring()` can be used in a far better way and be faster and efficient. – zer00ne Nov 04 '17 at 13:52
0

OP Question

How can I check if a Regex has a match in a string before some index, without it searching the entire string?

Comment added to question

If it doesn't find a match, and I'm only looking 100 chars deep on a 100000 char text, that's an issue. And this isn't about an expected output, it's about a function that doesn't seem to exist.

Taking both of these requirements from a badly managed question into consideration, we should do the following:

  • Take 4 parameters: the string str, the RegEx tgt, the start index ptA, and the end index ptZ

  • Extract a new string by using slice(ptA, (ptZ+1)). The ptZ+1 is because ptZ is exclusive.

  • Then use match() on the new string using tgt.

Since the OP emphasized performance, here's a test between OP answer and this answer at jsPerf.

Demo

const str = `Anything suspicious? Well... then should we go? Any uh... Cartel news these days? Seems like I'm always reading something or other in the paper. I don't want to talk about it. To you or anyone else. I'm done explaining myself. Gus is dead. We've got work to do. `;

let tgt = /Cartel/;

function searchIndexRange(str, rgx, ptA, ptZ) {
  const frag = str.slice(ptA, (ptZ + 1));
  let exists = frag.match(rgx);
  if (exists !== -1) {
    return exists;
  } else return false;
}

// First call returns null 
// Second call returns the search key "Cartel"

let res = JSON.stringify(searchIndexRange(str, tgt, 0, 1));
console.log(res);

res = JSON.stringify(searchIndexRange(str, tgt, 0, 100));
console.log(res);
Community
  • 1
  • 1
zer00ne
  • 36,692
  • 5
  • 39
  • 58
  • While `splice` is pretty much the same as `substring`, your point still stands. I was mistaken to think that creating the same string over and over did not have some extra efficiencies (most likely not actually doing it until a copy is altered), that would make it so fast. I suppose I over thought this whole thing, and forgot that js is not a low level language. I only wish it was more transparent what it's actually doing. – Seph Reed Nov 04 '17 at 21:43
  • Trust in the methods already provided, the algorithms within are optimized to consider all possibilities that matter (not all of them of course otherwise it'd be perfect world without bugs.) – zer00ne Nov 04 '17 at 21:52
  • I prefer understanding to trust. By not trusting, it's unearthed the potential for depth in the String functions I had never realized was there. I've spent a lot of time worrying about needlessly making copies of strings in js, and now I'm beginning to suspect it only pretends to do the copy. And while this is cool, I'd rather just know. It makes me wonder just how different the native functions actions are from their facades. – Seph Reed Nov 04 '17 at 21:59
  • I can't afford to pontificate, my job is purely based on results and performance, much like how SO is in a way. If it's broken fix it. Perhaps https://codereview.stackexchange.com/ is more suited for you? – zer00ne Nov 04 '17 at 22:04
  • Also, I just spent the passed 2 days implementing RegExp. The code for one is amazingly short, but very complex. Unfortunately, my non-native implementation is about 87x slower, but the project I'm currently on is entirely proof of concept right now and will be moving from js to c++ once everything has been proven to work (with out yet being super efficient). – Seph Reed Nov 04 '17 at 22:05
  • point is, I'm working on efficiency as well, but in a strange way. I'm coding in js, to redo it later in c++. If something is only fast because js is doing something tricky, well my question still falters, but it is ultimately not going to work out when the project switches languages. That's why I felt it so important to not use substring or the like. – Seph Reed Nov 04 '17 at 22:08
  • C++ is 100 times more powerful, but if I remember from my college days, C++, C, and Java handles arrays differently which is a good indication of how memory is allocated. C++ has pointers, real classes, block scope, etc... good luck, sir. – zer00ne Nov 04 '17 at 22:19
  • thanks. Also, in case you ever try to write regex, the trick is to use recursive calls for each char (l(i(k(e( (t(h(i(s))))))), while passing call backs to parent groups. The hardest case is repeated groups in which you first pass the match end, and then move back. ie `(a|b)+b` in `abababba`, first gets to the end of the string trying to be big, then has to move back until it can end on a 'b'. – Seph Reed Nov 05 '17 at 00:08
  • Like I said, results and performance is what I get paid for, edge cases such as `abababba` are not a real concern. Calling recursive calls on a string for C++ you can try `regex_match` function and use pointers to each match or something like that I haven't used C++ for years. – zer00ne Nov 05 '17 at 06:18