
In short: I am trying to find out whether a certain string exists between two other strings/groups.

Background: I am migrating a Confluence Data Center instance to Cloud. Some macros, and some combinations of macros nested inside each other, will no longer be supported there. I want to find the problematic pages beforehand.
I can search the entire database in the page storage format with regex, using the plugin "Search and Replace" from Rumpelcoder. (Documentation of their regex functionality.) For now I am not planning to replace anything automatically, only to find the pages. Most problem combinations will occur on only 0-50 out of 20000 pages, and many of these cases then need a human to fix them in the wiki page itself rather than in the storage syntax. So the start and end of the regex match can be rough, i.e. it does not matter whether they include the buzzwords or not. And I don't need captured groups that could be rearranged and written back.

Possible solutions, as far as a beginner can think of them:
a) Via lookbehind and lookahead? Maybe negative ones as well.
b) Just three groups, like I did, but with a correct middle group.
c) Counting brackets {} for the 2nd and 3rd search group, then checking whether 2 comes before 3 (then skip) or 3 before 2 (then match). This sounds the most promising so far. :-)
d) An operation to find a string inside an already defined match.
e) Counting the number of occurrences of group 2 between 1 and 3. If there are zero occurrences, it's fine and the search can move on to the next group 1.

Or some other clever construct. :-) Any working solution would be highly welcome. Thanks. And I am sorry if I have not yet found the appropriate answer here on Stack Overflow, maybe also because of the many possibilities and my lack of knowledge to discard the pointless attempts and pick the promising one.
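
Idea d) can be sketched outside the plugin, for instance on exported storage-format text. The following is only a minimal Python sketch, not something the Search and Replace plugin is known to support. The names `MACRO_BLOCK` and `blocks_containing` are made up for illustration, and it assumes macros are not nested (with nesting, the lazy match would stop at the first inner closing tag).

```python
import re

# Step 1: lazily match one macro block, from an opening tag to the NEXT closing tag.
MACRO_BLOCK = re.compile(
    r'<ac:structured-macro.*?</ac:structured-macro>',
    re.DOTALL,  # let "." cross line breaks
)


def blocks_containing(page: str, needle: str = '"toc"') -> list[str]:
    """Step 2: keep only the blocks whose own text contains the needle."""
    return [m.group(0) for m in MACRO_BLOCK.finditer(page) if needle in m.group(0)]


harmless = '<ac:structured-macro a </ac:structured-macro> ac:name="toc" </ac:structured-macro>'
dangerous = '<ac:structured-macro ac:name="toc" b </ac:structured-macro>'

print(len(blocks_containing(harmless)))   # no block contains "toc"
print(len(blocks_containing(dangerous)))  # one block contains "toc"
```

Because the second step only looks inside text that was already matched, a "toc" that appears after the first closing tag cannot pull the match forward.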

The storage format of Confluence looks a bit like HTML with some <...> and </...> that can be used as search terms.

The best I have come up with so far is an attempt of type b) with 3 groups:
/(<ac:structured-macro)(.*?"toc".*?)(<\/ac:structured-macro>)/gm
It should search across multiple lines.
The g flag finds all results on the page. That is handy for the example here, but not really necessary in the real search, because even a single match on a page means manual work. Without the g, the search should also be considerably faster on all pages that have more than one hit.
A match starts wherever "<ac:structured-macro" occurs.
Once "toc" occurs, the match finishes at the next "</ac:structured-macro>".

Obviously this differs from what I actually want: the match should definitely end at the next "</ac:structured-macro>". It should match if there is a "toc" between the first and the last group, and not match if there is no "toc" between them.
As it stands, if the "toc" only occurs after the first "</ac:structured-macro>", the search above simply carries on to some later "</ac:structured-macro>" in any paragraph.
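
To make the overshoot concrete, here is a small Python sketch that runs the same pattern over a shortened version of the first example further down. It assumes Python's `re` module rather than the plugin's regex engine, and it uses `re.DOTALL` so that `.` can cross line breaks (the m flag on its own only changes how `^` and `$` behave).

```python
import re

# The pattern from the question; "/" needs no escaping in Python.
ORIGINAL = re.compile(
    r'(<ac:structured-macro)(.*?"toc".*?)(</ac:structured-macro>)',
    re.DOTALL,
)

# Shortened example 1: a harmless macro, then a stray "toc" later in the page.
page = (
    '<ac:structured-macro harmless </ac:structured-macro> more text '
    'ac:name="toc" text </ac:structured-macro>'
)

match = ORIGINAL.search(page)

# The lazy ".*?" before "toc" is free to run past the first closing tag,
# so the match ends up spanning two closing tags instead of one.
overshoots = match is not None and match.group(0).count('</ac:structured-macro>') > 1
print(overshoots)
```

The lazy quantifier only means "as short as possible while still matching", so nothing stops it from crossing a closing tag when that is the only way to reach "toc".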

Here are some text examples for the search. The entire block can be pasted into a search box.

  1. Should not match here. (But still does so wrongly with the original syntax.)
    <ac:structured-macro text. no dangerous structure or content and thus no buzzwords from the Regex. ends with: </ac:structured-macro> more text. Should not be in the match here! Somewhere later an out of the danger-zone and therefore harmless buzzword: ac:name="toc" text. random end of anything somewhere in the file: </ac:structured-macro>.
  2. Should match here:
    <ac:structured-macro text. dangerous macro with syntax buzz word ac:name="toc" more text. End of outer macro: </ac:structured-macro>.
  3. Should match multilines:
    <ac:structured-macro text. any amount of line breaks.
    dangerous macro: ac:name="toc" more text.

    or any completely empty line.
    End of macro: </ac:structured-macro>.
  4. Should not match here:
    <ac:structured-macro text. nothing dangerous. End of macro: </ac:structured-macro>.

  • Okay, the problem is solved with: ```/(<ac:structured-macro)(?:(?!<\/ac:structured-macro>)[\s\S])*?("toc")[\s\S]*?(<\/ac:structured-macro>)/gm``` from the page https://stackoverflow.com/questions/47296024/regex-match-sequence-of-three-strings-along-with-text-inbetween – Peter Mauer Dec 06 '21 at 08:32
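
One way to keep the middle part from running past a closing tag is a negative lookahead checked before every consumed character. The sketch below tries such a pattern in Python against shortened versions of the four examples from the question. It assumes Python's `re` module, not the plugin's engine, and the names `CLOSE`, `TEMPERED`, and `examples` are made up for illustration.

```python
import re

CLOSE = r'</ac:structured-macro>'

TEMPERED = re.compile(
    r'(<ac:structured-macro)'
    # Consume one character at a time, but only where no closing tag starts.
    + r'(?:(?!' + CLOSE + r')[\s\S])*?'
    + r'("toc")'
    # After "toc", run lazily to the next closing tag.
    + r'[\s\S]*?'
    + r'(' + CLOSE + r')'
)

examples = {
    1: '<ac:structured-macro text. ends with: </ac:structured-macro> more text. '
       'ac:name="toc" text. </ac:structured-macro>.',
    2: '<ac:structured-macro text. ac:name="toc" more text. </ac:structured-macro>.',
    3: '<ac:structured-macro text.\ndangerous macro: ac:name="toc" more text.\n\n'
       'End of macro: </ac:structured-macro>.',
    4: '<ac:structured-macro text. nothing dangerous. </ac:structured-macro>.',
}

# True means the example is flagged as containing "toc" inside a macro.
results = {n: TEMPERED.search(text) is not None for n, text in examples.items()}
print(results)  # {1: False, 2: True, 3: True, 4: False}
```

`[\s\S]` is used instead of `.` so that line breaks are matched without any extra flag, which is why example 3 is found even though it spans several lines.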
