135

I need to capture multiple groups of the same pattern. Suppose I have the following string:

HELLO,THERE,WORLD

And I've written the following pattern:

^(?:([A-Z]+),?)+$

What I want it to do is capture every single word, so that Group 1 is "HELLO", Group 2 is "THERE", and Group 3 is "WORLD". What my regex actually captures is only the last one, which is "WORLD".

I'm testing my regular expression here and I want to use it with Swift (maybe there's a way in Swift to get intermediate results somehow, so that I can use them?)

UPDATE: I don't want to use split. I just need to know how to capture all the groups that match the pattern, not only the last one.

Wiktor Stribiżew
phbelov
  • 6
    why not split on `,`? – rock321987 May 03 '16 at 12:04
  • why not use `[A-Z]+` or `[^,]+` to capture the results – rock321987 May 03 '16 at 12:07
  • rock321987, I've updated the input string. I need to extract exactly the string that follows the above pattern. And I need to get all the groups matched the pattern, not only the last one. I want to know how to do it with regex. – phbelov May 03 '16 at 12:09
  • need more input and output..its still not clear – rock321987 May 03 '16 at 12:11
  • 4
    rock321987, what is unclear? I need every word of the string to be a matched group, but my pattern only captures the last one ("WORLD"). – phbelov May 03 '16 at 12:14
  • you updated your question and it became unclear..now you are back to where you started..one more thing..remove anchors `^` and `$`(it still won't work)..`(?:([A-Z]+),?)` will work but you need to find all matches using global flag – rock321987 May 03 '16 at 12:16
  • 1
    use this [answer](http://stackoverflow.com/a/27880748/1996394) for finding all matches – rock321987 May 03 '16 at 12:19

11 Answers

87

With one group in the pattern, you can only get one exact result in that group. If your capture group gets repeated by the pattern (you used the + quantifier on the surrounding non-capturing group), only the last value that matches it gets stored.

To get every word, use your language's regex function for finding all matches of a pattern. For that to work, remove the anchors and the quantifier on the non-capturing group (you can omit the non-capturing group itself as well).
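As a minimal sketch of both points, here is the behaviour in Python. Python's `re` module is used purely for illustration (the question targets Swift, where the API differs), and the variable names are hypothetical.

```python
import re

text = "HELLO,THERE,WORLD"

# Repeated capture group: group 1 is overwritten on every iteration,
# so only the last word survives.
repeated = re.match(r"^(?:([A-Z]+),?)+$", text)
last_only = repeated.group(1)

# Anchors and quantifier removed: ask the engine for every match instead.
all_words = re.findall(r"[A-Z]+", text)

print(last_only)   # WORLD
print(all_words)   # ['HELLO', 'THERE', 'WORLD']
```

Note that the find-all form no longer validates the string as a whole; if that matters, a separate anchored check can be run first.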

Alternatively, expand your regex and let the pattern contain one capturing group per group you want to get in the result:

^([A-Z]+),([A-Z]+),([A-Z]+)$
Byte Commander
  • 34
    How would this be adjusted to account for a varying number of strings? e.g. HELLO,WORLD and HELLO,THERE,MY,WORLD. I'm looking for just one expression to handle both examples and with flexibility built in for even longer string arrays – Chris Feb 26 '18 at 20:07
  • 14
    @Chris It can't be generalized. As the answer states, a capture group can only capture one thing, and there's no way to create a dynamic number of capture groups. – Barmar Oct 04 '18 at 17:02
  • 1
    Re "_How would this be adjusted to account for a varying number of strings?_" -- For those who still come to this page -- build it dynamically using the tools of the language at hand. Take the subpattern (`([A-Z]+)` here) as a string or as a regex pattern (depending on the language) and join N of them (with commas in this case), and then turn that into a regex pattern or just use it in regex (again, depending on the language). It's usually fairly simple. (I assumed this answer to take that for granted, that one can build it dynamically.) – zdim Jan 19 '22 at 20:31
22

The key distinction is repeating a captured group instead of capturing a repeated group.

As you have already found out, the difference is that repeating a captured group captures only the last iteration. Capturing a repeated group captures all iterations.
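A hedged Python sketch of that distinction, using the terminology from regular-expressions.info (the patterns and names below are illustrative assumptions, not taken from this answer). Within a single match, a repeated capturing group keeps only its last iteration, while a capturing group wrapped around the repetition holds the whole repeated text as one string. Neither form produces one group per word.

```python
import re

text = "HELLO,THERE,WORLD"

# Repeating a capturing group: group 1 is overwritten on each iteration.
m1 = re.fullmatch(r"(\w+,?)+", text)
print(m1.group(1))  # WORLD

# Capturing a repeated group: group 1 spans all iterations as ONE string.
m2 = re.fullmatch(r"((?:\w+,?)+)", text)
print(m2.group(1))  # HELLO,THERE,WORLD
```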

In PCRE (PHP):

((?:\w+)+),?
Match 1, Group 1.    0-5      HELLO
Match 2, Group 1.    6-11     THERE
Match 3, Group 1.    12-20    BRUTALLY
Match 4, Group 1.    21-26    CRUEL
Match 5, Group 1.    27-32    WORLD

Since all captures are in Group 1, you only need $1 for substitution.

I used the following general form of this regular expression:

((?:{{RE}})+)

Example at regex101

ssent1
  • 407
  • 2
  • 4
  • 8
    "Capturing a repeated group captures all iterations." In your regex101 try to replace your regex with `(\w+),?` and it will give you the same result. The key here is the `g` flag which repeats your pattern to match into multiple groups. – Thomas LAURENT Jan 22 '21 at 11:21
  • This is so wrong. "Capturing a repeated group captures all iterations": yes but it will capture ALL of them in only ONE match (containing them all). Your example should be `((?:\w,?)+)` . You have multiple matches here only because of the g flag as @thomas-laurent stated. There is no way to have multiple matches from one capturing group. You have to extract and preg_match_all (or equivalent function) the repeating group. – Pierre Dec 23 '21 at 14:36
  • @Pierre Thanks for your clarification. Based on the original question(s), we have to make an assumption about what is needed. First, he said, "I want to capture every word, so that Group 1 is: `HELLO`…Group 3 is `WORLD`…" Your distinction is important because unique backreference groups are necessary for this case. The table above shows all matches assigned to `Group 1`. As a result, `((?:\w+)+),?` does not work. Going on to sum up, he said, "I need to capture all the groups that match the pattern, not only the last one." `((?:\w+)+),?` accomplishes this with the `g` flag enabled. – ssent1 Jan 08 '22 at 16:53
  • @ssent1 Your `((?:\w+)+),?` is equivalent to `(\w+),?`. Your enclosing anonymous group is **never repeated**. This is misleading, **there is nothing like "capturing a repeated group [in multiple matches]"**. Unfortunately, nothing in regexp can match the same group multiple times. There is only the g flag and preg_match_all, which execute the regexp iteratively on the remaining unmatched string. – Pierre Jan 09 '22 at 17:38
  • (Note that in my first answer there is a typo in the regex I suggested, should read: `((?:\w+,?)+)`) – Pierre Jan 09 '22 at 17:48
  • 1
    @Pierre You're correct. And yet it seems like there is still a distinction to be made between [Repeating a Capturing Group vs. Capturing a Repeated Group](https://www.regular-expressions.info/captureall.html). On a practical level, it could be part of a functional solution. Ultimately, if a 'bulletproof' solution is needed, it's probably better to do it programmatically. – ssent1 Feb 17 '22 at 16:45
8

I think you need something like this....

import re

b = "HELLO,THERE,WORLD"
re.findall(r'\w+', b)

which in Python 3 returns

['HELLO', 'THERE', 'WORLD']
Tim Seed
5

After reading Byte Commander's answer, I want to introduce a tiny possible improvement:

You can generate a regexp that will match up to n words, as long as n is predetermined. For instance, to match between 1 and 3 words, the regexp:

^([A-Z]+)(?:,([A-Z]+))?(?:,([A-Z]+))?$

will match the following strings, filling one, two, or three capturing groups:

HELLO,LITTLE,WORLD
HELLO,WORLD
HELLO

You can see a fully detailed explanation about this regular expression on Regex101.
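To make the behaviour concrete, here is a small Python sketch (Python is only an illustrative choice, and the variable names are hypothetical) that applies the pattern above to the three sample strings. Optional groups that do not take part in the match come back as `None`, so the caller still has to filter them out.

```python
import re

pattern = re.compile(r"^([A-Z]+)(?:,([A-Z]+))?(?:,([A-Z]+))?$")
lines = ["HELLO,LITTLE,WORLD", "HELLO,WORLD", "HELLO"]

# One tuple of groups per input line; unused optional groups are None.
results = [pattern.match(line).groups() for line in lines]

for groups in results:
    # Keep only the groups that actually captured something.
    words = [g for g in groups if g is not None]
    print(groups, words)
```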

As I said, it is pretty easy to generate this regexp for any number of groups using your favorite language. Since I'm not much of a Swift guy, here's a Ruby example:

def make_regexp(group_regexp, count: 3, delimiter: ",")
  regexp_str = "^(#{group_regexp})"
  (count - 1).times do
    regexp_str += "(?:#{delimiter}(#{group_regexp}))?"
  end
  regexp_str += "$"
  return regexp_str
end

puts make_regexp("[A-Z]+")

That being said, I'd suggest not using a regular expression in that case; there are many other great tools, from a simple split to some tokenization patterns, depending on your needs. IMHO, a regular expression is not one of them. For instance, in Ruby I'd use something like str.split(",") or str.scan(/[A-Z]+/).

Ulysse BN
3

Just to provide an additional example of paragraph 2 in the answer: I'm not sure how critical it is for you to get three groups in one match rather than three matches using one group. E.g., in Groovy:

def subject = "HELLO,THERE,WORLD"
def pat = "([A-Z]+)"
def m = (subject =~ pat)
m.eachWithIndex{ g,i ->
  println "Match #$i: ${g[1]}"
}

Match #0: HELLO
Match #1: THERE
Match #2: WORLD
AndyJ
3

The problem with the attempted code, as discussed, is that there is one capture group matching repeatedly, so in the end only the last match is kept.

What is needed is to instruct the regex to capture all matches, which is available in any regex implementation (language). Then the trick is in writing the pattern so that it indeed matches all instances.

The defining property of the shown sample data is that the patterns of interest are separated by commas, so I'd suggest matching anything-but-a-comma, using a negated character class

[^,]+

and match (capture) globally -- get all matches in the string.

If your pattern needs to be more restrictive, adjust the exclusion list. For example, to capture words separated by any of the listed punctuation

[^,.!-]+

This extracts all words from hi,there-again!, without the punctuation. (The - should be given first or last in a character class.)
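A quick Python check of that second pattern, as a sketch only (the variable name is an assumption):

```python
import re

# '-' is placed last in the class so it is a literal, not a range.
words = re.findall(r"[^,.!-]+", "hi,there-again!")
print(words)  # ['hi', 'there', 'again']
```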

In Python

import re

string = "HELLO,THERE,WORLD"

pattern = r"([^,]+)"
matches = re.findall(pattern,string)

print(matches)

in Perl (and many other compatible systems)

use warnings;
use strict;
use feature 'say';

my $string = 'HELLO,THERE,WORLD';

my @matches = $string =~ /([^,]+)/g;

say "@matches";

(In this specific example the capturing () in fact aren't needed since we collect everything that is matched. But they don't hurt and in general they are needed.)

zdim
1

I know my answer comes late, but I ran into this today and solved it with the following approach:

^(([A-Z]+),)+([A-Z]+)$

So the first group, (([A-Z]+),)+, matches all the repeated patterns except the final one, and ([A-Z]+) matches the final one. This is dynamic no matter how many repeated groups are in the string.

  • 6
    This is not a solution to the problem. The question is not about matching the string, but about capturing all the groups. This regex still only captures the last match for the first, repeating group (with comma), plus the match in the final group (without comma). – gdwarf May 21 '20 at 09:53
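The point in the comment can be checked with a short Python sketch (Python chosen only for illustration): the pattern matches the whole string, but the groups it exposes never include the first word.

```python
import re

m = re.match(r"^(([A-Z]+),)+([A-Z]+)$", "HELLO,THERE,WORLD")

# Groups 1 and 2 hold only the LAST iteration of the repeated part;
# "HELLO" is matched but not retained in any group.
print(m.groups())  # ('THERE,', 'THERE', 'WORLD')
```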
1

You actually have one capture group that will match multiple times. Not multiple capture groups.

JavaScript (JS) solution:

let string = "HI,THERE,TOM";
let myRegexp = /([A-Z]+),?/g;       // modify as you like
let match = myRegexp.exec(string);  // js function, output described below
while (match != null) {             // loops through matches
  console.log(match[1]);            // do whatever you want with each match
  match = myRegexp.exec(string);    // find next match
}

Syntax:

// matched text: match[0]
// match start: match.index
// capturing group n: match[n]

As you can see, this will work for any number of matches.

Dheemanth Bhat
0

Sorry, not Swift, just a proof of concept in the closest language at hand.

// JavaScript POC. Output:
// Matches:  ["GOODBYE","CRUEL","WORLD","IM","LEAVING","U","TODAY"]

let str = `GOODBYE,CRUEL,WORLD,IM,LEAVING,U,TODAY`
let matches = [];

function recurse(str, matches) {
    let regex = /^((,?([A-Z]+))+)$/gm
    let m
    while ((m = regex.exec(str)) !== null) {
        matches.unshift(m[3])
        return str.replace(m[2], '')
    }
    return "bzzt!"
}

while ((str = recurse(str, matches)) != "bzzt!") ;
console.log("Matches: ", JSON.stringify(matches))

Note: If you were really going to use this, you would use the position of the match as given by the regex match function, not a string replace.

Orwellophile
0
  1. Design a regex that matches each particular element of the list rather than the list as a whole. Apply it with /g.
  2. Iterate through the matches, cleaning them of any garbage, such as list separators, that got mixed in. You may need another regex, or you can get by with a simple substring replace method.

The sample code is in JS, sorry :) The idea must be clear enough.

    const string = 'HELLO,THERE,WORLD';

    // First, use the following regex, which matches each list item separately:
    const captureListElement = /^[^,]+|,\w+/g;
    const matches = string.match(captureListElement);

    // Some of the matches may include the separator, so we have to clean them:
    const cleanMatches = matches.map(match => match.replace(',',''));

    console.log(cleanMatches);
0

Repeat the A-Z pattern in the group for the regular expression.

import re

data = "HELLO,THERE,WORLD"
pattern = r"([a-zA-Z]+)"
matches = re.findall(pattern, data)
print(matches)

output

['HELLO', 'THERE', 'WORLD']
Golden Lion