
All:

As the subject states, I'm running into an issue with Grep Perl Non-Greedy Scope RegEx Matching on an Empty String.

[Note: For the purposes of this example assume that the 'title' can be a complex, alpha-numeric, special-character, multi-word, space-separated, string.]

# echo "<span class=\"title\"></span><span class=\"price\">0.25</span><span class=\"title\">Banana</span><span class=\"price\">0.10</span><span class=\"title\">Grape</span><span class=\"price\">0.05</span>" | /opt/bin/grep -ioP "<span class=\"title\">(.+?)</span><span class=\"price\">(.+?)</span>" | sed "s/<span class=\"title\">//g; s/<span class=\"price\">/|/g; s/<\/span>//g;"
|0.25Banana|0.10
Grape|0.05

As you can see, the first 'title' match is empty, but the grep perl non-greedy scope regex (.+?) still matches.

Shouldn't the first 'title' match be ignored? What am I missing?

Thank you for your assistance.

UPDATE:

Negating the less-than sign ([^<]+?) is a good solution for the original, basic example. However, I'm finding that it runs into problems when more data is introduced.

I've attempted to expand the match to include additional trailing tags, but the regex appears to still be failing with that change as well.

# echo "<span class=\"title\"></span></div></div><span class=\"price\">0.25</span><span class=\"title\">Banana</span></div></a><span class=\"price\">0.10</span><span class=\"title\">Grape</span></div></a><span class=\"price\">0.05</span>" | grep -ioP "<span class=\"title\">(.+?)</span></div></a><span class=\"price\">(.+?)</span>" | sed "s/<span class=\"title\">//g; s/<span class=\"price\">/|/g; s/<\/span>//g; s/<\/div>//g; s/<\/a>//g;"
|0.25Banana|0.10
Grape|0.05

Shouldn't the regex match on the </span></div></a> tags, but not on the </span></div></div> tags?

Thanks, again, for your time and assistance.

Andy A.
  • [Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) – Cyrus May 29 '21 at 10:01
  • In the pattern you use `(.+?)`, where at least a single char must be matched due to the `+`. So it will match until the first closing span and it cannot match the empty span. You can change it to `.*?` – The fourth bird May 29 '21 at 10:02
  • The goal is to not match the group, if the 'title' or 'price' is empty. Changing the regex to .*? will enable empty strings to match, which is counter to what I'm attempting to accomplish. I appreciate your feedback. – Gary C. New May 29 '21 at 11:32

2 Answers


Your chosen regular expression <span class="title">(.+?)</span> requires at least one character inside the title tag. When the title is empty, the regex does not skip that tag. Instead, (.+?) consumes the empty tag's closing </span> and keeps capturing until the next </span> that is followed by a price tag, which is definitely not what you intended.

Perhaps the following code is self-explanatory:

use strict;
use warnings;

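# Title must contain at least one character of any kind; price may be empty.
my $re = qr!<span class="title">(.+?)</span><span class="price">(.*?)</span>!;

# Slurp all of DATA, then collect every (title, price) capture pair into a hash.
my $input = do { local $/; <DATA> };
my %data = $input =~ /$re/g;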

for my $k ( sort keys %data ) {
    printf "| %-10s | %6.2f |\n", $k, $data{$k};
}

__DATA__
<span class="title"></span><span class="price">0.25</span><span class="title">Banana</span><span class="price">0.10</span><span class="title">Grape</span><span class="price">0.05</span>

Output

| </span><span class="price">0.25</span><span class="title">Banana |   0.10 |
| Grape      |   0.05 |

Perhaps you intended to use the following regular expression, where [^<] keeps the title capture from crossing a tag boundary:

use strict;
use warnings;

my $re = qr!<span class="title">([^<]+?)</span><span class="price">(.*?)</span>!;

my $input = do { local $/; <DATA> };
my %data = $input =~ /$re/g;

for my $k ( sort keys %data ) {
    printf "| %-10s | %6.2f |\n", $k, $data{$k};
}

__DATA__
<span class="title"></span><span class="price">0.25</span><span class="title">Banana</span><span class="price">0.10</span><span class="title">Grape</span><span class="price">0.05</span>

Output

| Banana     |   0.10 |
| Grape      |   0.05 |

So, if you choose to stay with grep and sed, the command would perhaps take the following shape:

echo "<span class=\"title\"></span><span class=\"price\">0.25</span><span class=\"title\">Banana</span><span class=\"price\">0.10</span><span class=\"title\">Grape</span><span class=\"price\">0.05</span>" | grep -ioP "<span class=\"title\">([^<]+?)</span><span class=\"price\">(.+?)</span>" | sed "s/<span class=\"title\">//g; s/<span class=\"price\">/|/g; s/<\/span>//g;"

Output

Banana|0.10
Grape|0.05
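
The same negated character class also appears to cope with the input from the question's UPDATE, where the empty title is followed by </div></div> rather than </div></a>. What follows is a minimal sketch, assuming a GNU grep built with PCRE support (-P), as in the commands above; the variable names input and result are only for illustration.

```shell
#!/bin/sh
# Input copied from the question's UPDATE: the first title is empty
# and is followed by </div></div> instead of </div></a>.
input='<span class="title"></span></div></div><span class="price">0.25</span><span class="title">Banana</span></div></a><span class="price">0.10</span><span class="title">Grape</span></div></a><span class="price">0.05</span>'

# [^<]+ needs at least one character and can never run past a "<",
# so the empty title cannot swallow the tags that follow it.
result=$(printf '%s\n' "$input" \
  | grep -oP '<span class="title">[^<]+</span></div></a><span class="price">[^<]+</span>' \
  | sed 's/<span class="title">//; s/<\/span><\/div><\/a><span class="price">/|/; s/<\/span>//')

printf '%s\n' "$result"
```

Because neither capture can contain a "<", a title or price that legitimately includes markup would be skipped by this pattern.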

If Perl is available on your system, it would perhaps be easier to use its power directly.

Polar Bear

@PolarBear Success! With your guidance, I finally figured out the optimal solution for my particular issue, still making use of the original non-greedy scope regex match (.+?), which was to include additional leading tags that uniquely identified the specific groups I was targeting while excluding those that did not match. Appreciate your assistance and positive feedback.

  • Gary -- I suggest looking at the following [document](https://www.regular-expressions.info/lookaround.html), particularly the lookahead and lookbehind regex patterns. Perhaps this _Stackoverflow_ [question](https://stackoverflow.com/questions/2973436/regex-lookahead-lookbehind-and-atomic-groups) will provide you with some useful information for your case. – Polar Bear May 31 '21 at 19:49
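
As one possible reading of the lookahead suggestion, a negative lookahead can stop a capture at the closing </span> while still allowing other "<" characters inside the title. This is only a sketch, assuming GNU grep with -P support; the sample title containing a nested <b> tag is made up for illustration and does not come from the question.

```shell
#!/bin/sh
# Hypothetical input: an empty title, a title containing nested markup,
# and a plain title.
input='<span class="title"></span><span class="price">0.25</span><span class="title">Fruit <b>Salad</b></span><span class="price">1.50</span><span class="title">Grape</span><span class="price">0.05</span>'

# (?:(?!</span>).)+ matches one or more characters, checking before each
# one that it does not begin "</span>". An empty title therefore fails,
# and a capture can never cross a closing span.
result=$(printf '%s\n' "$input" \
  | grep -oP '<span class="title">(?:(?!</span>).)+</span><span class="price">(?:(?!</span>).)+</span>' \
  | sed 's/<span class="title">//; s/<\/span><span class="price">/|/; s/<\/span>//')

printf '%s\n' "$result"
```

Compared with [^<]+, this form tolerates markup inside a title, at the cost of a harder-to-read pattern.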