
I have a paragraph/sentence from which I want to identify

  1. any series of 6 or more digits
  2. any series of numbers with a "-" (dash)

but I don't want to identify

  1. any numbers preceded by a $ (dollar sign)
  2. any series of numbers containing a , (comma)

How can I achieve this?

The regex I tried is: r'(?:\s|^)(\d-?(\s)?){6,}(?=[?\s]|$)' but it's not accurate.

I'm looking for these patterns inside a paragraph

  • 123-456-789
  • 123-456
  • 123 456
  • 123 456 789

A match may also have a full stop (.) at the end. It should ignore the following patterns:

  • $123654

  • $ 123654
  • 12,4569
  • 123*123*7732
  • 123h434k5454
The fourth bird
  • 127,136
  • 16
  • 45
  • 63

1 Answer

You could match what you don't want and capture in a group what you want to keep.

Using re.findall the group 1 values will be returned.

Afterwards you might filter out the empty strings.

(?<!\S)(?:\$\s*\d+(?:\,\d+)?|(\d+(?:[ -]\d+)+\.?|\d{3,}))(?!\S)

In parts

  • (?<!\S) Assert a whitespace boundary on the left
  • (?: Non capture group
    • \$\s* Match a dollar sign, 0+ whitespace chars
    • \d+(?:\,\d+)? Match 1+ digits, optionally followed by a comma and 1+ digits
    • | Or
    • ( Capture group 1
      • \d+ Match 1+ digits
      • (?:[ -]\d+)+\.? Repeat 1+ times a space or - followed by 1+ digits, then match an optional .
      • | Or
      • \d{3,} Match 3 or more digits (or use {6,} for 6 or more)
    • ) Close group 1
  • ) Close non capture group
  • (?!\S) Assert a whitespace boundary on the right
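
To make the breakdown easier to check, here is a minimal sketch of the same pattern written with re.VERBOSE, so each part sits next to its comment. It assumes the {6,} variant mentioned above, and the sample sentence is made up from the patterns in the question; the names PATTERN, text and kept are only for illustration.

```python
import re

# Same structure as the pattern above, spelled out with re.VERBOSE.
# Assumption: the last alternative uses {6,} instead of {3,}.
PATTERN = re.compile(r"""
    (?<!\S)                      # whitespace boundary on the left
    (?:
        \$\s*\d+(?:,\d+)?        # dollar amount: matched, but not captured
      |
        (                        # group 1: what we want to keep
            \d+(?:[ -]\d+)+\.?   # digit runs joined by space or dash, optional dot
          |
            \d{6,}               # or a plain run of 6 or more digits
        )
    )
    (?!\S)                       # whitespace boundary on the right
""", re.VERBOSE)

# Made-up sentence built from the patterns listed in the question.
text = ("call 123-456-789 or 123 456. paid $123654 and $ 123654 "
        "for 12,4569 items, ref 123*123*7732 id 1234567")

raw = PATTERN.findall(text)      # dollar matches show up as empty strings
kept = list(filter(None, raw))   # drop the empty strings

print(raw)   # ['123-456-789', '123 456.', '', '', '1234567']
print(kept)  # ['123-456-789', '123 456.', '1234567']
```

The two empty strings come from the dollar amounts: they are consumed by the first alternative, so group 1 does not take part in those matches. The comma and asterisk cases never match at all, because the whitespace boundaries reject them.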


For example

import re

regex = r"(?<!\S)(?:\$\s*(?:\d+(?:\,\d+)?)|(\d+(?:[ -]\d+)+\.?|\d{3,}))(?!\S)"

test_str = ("123456\n"
    "1234567890\n"
    "12345\n\n"
    "12,123\n"
    # ... the remaining test lines are elided here
    )

print(list(filter(None, re.findall(regex, test_str))))

Output (for the complete test string, which is abbreviated above)

['123456', '1234567890', '12345', '1-2-3', '123-456-789', '123-456-789.', '123-456', '123 456', '123 456 789', '123 456 789.', '123 456 123 456 789', '123', '456', '123', '456', '789']
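
If you would rather not filter empty strings afterwards, a possible alternative (my own sketch, not part of the linked demos) is re.finditer: there, group 1 is None whenever the dollar alternative matched, so you can skip those matches directly. The helper name wanted_numbers and the sample string are hypothetical.

```python
import re

regex = r"(?<!\S)(?:\$\s*\d+(?:,\d+)?|(\d+(?:[ -]\d+)+\.?|\d{3,}))(?!\S)"

def wanted_numbers(text):
    """Collect group 1 of every match in which group 1 took part."""
    return [m.group(1) for m in re.finditer(regex, text)
            if m.group(1) is not None]

# Hypothetical sample: the first lines of the test string plus two extra cases.
sample = "123456\n1234567890\n12345\n\n12,123\n$ 500 123-456-789."

print(wanted_numbers(sample))
# ['123456', '1234567890', '12345', '123-456-789.']
```

Note that "$ 500" has to be consumed by the dollar alternative first; otherwise "500 123-456-789." would be picked up as one space-separated series.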
The fourth bird
  • My current requirement is to use the result in `if(re.match(regex, field.value.text.lower())):`, but this returns all matches and groups. I cannot use re.findall() here. I want only the group 1 result from re.match(). – pforpraphul Apr 21 '20 at 07:38
  • [re.match](https://docs.python.org/3/library/re.html#re.match) returns a [match object](https://docs.python.org/3/library/re.html#match-objects) from which you can get the [group](https://docs.python.org/3/library/re.html#re.Match.group) – The fourth bird Apr 21 '20 at 07:44
  • You mean, `re.match(regex, field.value.text.lower()).group` like this? – pforpraphul Apr 21 '20 at 07:48
  • Like `.group(1)` See this page for an example https://stackoverflow.com/questions/2703029/why-isnt-the-regular-expressions-non-capturing-group-working – The fourth bird Apr 21 '20 at 07:53
  • gotcha! in our case, our matching results are in group(1), right? – pforpraphul Apr 21 '20 at 07:56
  • That is correct. Note that re.match only matches at the start of the string (the docs say: `If zero or more characters at the beginning of string`). To match anywhere in the string, you could look at [re.search](https://docs.python.org/3/library/re.html#re.search) – The fourth bird Apr 21 '20 at 07:59
  • If the answer helped solve the problem, feel free to [mark the answer](https://stackoverflow.com/tour) as accepted by clicking ✓ on the left of this answer. Note that you get 2 [reputation points](https://stackoverflow.com/help/whats-reputation) for accepting a solution. – The fourth bird Apr 21 '20 at 08:01
  • How can I add a condition to also match these types of patterns: `+123456565 + 12345675` – pforpraphul Apr 21 '20 at 12:45
  • @PraphulNangeelil Hi there, sorry for the late response. You can do it like this: https://regex101.com/r/yDzRU3/1 You can prepend an optional group that matches a `+` followed by an optional space: `(?:\+ ?)?\d{3,}` – The fourth bird Apr 22 '20 at 08:13
  • if I want to add any other characters in the future, suppose I want to add @ -------- `(?:\+ ?)?(?:\@ ?)?\d{3,}` is this correct? `(? – pforpraphul Apr 22 '20 at 08:28
  • This part `(?:@ ?)?` will accept an `@` followed by an optional space. If you want to allow more characters, you could use a character class that allows any of the listed characters, like `(?:[@+] ?)?` – The fourth bird Apr 22 '20 at 08:34
  • But it fails when the pattern is +123-456-565 – pforpraphul Apr 22 '20 at 08:46
  • If you want it for both alternatives, you could prepend it before the alternation https://regex101.com/r/73VCnN/1 Otherwise, you have to add it to each alternative where you want to allow it https://regex101.com/r/tpKoXT/1 Note that you are extending the original question, and accounting for all the side effects will make the pattern larger. – The fourth bird Apr 22 '20 at 08:53
  • Yes, you have satisfied the basic requirement, but these are the issues I'm anticipating, which is why I raised all my queries. Thanks a lot, you helped a lot. – pforpraphul Apr 22 '20 at 08:57
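
To tie the comment thread together, here is a rough sketch of reading group 1 from a match object and of the optional `+` / `@` prefix placed before the inner alternation. The exact patterns behind the regex101 links are not reproduced here; PREFIXED and first_wanted are hypothetical names, and putting the prefix inside group 1 is my assumption about what you would want returned.

```python
import re

# Assumption: the optional prefix (?:[@+] ?)? sits inside group 1, before the
# inner alternation, so it applies to both dashed/spaced and plain digit runs.
PREFIXED = re.compile(
    r"(?<!\S)(?:\$\s*\d+(?:,\d+)?"
    r"|((?:[@+] ?)?(?:\d+(?:[ -]\d+)+\.?|\d{3,})))(?!\S)"
)

def first_wanted(text):
    """Return group 1 of the first match that captured something, else None."""
    for m in PREFIXED.finditer(text):
        if m.group(1) is not None:
            return m.group(1)
    return None

# match() only tries at the start of the string; search() scans the whole string.
print(PREFIXED.match("id +123-456-565"))             # None
print(PREFIXED.search("id +123-456-565").group(1))   # +123-456-565

# A dollar amount still produces a match object, but group 1 is None,
# so test group(1) rather than the match object itself.
print(PREFIXED.search("$123654").group(1))           # None
print(first_wanted("$123654 then @ 123456"))         # @ 123456
```

This is why a bare `if re.match(...)` check can be misleading here: the match object is truthy for the dollar case as well, and only `.group(1)` tells you whether a wanted number was found.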