603

I would like to extract all the numbers contained in a string. Which is better suited for the purpose, regular expressions or the isdigit() method?

Example:

line = "hello 12 hi 89"

Result:

[12, 89]
pablouche
  • 6,081
  • 3
  • 15
  • 6
  • 4
    Unfortunately the sample input data was so simplistic that it invited naive solutions. Common cases should handle input strings with more interesting characters adjacent to the digits. A slightly more challenging input: `'''gimme digits from "12", 34, '56', -789.'''` – MarkHu Aug 20 '20 at 19:27

18 Answers

648

If you want to extract only positive integers, try the following:

>>> txt = "h3110 23 cat 444.4 rabbit 11 2 dog"
>>> [int(s) for s in txt.split() if s.isdigit()]
[23, 11, 2]

I would argue that this is better than the regex example because you don't need another module and it's more readable because you don't need to parse (and learn) the regex mini-language.

This will not recognize floats, negative integers, or integers in hexadecimal format. If you can't accept these limitations, jmnas's answer below will do the trick.
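To make the trade-off concrete, here is a minimal sketch comparing the two approaches on input where punctuation sits next to a digit. The helper names `ints_by_split` and `ints_by_regex` are made up for illustration and are not part of the answer above.

```python
import re

def ints_by_split(text):
    # Whitespace-separated tokens that consist solely of digits.
    return [int(tok) for tok in text.split() if tok.isdigit()]

def ints_by_regex(text):
    # Every maximal run of digits, wherever it occurs in the text.
    return [int(run) for run in re.findall(r"\d+", text)]

sample = "i am 18, but I can't drive 2 cars"
print(ints_by_split(sample))  # [2]  ("18," is not all digits)
print(ints_by_regex(sample))  # [18, 2]
```

Note that the regex version also picks up digits embedded in words (such as the `3110` in `h3110`), which may or may not be what you want.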

GianAnge
  • 582
  • 2
  • 11
fmark
  • 54,196
  • 25
  • 97
  • 106
  • @Chris Morgan True. Not an apples to apples comparison with the regular expression though. I've changed the answer, but not the timings. – fmark Nov 27 '10 at 02:25
  • 10
    this will fail for case like "h3110 23 cat 444.4 rabbit 11-2 dog" – sharafjaffri Dec 04 '13 at 08:15
  • 11
    The normative case is using `re`. It is a general and powerful tool (so you learn something very useful). Speed is somewhat irrelevant in log parsing (it's not some intensive numerical solver after all), the `re` module is in the standard Python library and it doesn't hurt to load it. – 0 _ Apr 22 '14 at 07:27
  • Replace `int(s)` with `int(s.replace(',', ''))` and `s.isdigit()` with `s[0].isdigit()` and it'll handle numbers even if they have a comma as a thousands separator in it. This has a drawback that it'll fail on the `444.4` in your example, but if you're trying to handle such complicated input, maybe you'd be better off with a dedicated function rather than a one line thing like this. – ArtOfWarfare Feb 13 '15 at 18:23
  • 45
    I had strings like ``mumblejumble45mumblejumble`` in which I knew that there was only one number. The solution is simply ``int(filter(str.isdigit, your_string))``. – Jonas Lindeløv Aug 20 '15 at 09:57
  • 2
    A minor comment: you define the variable ``str`` which then overrides the ``str`` object and method in base python. That's not good practice since you might need it later in the script. – Jonas Lindeløv Aug 20 '15 at 09:58
  • 1
    Both python2 (2.7.10) and python3 (3.4.2) are almost two orders of magnitude faster here doing the `re` version. What version of python were you doing these tests with?! – Karl P Jul 12 '16 at 16:12
  • @Jonas, what did you mean by " I knew that there was only one number"? Your suggested solution also works for situation with more than one number in a string, e.g. **'plant_16_day_9_hour_9_label_fmp.png'**, gives: 1699. or could I be missing something? – Gathide Nov 19 '16 at 08:52
  • 1
    @Gathide, if this is your desired behavior, then perfect. But it's not necessarily and it wouldn't elicit a warning. I see that dfostic later posted a full answer with this strategy, also highlighting this behavior with a warning. – Jonas Lindeløv Nov 20 '16 at 20:42
  • 45
    `int(filter(...))` will raise `TypeError: int() argument must be a string...` for Python 3.5, so you can use updated version: `int(''.join(filter(str.isdigit, your_string)))` for extracting all digits to one integer. – Mark Mishyn Mar 21 '17 at 07:51
  • doesn't work: ``` >>> str = 'issue_date[200]' >>> [int(s) for s in str.split() if s.isdigit()] [] ``` – Julio Marins Apr 26 '17 at 23:19
  • "i am 18, but I can't drive" doesn't work because of the ,. Doesn't work on many corner cases. – thang Jun 22 '17 at 22:01
  • I run your benchmark and obtain totally different result: `python -m timeit -s "str = 'h3110 23 cat 444.4 rabbit 11 2 dog' * 1000" "[s for s in str.split() if s.isdigit()]" 1000 loops, best of 3: 595 usec per loop` and `python -m timeit -s "import re" "str = 'h3110 23 cat 444.4 rabbit 11 2 dog' * 1000" "re.findall('\\b\\d+\\b', str)" 100000 loops, best of 3: 12 usec per loop` – d21d3q Jul 27 '17 at 10:50
  • 1
    This is fine, but for cases where a number is NOT surrounded by spaces the solution (probably involving `filter(str.isdigit, string)`) becomes over-complicated. The smoothest way by far would be regex. Regex should be a part of every developer's arsenal. They're intimidating but easy to understand; it's counterproductive to avoid them too much. And the Pythonic way wouldn't be the fastest, it would be the cleanest and most understandable. – Agustín Lado Aug 15 '18 at 15:56
  • It won't handle a comma directly after the number, as in "220 me xynz 345, 44 k" – KeshV Oct 27 '18 at 02:22
  • What if I wanted to find the indices of the digits in a string? – HackersInside Feb 22 '19 at 10:08
  • 1
    A variation on Jonas Lindelov's in the comments above, to get only the first number word: `firstNumberWord = next(filter(str.isdigit, myString.split()))` (In my case it was a sentence-like string, so the `.split()` separates it in words) – R. Navega Mar 19 '19 at 04:11
630

I'd use a regexp:

>>> import re
>>> re.findall(r'\d+', "hello 42 I'm a 32 string 30")
['42', '32', '30']

This would also match 42 from bla42bla. If you only want numbers delimited by word boundaries (space, period, comma), you can use \b:

>>> re.findall(r'\b\d+\b', "he33llo 42 I'm a 32 string 30")
['42', '32', '30']

To end up with a list of numbers instead of a list of strings:

>>> [int(s) for s in re.findall(r'\b\d+\b', "he33llo 42 I'm a 32 string 30")]
[42, 32, 30]
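If signed numbers and decimals matter as well, the pattern can be widened. A rough sketch, assuming a number is an optional sign, optional integer digits, an optional decimal point, and at least one trailing digit. The name `SIGNED_NUMBER` is hypothetical, and a hyphen used as a separator (as in `11-2`) will be read as a minus sign:

```python
import re

# Optional sign, optional integer part, optional decimal point, then digits.
SIGNED_NUMBER = re.compile(r"[-+]?\d*\.?\d+")

text = "hello1.45 hi -30 and 7"
found = [float(m) for m in SIGNED_NUMBER.findall(text)]
print(found)  # [1.45, -30.0, 7.0]
```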
Gulzar
  • 17,272
  • 18
  • 86
  • 144
Vincent Savard
  • 32,695
  • 10
  • 65
  • 72
  • 9
    ... and then map `int` over it and you're done. +1 especially for the latter part. I'd suggest raw strings (`r'\b\d+\b' == '\\b\\d+\\b'`) though. –  Nov 27 '10 at 00:06
  • 6
    It could be put in a list with a generator, such as: `int_list = [int(s) for s in re.findall('\\d+', 'hello 12 hi 89')]` – GreenMatt Nov 27 '10 at 00:19
  • 7
    @GreenMatt: that is technically a list comprehension (not a generator), but I would agree that comprehensions/generators are more Pythonic than `map`. – Seth Johnson Nov 27 '10 at 01:23
  • 1
    @Seth Johnson: Oops! You're right, I mistyped in what was apparently a fogged state of mind. :-( Thanks for the correction! – GreenMatt Nov 28 '10 at 14:57
  • re.findall(r'\d+', 'hello 42 I\'m a 32 string 30') gives me numbers in a list not strings. – denson Sep 07 '16 at 02:44
  • Or use `map` to convert the strings to integers as in `map(int, re.findall(r'\b\d+\b', 'he33llo 42 I\'m a 32 string 30'))` – David Arenburg Sep 04 '17 at 11:40
  • 3
    I have a problem though. What if I want to extract float numbers also like 1.45 in "hello1.45 hi". It will give me 1 and 45 as two different numbers – ab123 May 24 '18 at 05:17
104

This is more than a bit late, but you can extend the regex to account for scientific notation too.

import re

# Format is [(<string>, <expected output>), ...]
ss = [("apple-12.34 ba33na fanc-14.23e-2yapple+45e5+67.56E+3",
       ['-12.34', '33', '-14.23e-2', '+45e5', '+67.56E+3']),
      ('hello X42 I\'m a Y-32.35 string Z30',
       ['42', '-32.35', '30']),
      ('he33llo 42 I\'m a 32 string -30', 
       ['33', '42', '32', '-30']),
      ('h3110 23 cat 444.4 rabbit 11 2 dog', 
       ['3110', '23', '444.4', '11', '2']),
      ('hello 12 hi 89', 
       ['12', '89']),
      ('4', 
       ['4']),
      ('I like 74,600 commas not,500', 
       ['74,600', '500']),
      ('I like bad math 1+2=.001', 
       ['1', '+2', '.001'])]

for s, r in ss:
    # Raw string, so the backslashes reach the regex engine unchanged.
    rr = re.findall(r"[-+]?[.]?[\d]+(?:,\d\d\d)*[\.]?\d*(?:[eE][-+]?\d+)?", s)
    if rr == r:
        print('GOOD')
    else:
        print('WRONG', rr, 'should be', r)

Gives all good!
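Since the pattern is hard to read on one line, it may help to see it spelled out with `re.VERBOSE`, which lets the regex carry whitespace and comments. This is only a sketch of the same pattern with simplified character classes; the name `NUMBER` is an assumption for illustration.

```python
import re

NUMBER = re.compile(r"""
    [-+]?              # optional sign
    \.?                # optional leading decimal point, as in .001
    \d+                # one or more digits
    (?:,\d\d\d)*       # optional thousands groups, as in 74,600
    \.?                # optional decimal point
    \d*                # optional fractional digits
    (?:[eE][-+]?\d+)?  # optional exponent, as in e-2 or E+3
""", re.VERBOSE)

print(NUMBER.findall('I like 74,600 commas not,500'))  # ['74,600', '500']
print(NUMBER.findall('I like bad math 1+2=.001'))      # ['1', '+2', '.001']
```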

Additionally, you can look at the AWS Glue built-in regex

ignorance
  • 5,011
  • 5
  • 40
  • 78
  • 1
    As this is the only answer anyone likes, here is how to do it with Scientific notation "[-+]?\d+[\.]?\d*[Ee]?\d*". Or some variation. Have fun! – ignorance Nov 06 '15 at 15:12
  • Find there is an issue with the simplest case eg `s = "4"` returns no matches. Can re be edited to also take care of this? – batFINGER Oct 10 '16 at 13:03
  • 1
    nice but it doesn't handle commas (e.g. 74,600) – yekta Oct 11 '16 at 14:54
  • A more verbose group is `[+-]?\d*[\.]?\d*(?:(?:[eE])[+-]?\d+)?` This group does give some false positives (i.e. `+` is captured by itself sometimes), but is able to handle more forms, like `.001`, plus it doesn't combine numbers automatically (like in `s=2+1`) – DavisDude Mar 16 '17 at 16:34
  • @yekta Fixed it for you. – ignorance Aug 10 '17 at 19:33
  • @DavisDude Sort of fixed it for you. "1+2" -> ['1', '+2'], but that ought to be sufficient for most uses – ignorance Aug 10 '17 at 19:38
  • 47
    Ah yes, the obvious `[-+]?[.]?[\d]+(?:,\d\d\d)*[\.]?\d*(?:[eE][-+]?\d+)?` - so silly of me... how could I not think of that? – Przemek D Oct 04 '17 at 11:52
84

I'm assuming you want floats not just integers so I'd do something like this:

l = []
for t in s.split():
    try:
        l.append(float(t))
    except ValueError:
        pass

Note that some of the other solutions posted here don't work with negative numbers:

>>> re.findall(r'\b\d+\b', 'he33llo 42 I\'m a 32 string -30')
['42', '32', '30']

>>> '-3'.isdigit()
False
jmnas
  • 1,054
  • 6
  • 4
  • This finds positive and negative floats and integers. For just positive and negative integers, change `float` to `int`. – Hugo Jun 02 '15 at 12:34
  • 5
    For negative numbers: `re.findall("[-\d]+", "1 -2")` – rassa45 Sep 15 '15 at 19:03
  • Does it make any difference if we write `continue` instead of `pass` in the loop? – D. Jones Aug 15 '16 at 10:48
  • This catches more than just positive integers, but using split() will miss numbers that have currency symbols preceding the first digit with no space, which is common in financial documents – Marc Maxmeister Jun 02 '17 at 13:12
  • 1
    Does not work for floats that have no space with other characters, example : '4.5 k things' will work, '4.5k things' won't. – Jay D. Jun 21 '18 at 18:01
  • The split/append approach is elegant. Any concise way for it to pick up floats only i.e. number must have a decimal point? – John Curry Oct 06 '20 at 19:14
  • The proposed regex does not work for negative numbers – lalebarde Feb 06 '22 at 09:03
82

If you know there will be only one number in the string, e.g. 'hello 12 hi', you can try filter.

For example:

In [1]: int(''.join(filter(str.isdigit, '200 grams')))
Out[1]: 200
In [2]: int(''.join(filter(str.isdigit, 'Counters: 55')))
Out[2]: 55
In [3]: int(''.join(filter(str.isdigit, 'more than 23 times')))
Out[3]: 23

But be careful:

In [4]: int(''.join(filter(str.isdigit, '200 grams 5')))
Out[4]: 2005
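If the digits of separate numbers must not be glued together, one alternative is to take only the first run of digits. A minimal sketch using `re.search`; the helper name `first_int` is hypothetical:

```python
import re

def first_int(text):
    # Return the first run of digits as an int, or None if there is none.
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None

print(first_int("200 grams 5"))  # 200
print(first_int("no digits"))    # None
```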
MendelG
  • 8,523
  • 3
  • 16
  • 34
dfostic
  • 1,428
  • 13
  • 8
  • 15
    In Python 3.6.3 I got `TypeError: int() argument must be a string, a bytes-like object or a number, not 'filter'` - fixing it by using `int("".join(filter(str.isdigit, '200 grams')))` – Kent Munthe Caspersen Apr 09 '18 at 08:56
  • 1
    This is a good approach but it does not work in cases where we have floating point numbers. like `6.00` it gives six-hundred as answer `600` – Muneeb Ahmad Khurram Nov 09 '21 at 12:21
29

I was looking for a way to strip formatting masks from strings, specifically Brazilian phone numbers. This post did not answer that directly, but it inspired my solution:

>>> phone_number = '+55(11)8715-9877'
>>> ''.join([n for n in phone_number if n.isdigit()])
'551187159877'
Sidon
  • 1,187
  • 2
  • 10
  • 24
  • Nice and simple, and arguably more readable than the also-correct-but-less-well-known `filter()` function technique: `''.join(filter(str.isdigit, phone_number))` – MarkHu Aug 20 '20 at 19:43
  • 2
    Nice, but converting to list is unnecessary. It can be slightly improved as `''.join(n for n in phone_number if n.isdigit())`. – AnT Jul 21 '21 at 03:46
23
# extract numbers from garbage string:
s = '12//n,_@#$%3.14kjlw0xdadfackvj1.6e-19&*ghn334'
newstr = ''.join((ch if ch in '0123456789.-e' else ' ') for ch in s)
listOfNumbers = [float(i) for i in newstr.split()]
print(listOfNumbers)
[12.0, 3.14, 0.0, 1.6e-19, 334.0]
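One caveat: because `e` and `-` are kept, ordinary words leave stray fragments that `float()` rejects. A sketch of a more forgiving variant that skips such fragments, assuming silently dropping them is acceptable; the function name `floats_from_noise` is made up for illustration:

```python
def floats_from_noise(s):
    # Same character whitelist as above; everything else becomes a space.
    cleaned = ''.join(ch if ch in '0123456789.-e' else ' ' for ch in s)
    numbers = []
    for token in cleaned.split():
        try:
            numbers.append(float(token))
        except ValueError:
            pass  # stray fragments such as 'e', '-' or '.'
    return numbers

print(floats_from_noise('hello 1.6e-19 there'))  # [1.6e-19]
```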
AndreiS
  • 264
  • 2
  • 2
  • 3
    Welcome to SO and thanks for posting an answer. It's always good practice to add some additional comments to your answer and why it solves the problem, rather than just posting a code snippet. – sebs Mar 29 '18 at 13:48
  • didnt work in my case. not much different from the answer above – oldboy Jul 06 '18 at 03:43
  • ValueError: could not convert string to float: 'e' and it doesn't work in some cases :( – Vilq Sep 06 '19 at 11:27
23

To catch different patterns it is helpful to query with different patterns.

Set up the patterns that catch the different number formats of interest:

(finds commas) 12,300 or 12,300.00

'[\d]+[.,\d]+'

(finds floats) 0.123 or .123

'[\d]*[.][\d]+'

(finds integers) 123

'[\d]+'

Combine with pipe ( | ) into one pattern with multiple or conditionals.

(Note: Put complex patterns first else simple patterns will return chunks of the complex catch instead of the complex catch returning the full catch).

p = r'[\d]+[.,\d]+|[\d]*[.][\d]+|[\d]+'

Below, we confirm that the pattern is present with re.search(), then iterate over the matches with re.finditer(). Each catch is a match object, and indexing it with [0] returns the full matched text.

import re

s = 'he33llo 42 I\'m a 32 string 30 444.4 12,001'

if re.search(p, s) is not None:
    for catch in re.finditer(p, s):
        print(catch[0]) # catch is a match object

Returns:

33
42
32
30
444.4
12,001
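The catches are still strings. A possible follow-up step, sketched under the assumption that commas are thousands separators and that unparseable catches (for example `1.2.3`) can be skipped; the names `PATTERN` and `catches_as_floats` are hypothetical:

```python
import re

# The combined pattern from above, written as a raw string.
PATTERN = re.compile(r'[\d]+[.,\d]+|[\d]*[.][\d]+|[\d]+')

def catches_as_floats(text):
    values = []
    for catch in PATTERN.finditer(text):
        cleaned = catch[0].replace(',', '')  # drop thousands separators
        try:
            values.append(float(cleaned))
        except ValueError:
            pass  # not a single well-formed number
    return values

print(catches_as_floats("he33llo 42 I'm a 32 string 30 444.4 12,001"))
# [33.0, 42.0, 32.0, 30.0, 444.4, 12001.0]
```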
Community
  • 1
  • 1
  • This will also accept a number ending with a dot, like "30." You need something like that: "[\d]+[\,\d]*[\.]{0,1}[\d]+" – katamayros Jan 27 '21 at 12:16
21

Using a regex as below is one way:

import re

lines = "hello 12 hi 89"
output = []
# Match tokens made up entirely of digits.
# (To also accept decimals, a pattern such as r'\d+\.?\d*' could be used.)
repl_str = re.compile(r'^\d+$')
for word in lines.split():
    match = repl_str.search(word)
    if match:
        output.append(float(match.group()))
print(output)

With findall:

re.findall(r'\d+', "hello 12 hi 89")

['12', '89']

re.findall(r'\b\d+\b', "hello 12 hi 89 33F AC 777")

['12', '89', '777']
Tiago Martins Peres
  • 12,598
  • 15
  • 77
  • 116
  • You should at least compile the regex if you're not using `findall()` – information_interchange Oct 18 '19 at 03:21
  • 2
    `repl_str = re.compile('\d+.?\d*')` should be: `repl_str = re.compile('\d+\.?\d*')` For a reproducible example using python3.7 `re.search(re.compile(r'\d+.?\d*'), "42G").group()` '42G' `re.search(re.compile(r'\d+\.?\d*'), "42G").group()` '42' – Alexis Lucattini Nov 10 '19 at 05:47
20

For phone numbers you can simply exclude all non-digit characters with \D in regex:

import re

phone_number = "(619) 459-3635"
phone_number = re.sub(r"\D", "", phone_number)
print(phone_number)

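The r in r"\D" marks a raw string, which keeps the backslash as a literal character. Without it, Python processes backslash escapes in the string first; \D is not a recognized escape, so the pattern still works, but recent Python versions warn about the invalid escape sequence.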
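A small sketch of what the raw-string prefix changes, using only built-in string behaviour:

```python
import re

# A raw string keeps the backslash as a literal character.
print(r"\D" == "\\D")         # True
print(len(r"\n"), len("\n"))  # 2 1

# Stripping every non-digit character from a phone number.
print(re.sub(r"\D", "", "(619) 459-3635"))  # 6194593635
```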

Francisco Puga
  • 22,373
  • 4
  • 48
  • 60
Antonin GAVREL
  • 7,991
  • 4
  • 44
  • 60
9
import re

line2 = "hello 12 hi 89"  # this is the given string
temp1 = re.findall(r'\d+', line2)  # find every run of digits with a regular expression
res2 = list(map(int, temp1))  # convert the matched strings to integers
print(res2)

Hi,

You can find all the integers in the string with re.findall, which returns every run of digits.

The second step converts those digit strings to integers and stores them in the list res2.

hope this helps

Regards, Diwakar Sharma

Diwakar SHARMA
  • 535
  • 7
  • 22
  • 2
    The provided answer was flagged for review as a Low Quality Post. Here are some guidelines for [How do I write a good answer?](https://stackoverflow.com/help/how-to-answer). This provided answer may be correct, but it could benefit from an explanation. Code only answers are not considered "good" answers. From [review](https://stackoverflow.com/review). – Trenton McKinney Oct 06 '19 at 00:36
7

This answer also covers the case where the number in the string is a float.

def get_first_nbr_from_str(input_str):
    '''
    :param input_str: strings that contains digit and words
    :return: the number extracted from the input_str
    demo:
    'ab324.23.123xyz': 324.23
    '.5abc44': 0.5
    '''
    if not input_str or not isinstance(input_str, str):
        return 0
    out_number = ''
    for ele in input_str:
        if (ele == '.' and '.' not in out_number) or ele.isdigit():
            out_number += ele
        elif out_number:
            break
    return float(out_number)
Menglong Li
  • 2,015
  • 13
  • 18
7

I am just adding this answer because no one added one using Exception handling and because this also works for floats

a = []
line = "abcd 1234 efgh 56.78 ij"
for word in line.split():
    try:
        a.append(float(word))
    except ValueError:
        pass
print(a)

Output :

[1234.0, 56.78]
Raghav
  • 183
  • 1
  • 2
  • 10
5

I am amazed to see that no one has yet mentioned the usage of itertools.groupby as an alternative to achieve this.

You may use itertools.groupby() along with str.isdigit() in order to extract numbers from string as:

from itertools import groupby
my_str = "hello 12 hi 89"

l = [int(''.join(i)) for is_digit, i in groupby(my_str, str.isdigit) if is_digit]

The value held by l will be:

[12, 89]

PS: This is just to illustrate that groupby can be used as an alternative. It is not the recommended solution; for that, use fmark's accepted answer, which filters with str.isdigit in a list comprehension.
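To see why this works, it can help to look at the groups that `groupby` produces before filtering. A short sketch; the variable name `runs` is only for illustration:

```python
from itertools import groupby

my_str = "hello 12 hi 89"

# Consecutive characters are grouped by the result of str.isdigit.
runs = [(is_digit, ''.join(chars)) for is_digit, chars in groupby(my_str, str.isdigit)]
print(runs)
# [(False, 'hello '), (True, '12'), (False, ' hi '), (True, '89')]
```

Keeping only the groups whose key is True leaves the digit runs, which are then converted with `int`.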

Moinuddin Quadri
  • 43,657
  • 11
  • 92
  • 117
3

The cleanest way I found:

>>> data = 'hs122 125 &55,58, 25'
>>> new_data = ''.join((ch if ch in '0123456789.-e' else ' ') for ch in data)
>>> numbers = [i for i in new_data.split()]
>>> print(numbers)
['122', '125', '55', '58', '25']

or this:

>>> import re
>>> data = 'hs122 125 &55,58, 25'
>>> numbers = re.findall(r'\d+', data)
>>> print(numbers)
['122', '125', '55', '58', '25']
2

@jmnas, I liked your answer, but it didn't find floats. I'm working on a script to parse code going to a CNC mill and needed to find both X and Y dimensions that can be integers or floats, so I adapted your code to the following. This finds int, float with positive and negative vals. Still doesn't find hex formatted values but you could add "x" and "A" through "F" to the num_char tuple and I think it would parse things like '0x23AC'.

s = 'hello X42 I\'m a Y-32.35 string Z30'
xy = ("X", "Y")
num_char = (".", "+", "-")

l = []

tokens = s.split()
for token in tokens:

    if token.startswith(xy):
        num = ""
        for char in token:
            # print(char)
            if char.isdigit() or (char in num_char):
                num = num + char

        try:
            l.append(float(num))
        except ValueError:
            pass

print(l)
ZacSketches
  • 353
  • 1
  • 5
  • 13
2

Since none of these dealt with the real-world financial numbers in Excel and Word docs that I needed to find, here is my variation. It handles ints, floats, negative numbers, and currency numbers (because it doesn't rely on split), and has the option to drop the decimal part and return just ints, or return everything.

It also handles the Indian lakh numbering system, where commas appear irregularly rather than every 3 digits.

It does not handle scientific notation, and negative numbers written in parentheses (as in budgets) will come out positive.

It also does not extract dates. There are better ways for finding dates in strings.

import re

def find_numbers(string, ints=True):
    # Optional minus in front, digits with optional commas, optional decimal part.
    numexp = re.compile(r'-?\d[\d,]*\.?\d*')
    numbers = [x.replace(',', '') for x in numexp.findall(string)]
    if ints:
        # Drop the decimal part and return ints.
        return [int(x.split('.')[0]) for x in numbers]
    return numbers
Marc Maxmeister
  • 3,607
  • 2
  • 33
  • 47
0

The best option I found is below. It will extract a number and can eliminate any type of char.

def extract_nbr(input_str):
    if input_str is None or input_str == '':
        return 0

    out_number = ''
    for ele in input_str:
        if ele.isdigit():
            out_number += ele
    # Avoid float('') raising ValueError when the input contains no digits.
    return float(out_number) if out_number else 0
Alex M
  • 2,610
  • 7
  • 26
  • 33
Ajay Kumar
  • 326
  • 2
  • 2