11

Is there an efficient way to strip numbers out of a string in Python, using NLTK or base Python?

Thanks, Ben

ben890

4 Answers

37

Yes, you can use a regular expression for this:

import re
output = re.sub(r'\d+', '', '123hello 456world')
print(output)  # 'hello world'
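
If the same substitution is applied to many strings, the pattern can be compiled once and reused. This is a minimal sketch, assuming Python 3; `DIGITS` and `strip_digits` are illustrative names, not part of the answer. In Python 3, `\d` in a `str` pattern also matches non-ASCII decimal digits, so `re.ASCII` is passed here on the assumption that only 0-9 should be removed.

```python
import re

# Compile once; re.ASCII limits \d to the characters 0-9 (Python 3).
DIGITS = re.compile(r'\d+', flags=re.ASCII)

def strip_digits(text):
    """Return text with every run of ASCII digits removed."""
    return DIGITS.sub('', text)

print(strip_digits('123hello 456world'))  # hello world
```

Dropping the `flags` argument restores the default behaviour, which also removes digits from other scripts.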
Martin Konecny
14

str.translate should be efficient (the two-argument form shown here is Python 2 only).

In [7]: 'hello467'.translate(None, '0123456789')
Out[7]: 'hello'

To compare str.translate against re.sub:

In [13]: %%timeit r=re.compile(r'\d')
output = r.sub('', my_str)
   ....: 
100000 loops, best of 3: 5.46 µs per loop

In [16]: %%timeit pass
output = my_str.translate(None, '0123456789')
   ....: 
1000000 loops, best of 3: 713 ns per loop
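
The two-argument call above only works on Python 2 byte strings. On Python 3, a roughly equivalent sketch builds a deletion table with `str.maketrans`; `REMOVE_DIGITS` is an illustrative name, and no timing is claimed for this version.

```python
import string

# The third argument of str.maketrans lists characters to delete (Python 3).
REMOVE_DIGITS = str.maketrans('', '', string.digits)

output = 'hello467'.translate(REMOVE_DIGITS)
print(output)  # hello
```

Building the table once and reusing it across strings avoids repeating the setup work on every call.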
Robᵩ
  • The problem is: `str.translate` is a bit difficult to make both 2.x/3.x compatible :( – Jon Clements May 19 '15 at 01:02
  • 6
    So you'd need `my_str.translate({ord(ch): None for ch in '0123456789'})` in 3.x – Jon Clements May 19 '15 at 01:05
  • I wonder how long r.sub() takes? Say, under conditions where you want to do this over multiple strings and you've pre-compiled the regex. – Ross May 19 '15 at 01:26
  • @Ross - Judging from the code I put in my answer, 5.46µs. – Robᵩ May 19 '15 at 01:28
  • 2
    @Rob - Ah right, I missed that the first line is the set up line. Looking at some best/worst cases translate seems to perform much better at worst case scenarios. Using 'python -m timeit' I came across the following in favour of translate; `'123hello 456world' - x5.0` `'1234567890987654321012345678909876543210' - x17.0` `'5a$%&^@)9lhk45g08j%Gmj3g09jSDGjg0034k' - x9.0` `'hello world im your boss' - x 1.8` – Ross May 19 '15 at 01:48
6

Try re.

import re
my_str = '123hello 456world'
output = re.sub('[0-9]+', '', my_str)
optimcode
  • 3
    You realize you just posted a duplicate answer right? – Ross May 19 '15 at 01:22
  • 1
    Actually no... I gave a different way and then used the example (my_str = '123hello 456world') to illustrate it; see my edits – optimcode May 19 '15 at 01:48
  • this is not a different way - you simply used the long form of the accepted answer - `\d` corresponds to `[0-9]` – xeruf Nov 10 '21 at 12:51
1

Here's a method using str.join(), str.isnumeric(), and a generator expression which will work in 3.x:

>>> my_str = '123Hello, World!4567'
>>> output = ''.join(c for c in my_str if not c.isnumeric())
>>> print(output)
Hello, World!
>>> 

This will also work in 2.x, if you use a unicode string:

>>> my_str = u'123Hello, World!4567'
>>> output = ''.join(c for c in my_str if not c.isnumeric())
>>> print(output)
Hello, World!
>>> 
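
One caveat with `str.isnumeric()`: it is true for any character with a Unicode numeric value, not only 0-9, so fractions such as '½' and CJK numerals are removed as well. Here is a small Python 3 sketch of the difference; `sample`, `without_numeric` and `ascii_only` are illustrative names.

```python
sample = 'a1½b五2'

# isnumeric() drops every character that has a Unicode numeric value.
without_numeric = ''.join(c for c in sample if not c.isnumeric())
print(without_numeric)  # ab

# Filtering on an explicit set removes only the ASCII digits 0-9.
ascii_only = ''.join(c for c in sample if c not in '0123456789')
print(ascii_only)  # a½b五
```

Which behaviour is wanted depends on the input; for plain ASCII text the two give the same result.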

Hmm. Throw in a paperclip and we'd have an episode of MacGyver.

Update

I know that this has been closed out as a duplicate, but here's a method that works for both Python 2 and Python 3:

>>> my_str = '123Hello, World!4567'
>>> output = ''.join(map(lambda c: '' if c in '0123456789' else c, my_str))
>>> print(output)
Hello, World!
>>>
Deacon