Removing non numeric characters from a string

Question

I have been given the task of removing all non-numeric characters, including spaces, from either a text file or a string, and then printing the new result next to the old characters. For example:

Before:

sd67637 8

After:

676378

As I am a beginner, I do not know where to start with this task. Please help.

Possible duplicate of [Remove characters except digits from string using Python?](https://stackoverflow.com/q/1450897/608639) — jww, Jun 30 '19 at 13:13
Try `user_input = "~1984-04/20_" ; dateCode = "".join(filter(str.isdigit,user_input)) ; print(dateCode)` --I got `19840420` — MarkHu, Nov 18 '21 at 20:31

score 104 · Accepted Answer · answered Jun 27 '13 at 07:52

The easiest way is with a regexp:

import re

a = 'lkdfhisoe78347834 (())&/&745  '
# Replace every character that is not an ASCII digit with an empty string
result = re.sub(r'[^0-9]', '', a)

print(result)
# 78347834745
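The choice between `[^0-9]`, `\D` and `str.isdigit()` only matters once the input contains non-ASCII digits. The following is a minimal sketch of that difference, assuming Python 3 `str` patterns; the sample string and variable names are made up for illustration.

```python
import re

# ASCII "1", ARABIC-INDIC DIGIT TWO (U+0662), SUPERSCRIPT THREE (U+00B3)
text = "a1b\u0662c\u00b3"

# Explicit ASCII range: keeps only 0-9
ascii_range = re.sub(r"[^0-9]", "", text)

# \D on a str pattern is Unicode-aware: decimal digits of other scripts survive
unicode_digits = re.sub(r"\D", "", text)

# re.ASCII restricts \d and \D to the ASCII digits again
ascii_flag = re.sub(r"\D", "", text, flags=re.ASCII)

# str.isdigit() is broader still: it also accepts digit-like characters
# such as superscripts, which int() cannot parse
isdigit_based = "".join(ch for ch in text if ch.isdigit())

print(ascii_range, unicode_digits, ascii_flag, isdigit_based)
```

If the result is going to be passed to `int()`, the ASCII-only variants are the safer choice, since a superscript digit passes `isdigit()` but is rejected by `int()`.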

answered Jun 27 '13 at 07:52 by mar

any way to leave decimals in? – Mark Jul 19 '17 at 20:54
Why not `[^\d]+`? – scry May 25 '19 at 23:36
@Mark This should work: `re.findall(r"[-+]?\d*\.\d+|\d+", "Over th44e same pe14.1riod of time, p-0.8rices also rose by 82.8p")`. That should extract floats and signed values as well. – Kiprono Elijah Koech Oct 16 '21 at 17:58
You can include decimals by adding them to the regex as `'[^0-9.]'` – Mxblsdl Dec 03 '21 at 17:05

score 27 · Answer 2 · edited Jun 27 '13 at 07:49

Loop over your string character by character and keep only the digits:

new_string = ''.join(ch for ch in your_string if ch.isdigit())

Or use a regex on your string (if at some point you wanted to treat non-contiguous groups separately)...

import re
s = 'sd67637 8' 
new_string = ''.join(re.findall(r'\d+', s))
# 676378

Then just print the original string and the result side by side:

print(s, '=', new_string)
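The question also mentions text files. As a rough sketch of how the pieces could fit together, the example below applies the same join-based filter to each line of a file-like object and prints the old text next to the new. The helper names `digits_only` and `before_after` are hypothetical, and `io.StringIO` stands in for a real file opened with `open(...)`.

```python
import io


def digits_only(text: str) -> str:
    """Keep only the characters for which str.isdigit() is true."""
    return "".join(ch for ch in text if ch.isdigit())


def before_after(lines):
    """Yield 'old = new' for every non-empty line of an iterable of lines."""
    for line in lines:
        old = line.rstrip("\n")
        if old:
            yield f"{old} = {digits_only(old)}"


# StringIO behaves like an open text file, which keeps the sketch self-contained
fake_file = io.StringIO("sd67637 8\nabc 12-34\n")
report = list(before_after(fake_file))

for row in report:
    print(row)
```

With a real file, the same generator can be fed the file object directly, since iterating over an open text file also yields one line at a time.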

edited Jun 27 '13 at 07:49 by jamylak
answered Jun 27 '13 at 07:20 by Jon Clements

This is nicer because it doesn't just apply to ascii – jamylak Jun 27 '13 at 07:50

score 10 · Answer 3 · edited Jun 20 '20 at 09:12

There is a builtin for this in Python 2:

string.translate(s, table[, deletechars])

Delete all characters from s that are in deletechars (if present), and then translate the characters using table, which must be a 256-character string giving the translation for each character value, indexed by its ordinal. If table is None, then only the character deletion step is performed.

>>> import string
>>> non_numeric_chars = ''.join(set(string.printable) - set(string.digits))
>>> non_numeric_chars = string.printable[10:]  # more efficient method (choose one)
>>> 'sd67637 8'.translate(None, non_numeric_chars)
'676378'

Or you could do it with no imports (but there is no reason for this):

>>> chars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
>>> 'sd67637 8'.translate(None, chars)
'676378'
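The `translate(None, chars)` form shown above only works on Python 2 byte strings. A minimal sketch of a Python 3 equivalent follows, assuming `str.maketrans` with two empty mapping arguments and a third argument listing the characters to delete; the variable names are illustrative only.

```python
import string

# string.printable begins with the ten ASCII digits, so [10:] is everything else
non_digits = string.printable[10:]

# Third argument of str.maketrans: characters that map to None (deleted)
delete_table = str.maketrans("", "", non_digits)

cleaned = "sd67637 8".translate(delete_table)
with_punctuation = "s.,d67637 8".translate(delete_table)

# Characters outside string.printable are not in the table and are left alone
non_ascii = "\u00e95".translate(delete_table)

print(cleaned, with_punctuation, non_ascii)
```

As the last line suggests, this approach only removes the characters that were listed up front, so input with accented letters or other scripts needs a different filter.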

edited Jun 20 '20 at 09:12 by Community
answered Jun 27 '13 at 07:36 by Inbar Rose

This should be the top answer. – akhan Jan 11 '17 at 07:44
Not really `>>> 's.,d67637 8'.translate(None, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ')` yields `'.,676378'` – Darth Kotik Jan 11 '17 at 08:47
@DarthKotik good point, OP didn't mention anything about special characters, but that is easy to solve. Check my edit. – Inbar Rose Jan 11 '17 at 08:57
@InbarRose it'll work, but as soon as you want to use some Cyrillic symbol or something Chinese it'll fail. This solution is good only as long as you know exactly what set of symbols might be present in your field, which is not really good. – Darth Kotik Jan 11 '17 at 09:05
@DarthKotik OP had no mention of special characters or encoding. Regardless, string.translate can solve all of those problems with the correct input. Much like every problem, it should be solved one step at a time. And in Agile development there is no need for premature optimization. The question was simple, the answer is simple. If you want to get into minutiae we will be here all day. – Inbar Rose Jan 11 '17 at 09:39
Not Python 3 compatible. Very outdated answer. – Megan Caithlyn Aug 29 '18 at 11:34
@InbarRose Please update answer for python 3 (https://stackoverflow.com/a/41708804/828885) – akhan Mar 04 '19 at 17:55

score 1 · Answer 4 · edited Jul 24 '19 at 10:54

You can use string.ascii_letters to strip the letters (this handles ASCII letters and spaces only, not punctuation):

from string import *

a = 'sd67637 8'
a = a.replace(' ', '')

for i in ascii_letters:
    a = a.replace(i, '')

To also remove a single quote ('), wrap it in double quotes: a = a.replace("'", "").

edited Jul 24 '19 at 10:54
answered Jun 27 '13 at 07:28 by Saullo G. P. Castro

What about a colon? – jtlz2 Jul 24 '19 at 10:46
@jtlz2 then you use `a = a.replace("'", "")`, note the colon within quotes – Saullo G. P. Castro Jul 24 '19 at 10:53
' is not a colon, it's a single quote. : is a colon. And this answer only replaces [a-z] (ignoring case). And finally, why import * from string if you are only using ascii_letters? – tbm Jun 08 '20 at 15:22

score 1 · Answer 5 · answered Nov 29 '21 at 22:01

I would not use RegEx for this. It is a lot slower!

Instead let's just use a simple for loop.

TL;DR

This function will get the job done fast...

def filter_non_digits(string: str) -> str:
    result = ''
    for char in string:
        if char in '1234567890':
            result += char
    return result

The Explanation

Let's create a very basic benchmark to test a few different methods that have been proposed. I will test three methods...

For loop method (my idea).
List Comprehension method from Jon Clements' answer.
RegEx method from Moradnejad's answer.

# filters.py

import re

# For loop method
def filter_non_digits_for(string: str) -> str:
    result = ''
    for char in string:
        if char in '1234567890':
            result += char
    return result 


# Comprehension method
def filter_non_digits_comp(s: str) -> str:
    return ''.join(ch for ch in s if ch.isdigit())


# RegEx method
def filter_non_digits_re(string: str) -> str:
    return re.sub(r'[^\d]', '', string)

Now that we have an implementation of each way of removing non-digits, let's benchmark each one.

Here is some very basic and rudimentary benchmark code. However, it will do the trick and give us a good comparison of how each method performs.

# tests.py

import time, platform
from filters import (
    filter_non_digits_re,
    filter_non_digits_comp,
    filter_non_digits_for,
)


def benchmark_func(func):
    start = time.time()
    # the "_" in the number just makes it more readable
    for i in range(100_000):
        func('afes098u98sfe')
    end = time.time()
    return (end-start)/100_000


def bench_all():
    print(f'# System ({platform.system()} {platform.machine()})')
    print(f'# Python {platform.python_version()}\n')

    tests = [
        filter_non_digits_re,
        filter_non_digits_comp,
        filter_non_digits_for,
    ]

    for t in tests:
        duration = benchmark_func(t)
        ns = round(duration * 1_000_000_000)
        print(f'{t.__name__.ljust(30)} {str(ns).rjust(6)} ns/op')


if __name__ == "__main__":
    bench_all()

Here is the output from the benchmark code.

# System (Windows AMD64)
# Python 3.9.8

filter_non_digits_re             2920 ns/op
filter_non_digits_comp           1280 ns/op
filter_non_digits_for             660 ns/op

As you can see, the filter_non_digits_for() function is more than four times faster than the RegEx method and about twice as fast as the comprehension method in this benchmark. Sometimes simple is best.
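Timings like these depend on the machine, the Python version and the input, so they are worth re-measuring locally. Below is a minimal sketch of the same comparison using the standard `timeit` module instead of a hand-written loop. The function names, the precompiled pattern and the iteration counts are assumptions chosen for illustration, not part of the benchmark above.

```python
import re
import timeit

_NON_DIGIT = re.compile(r"[^0-9]")  # compiled once, outside the timed call


def with_loop(s: str) -> str:
    result = ""
    for ch in s:
        if ch in "0123456789":
            result += ch
    return result


def with_join(s: str) -> str:
    return "".join(ch for ch in s if ch.isdigit())


def with_regex(s: str) -> str:
    return _NON_DIGIT.sub("", s)


def best_ns_per_call(func, sample="afes098u98sfe", number=2_000, repeat=3):
    # timeit.repeat returns one total time in seconds per repetition;
    # the minimum is the measurement least disturbed by other processes
    totals = timeit.repeat(lambda: func(sample), number=number, repeat=repeat)
    return min(totals) / number * 1e9


results = {f.__name__: best_ns_per_call(f) for f in (with_loop, with_join, with_regex)}
for name, ns in results.items():
    print(f"{name:<12} {ns:8.0f} ns/op")
```

The sample string here is very short. With longer inputs the relative ranking may change, so it is worth passing a `sample` that resembles the real data.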

score 0 · Answer 6 · answered Aug 05 '21 at 17:08

To extract Integers

Example: sd67637 8 ==> 676378

import re
def extract_int(x):
    return re.sub(r'[^\d]', '', x)

To extract a single float/int number (possible decimal separator)

Example: sd7512.sd23 ==> 7512.23

import re
def extract_single_float(x):
    # Inside a character class "|" is a literal, so it is left out here
    return re.sub(r'[^\d.]', '', x)

To extract multiple float/int numbers

Example: 123.2 xs12.28 4 ==> ['123.2', '12.28', '4']

import re
def extract_floats(x):
    # Try the decimal form first, then fall back to a plain integer
    return re.findall(r"\d+\.\d+|\d+", x)

score 0 · Answer 7 · answered Oct 16 '21 at 17:55

Adding to @Moradnejad's answer: you can use the following code to extract integer values, floats and even signed values.

import re
a = re.findall(r"[-+]?\d*\.\d+|\d+", "Over th44e same pe14.1riod of time, p-0.8rices also rose by 82.8p")

And then you can convert the list items to numeric data type effectively using map.

print(list(map(float, a)))

[44.0, 14.1, -0.8, 82.8]
