Normalize whitespace with Python

Question

I'm building a data extract using scrapy and want to normalize a raw string pulled out of an HTML document. Here's an example string:

  Sapphire RX460 OC  2/4GB

Notice the two runs of two spaces: one preceding the string and one between OC and 2.

Python provides str.strip(), as described in How do I trim whitespace with Python?, but that won't handle the two spaces between OC and 2, which I need collapsed into a single space.

I've tried using normalize-space() from XPath while extracting data with my scrapy Selector, and that works, but the assignment is verbose, with strong rightward drift:

product_title = product.css('h3').xpath('normalize-space((text()))').extract_first()

Is there an elegant way to normalize whitespace using Python? If not a one-liner, is there a way I can break the above line into something easier to read without throwing an indentation error, e.g.

product_title = product.css('h3')
    .xpath('normalize-space((text()))')
    .extract_first()

score 29 · Accepted Answer · answered Sep 30 '17 at 09:37

You can use:

" ".join(s.split())

where s is your string.
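As a rough illustration of why this works: str.split() with no arguments splits on any run of whitespace and drops leading and trailing whitespace, so joining the pieces with a single space normalizes the string. The helper name normalize_whitespace is only for illustration. The second half sketches how wrapping an expression in parentheses lets a method chain span several lines without an indentation error; the scrapy chain is shown as a comment because product is not defined here.

```python
def normalize_whitespace(s):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    # split() with no arguments splits on any whitespace run (spaces, tabs,
    # newlines) and discards empty leading/trailing pieces.
    return " ".join(s.split())


raw = "  Sapphire RX460 OC  2/4GB"
print(normalize_whitespace(raw))  # Sapphire RX460 OC 2/4GB

# Parentheses allow a chained call to continue across lines.
words = (
    raw
    .strip()
    .split()
)

# The same pattern applied to the selector chain from the question:
# product_title = (
#     product.css('h3')
#     .xpath('normalize-space((text()))')
#     .extract_first()
# )
```

Note that this also collapses tabs and newlines, which may or may not be what you want for multi-line text.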

Tom Karzes

score 4 · Answer 2 · answered Sep 30 '17 at 09:44

Rather than using a regex for this, the join/split approach is more efficient. Compare:

>>> import re, timeit
>>> timeit.Timer(lambda: ' '.join(' Sapphire RX460 OC  2/4GB'.split())).timeit()
0.7263979911804199

>>> def f():
...     # Note: the trailing .split() makes this return a list, not a string
...     return re.sub(" +", ' ', "  Sapphire RX460 OC  2/4GB").split()
...
>>> timeit.Timer(f).timeit()
4.163465976715088
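For a like-for-like comparison, here is a small sketch in which both functions return the same normalized string and the regex is precompiled. The names with_split and with_regex and the iteration count are arbitrary choices, and timings will vary by machine and Python version, so no expected numbers are shown.

```python
import re
import timeit

RAW = "  Sapphire RX460 OC  2/4GB"
_WS = re.compile(r"\s+")  # precompiled so compilation is not timed


def with_split(s=RAW):
    # Split on whitespace runs, rejoin with single spaces.
    return " ".join(s.split())


def with_regex(s=RAW):
    # Replace each whitespace run with one space, then trim the ends.
    return _WS.sub(" ", s).strip()


# Both approaches should produce the same result before being compared.
assert with_split() == with_regex()

for fn in (with_split, with_regex):
    elapsed = timeit.timeit(fn, number=100_000)
    print(f"{fn.__name__}: {elapsed:.3f}s for 100,000 calls")
```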

hd1

I will circle back on this answer as my extracts grow in size. Thank you!! – vhs Sep 30 '17 at 09:52
Pleasure's all mine. – hd1 Sep 30 '17 at 11:07

score 0 · Answer 3 · answered Sep 30 '17 at 09:25

You can use a function like the one below, which uses a regular expression to find runs of consecutive spaces and replace each with a single space:

import re

def clean_data(data):
    # Trim both ends, then collapse runs of two or more spaces into one
    return re.sub(" {2,}", " ", data.strip())

product_title = clean_data(product.css('h3::text').extract_first())

You can then extend clean_data however you like.
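One possible extension, as a sketch only: match any whitespace (\s+) instead of only spaces, and pass None through unchanged, since extract_first() returns None when the selector matches nothing. Whether None should be passed through or replaced with an empty string is an assumption about how your pipeline handles missing fields.

```python
import re


def clean_data(data):
    """Normalize whitespace in a scraped string; None is passed through."""
    if data is None:
        # Nothing was extracted, so there is nothing to clean.
        return None
    # Collapse any run of whitespace (spaces, tabs, newlines) to one space,
    # then trim the ends.
    return re.sub(r"\s+", " ", data).strip()


print(clean_data("  Sapphire RX460 OC  2/4GB"))  # Sapphire RX460 OC 2/4GB
```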

Tarun Lalwani

Not as elegant as what I was looking for, but points for extensibility. – vhs Sep 30 '17 at 09:39
