How to replace decoded Non-breakable space (nbsp)

Question

Assuming I have a string which is "a s d d" (where the spaces are non-breaking spaces) and htmlentities turns it into
"a&nbsp;s&nbsp;d&nbsp;d".

How to replace (using preg_replace) it without encoding it to entities?

I tried preg_replace('/[\xa0]/', '', $string);, but it's not working. I'm trying to remove those special characters from my string as I don't need them.

What are the possibilities beyond regexp?

Edit: The string I want to parse: http://pastebin.com/raw/7eNT9sZr
It is processed with preg_replace('/[\r\n]+/', "[##]", $text)
for a later implode("</p><p>", explode("[##]", $text)).

My question is not exactly "how" to do this (since I could encode entities, remove the entities I don't need and decode entities), but how to remove those characters with just str_replace or preg_replace.

`htmlentities` is prevention against xss. If you want to render in browser, the &nbsp will be evaluated as space only. If not then there is no use of the function — georoot, Nov 21 '16 at 16:13
@georoot htmlentities prevents bad HTML output (ie. it ensures that information is emitted, not data), XSS is just maliciously crafted bad data. — user2864740, Nov 21 '16 at 16:14
@user2864740 exactly my point. You use `htmlentities` if you want to render in browser in which case &nbsp doesn't make any difference. If you don't want to render in browser there is no use of the function — georoot, Nov 21 '16 at 16:16
@georoot The information in HTML of " " and "&nbsp;" is different. One is a space. One is a non-breaking space. Only a non-breaking space is encoded as "&nbsp;", not a normal space. — user2864740, Nov 21 '16 at 16:16
It's not for displaying, it's for storing in the database only. The only solution I can come up with atm is a htmlentities > str_replace > entities_decode chain — Grzegorz, Nov 21 '16 at 16:17
@Grzegorz Use SQL parameterized queries for "storing to the database". In any case the input data *already contains* a not-a-normal-space. — user2864740, Nov 21 '16 at 16:17
@Grzegorz what is the point of using perl expressions? Why not to use str_replace in this case? — Victor Rudkov, Nov 21 '16 at 16:18
http://pastebin.com/raw/7eNT9sZr Here is string I want to make into html. Replace multiple \r\n (which are divided by \xa0) and make pretty html. — Grzegorz, Nov 21 '16 at 16:20
I think he is looking for a way to remove the non-breaking spaces from the string WITHOUT turning them into HTML entities first. — simon, Nov 21 '16 at 16:22
In what encoding is your string? Is it UTF-8? If yes, I would say that non-breakable space is `0xc2a0` there. — David Ferenczy Rogožan, Nov 21 '16 at 16:22
It's utf8. Also tried `\xc2a0` nothing is working and I'm wondering **WHY**. I want to know how it works, not how to do this :) — Grzegorz, Nov 21 '16 at 16:24

David Ferenczy Rogožan · Accepted Answer · 2021-06-09T17:07:23.467

Problem Explanation

The reason why it's not working is that you are specifying the non-breaking space incorrectly.

In UTF-8, the non-breaking space (U+00A0) is encoded as the two-byte sequence 0xC2 0xA0 (194 and 160 in decimal). The pattern in the question specifies only 0xA0, so technically you're specifying only the second half of the character's code.
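
To make the byte-level picture concrete, here is a minimal sketch (assuming PHP 7 or later for the "\u{...}" escape; the variable names are only illustrative). It prints the bytes of the non-breaking space and shows what the two attempts from the question actually do.

```php
<?php
// U+00A0 (NO-BREAK SPACE), written with the PHP 7+ Unicode escape.
$nbsp = "\u{00A0}";
echo bin2hex($nbsp), "\n";   // c2a0 (two bytes)

$text = "a{$nbsp}s";

// Without the u modifier PCRE works on bytes, so [\xa0] strips only the
// second byte and leaves a stray 0xC2 behind (invalid UTF-8).
$halfRemoved = preg_replace('/[\xa0]/', '', $text);
echo bin2hex($halfRemoved), "\n";   // 61c273

// "\xc2a0" is the byte 0xC2 followed by the literal characters "a0",
// because a \x escape takes at most two hex digits.
echo bin2hex("\xc2a0"), "\n";       // c26130
```

So `\xa0` does match something, but only half of the character, and `\xc2a0` describes a three-byte string that never occurs in the text.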

A Bit of Theory

Legacy character encodings used a fixed number of bits to encode every character in their set. For example, the original ASCII encoding used 7 bits per character, and the extended ASCII encodings 8 bits.

UTF-8 is a variable-width character encoding: the number of bytes used to represent a character varies from one to four (8-bit) bytes (octets). The length depends on the character's code point rather than on measured frequency, as it would in Huffman coding: ASCII characters (U+0000 to U+007F) take one byte, U+0080 to U+07FF (which includes the non-breaking space, U+00A0) take two, and higher code points take three or four. Since ASCII characters dominate many texts, that helps reduce the data size of the average text.
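
The variable width can be observed directly with strlen, which counts bytes rather than characters. This is only an illustrative sketch, assuming PHP 7+ for the "\u{...}" escape; the sample characters are arbitrary picks from each length class.

```php
<?php
// One sample character per UTF-8 length class.
$samples = [
    'A'     => "A",          // U+0041, ASCII
    'nbsp'  => "\u{00A0}",   // U+00A0, non-breaking space
    'euro'  => "\u{20AC}",   // U+20AC, euro sign
    'emoji' => "\u{1F600}",  // U+1F600, outside the Basic Multilingual Plane
];

foreach ($samples as $name => $char) {
    // strlen() returns the number of bytes, bin2hex() shows them.
    printf("%-6s %d byte(s): %s\n", $name, strlen($char), bin2hex($char));
}
```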

Solution

You can replace all occurrences of the UTF-8 non-breaking space in text using a simple (and fast) str_replace or using a more flexible regular expression, depending on your needs:

// faster solution
$regular_spaces = str_replace("\xc2\xa0", ' ', $original_string);

// more flexible solution
$regular_spaces = preg_replace('/\xc2\xa0/', ' ', $original_string);
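
If the non-breaking space has to sit inside a character class next to other characters (the follow-up raised in the comments), one option is PCRE's UTF-8 mode. The following is a sketch, assuming the input is valid UTF-8 and PHP 7+; the sample string and variable names are made up for illustration. With the u modifier, \x{00A0} stands for the whole character, whereas a byte-mode class treats \xc2 and \xa0 as two independent bytes.

```php
<?php
// Sample: pound sign (c2 a3), "5", NBSP (c2 a0), "à" (c3 a0), control char 0x0F, "!".
$text = "\u{00A3}5\u{00A0}\u{00E0}\x0F!";

// Byte mode: the class matches 0xC2 and 0xA0 separately, so it also eats
// bytes that belong to the pound sign and to "à", leaving invalid UTF-8.
$broken = preg_replace('/[\x0E-\x1F\xc2\xa0]/', '', $text);
echo bin2hex($broken), "\n";   // a335c321

// UTF-8 mode: \x{00A0} is one code point, other characters stay intact.
$clean = preg_replace('/[\x0E-\x1F\x{00A0}]/u', '', $text);
echo bin2hex($clean), "\n";    // c2a335c3a021
```

Note that with the u modifier preg_replace returns null if the subject is not valid UTF-8, so the result is worth checking when the input encoding is uncertain.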

Notes

Note that with str_replace you have to enclose the search string in double quotes ("). str_replace doesn't understand the textual representation of character codes, so the escape sequences have to be converted into actual bytes first. PHP does that automatically: in double-quoted strings, escape sequences (e.g. \n for a newline, or \xc2 for a byte given by its hex code) are replaced by the actual characters (e.g. the byte 0x0A for \n) before the string value is used.

In contrast, the regular expression engine behind preg_replace understands the textual representation of character codes itself, so you don't need PHP to convert them and you can enclose the pattern in apostrophes (single quotes, ') in this case.
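
The difference between the quoting styles can be checked with a short sketch like this one (assuming PHP 7+ for the "\u{...}" escape; the variable names are illustrative only):

```php
<?php
$text = "a\u{00A0}b";

// Single quotes: PHP leaves the escapes alone, so str_replace searches for
// the eight literal characters \xc2\xa0 and finds nothing.
$unchanged = str_replace('\xc2\xa0', ' ', $text);

// Double quotes: PHP turns the escapes into the two bytes 0xC2 0xA0.
$replaced = str_replace("\xc2\xa0", ' ', $text);

// preg_replace: the pattern reaches PCRE unconverted, and PCRE itself
// interprets \xc2\xa0 as those two bytes.
$viaRegex = preg_replace('/\xc2\xa0/', ' ', $text);

var_dump($unchanged === $text);  // bool(true)
var_dump($replaced);             // string(3) "a b"
var_dump($viaRegex);             // string(3) "a b"
```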

Note that `str_replace()` will work as well and is much faster. — simon, Nov 21 '16 at 16:35
I had no idea I have to write `\xc2\xa0` and wrote `\xc2a0`... my fail. Thank you! — Grzegorz, Nov 21 '16 at 16:44
Maybe could you tell me how to replace it in group? `preg_replace('/[\x0E-\x1f]/', '', $string);`? — Grzegorz, Nov 21 '16 at 16:45
@Grzegorz I'm not sure what you mean by that. Do you mean how to say that the codes in square brackets (`[\xc2\xa0]`) are a single character and not two? — David Ferenczy Rogožan, Nov 21 '16 at 17:17
Encodings (UTF-8) are not my strong point. For example, with `preg_replace('/[\x0E-\x1F\xc2\xa0]/', '', $string);` it would replace either `\xc2` or `\xa0`. How do I include it in the regex so it only replaces `\xc2\xa0` and leaves `\xc2` intact? — Grzegorz, Nov 22 '16 at 06:31
Sorry, I'm not sure about that. Did you solve it already? If not, you can probably create a new question to address that. — David Ferenczy Rogožan, Mar 14 '17 at 13:03
@DavidFerenczyRogožan Is it possible to trim both these `\xc2\xa0` and space with the same `trim` function? like `trim($str, "\xC2\xA0")` or is there any way to do like that? — Saroj Shrestha, Nov 07 '20 at 13:31

score 14 · Answer 2 · answered May 29 '20 at 08:57

Sanitize every type of white spaces.

preg_replace("/\s+/u", " ", $str);

https://stackoverflow.com/a/40264711/635364

FYI, PHP Sanitization filter_var() has no filter about these white spaces.
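
As a rough illustration of the one-liner above (a sketch assuming valid UTF-8 input and PHP 7+; the sample string is made up): with the u modifier, \s also covers Unicode whitespace such as the non-breaking space, so mixed runs collapse to a single regular space, and a plain trim can then remove the remaining ASCII spaces at both ends.

```php
<?php
// Mixed whitespace: NBSP, regular spaces, a tab and a newline.
$str = "\u{00A0} a\u{00A0}\u{00A0}s\t\n d \u{00A0}";

// Collapse every run of whitespace (including NBSP) into one space.
$collapsed = preg_replace('/\s+/u', ' ', $str);

// Only ASCII spaces are left at the ends, so trim() is now sufficient.
$trimmed = trim($collapsed);

var_dump($collapsed);  // string(7) " a s d "
var_dump($trimmed);    // string(5) "a s d"
```

With the u modifier preg_replace returns null when the subject is not valid UTF-8, so a null check is advisable for untrusted input.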

Jehong Ahn

This is definitely the best option and should be the selected answer. – Moritz Friedrich Feb 06 '22 at 16:56
The only answer that worked for me! – user3382203 Apr 08 '22 at 08:29
