Assuming I have a string which is "a s d d" and htmlentities turns it into
"a&nbsp;s&nbsp;d&nbsp;d".

How can I replace it (using preg_replace) without encoding it to entities?

I tried preg_replace('/[\xa0]/', '', $string);, but it's not working. I'm trying to remove those special characters from my string as I don't need them.

What are the possibilities beyond regular expressions?

Edit String I want to parse: http://pastebin.com/raw/7eNT9sZr
with function preg_replace('/[\r\n]+/', "[##]", $text)
for later implode("</p><p>", explode("[##]", $text))

My question is not exactly "how" to do this (since I could encode entities, remove the entities I don't need and decode entities), but how to remove those characters with just str_replace or preg_replace.

Grzegorz
  • `htmlentities` is prevention against xss. If you want to render in browser, the &nbsp will be evaluated as space only. If not then there is no use of the function – georoot Nov 21 '16 at 16:13
  • do you want to replace the spaces or the `&nbsp;`? – Joshua Nov 21 '16 at 16:14
  • @georoot htmlentities prevents bad HTML output (ie. it ensures that information is emitted, not data), XSS is just maliciously crafted bad data. – user2864740 Nov 21 '16 at 16:14
  • `$string` == `a s d d` or `a&nbsp;s&nbsp;d&nbsp;d`? – chris85 Nov 21 '16 at 16:14
  • `htmlentities("a s d d")` outputs `"a&nbsp;s&nbsp;d&nbsp;d"` – Grzegorz Nov 21 '16 at 16:16
  • @user2864740 exactly my point. You use `htmlentities` if you want to render in browser in which case &nbsp doesn't make any difference. If you don't want to render in browser there is no use of the function – georoot Nov 21 '16 at 16:16
  • @georoot The information in HTML of " " and "&nbsp;" is different. One is a space. One is a non-breaking space. Only a non-breaking space is encoded as "&nbsp;", not a normal space. – user2864740 Nov 21 '16 at 16:16
  • It's not for displaying, its for storing in database only. Only solution i can come up with atm is htmlenitities > str_replace > entities_decode chain – Grzegorz Nov 21 '16 at 16:17
  • @Grzegorz Use SQL parameterized queries for "storing to the database". In any case the input data *already contains* a not-a-normal-space. – user2864740 Nov 21 '16 at 16:17
  • @Grzegorz what is the point of using perl expressions? Why not to use str_replace in this case? – Victor Rudkov Nov 21 '16 at 16:18
  • http://pastebin.com/raw/7eNT9sZr Here is string I want to make into html. Replace multiple \r\n (which are divided by \xa0) and make pretty html. – Grzegorz Nov 21 '16 at 16:20
  • I think he is looking for a way to remove the non-breaking spaces from the string WITHOUT turning them into HTML entities first. – simon Nov 21 '16 at 16:22
  • In what encoding is your string? Is it UTF-8? If yes, I would say that non-breakable space is `0xc2a0` there. – David Ferenczy Rogožan Nov 21 '16 at 16:22
  • It's utf8. Also tried `\xc2a0` nothing is working and I'm wondering **WHY**. I want to know how it works, not how to do this :) – Grzegorz Nov 21 '16 at 16:24

2 Answers

Problem Explanation

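The reason it's not working is that you are specifying the non-breaking space incorrectly.

The non-breaking space (U+00A0) is encoded in UTF-8 as the two-byte sequence 0xC2A0, that is 0xC2 (194) followed by 0xA0 (160). So technically, your pattern specifies only half of the character's code. The `\xc2a0` attempt fails for a related reason: in a pattern, \x reads only two hex digits, so it means the byte 0xC2 followed by the literal characters "a0".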
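To make the byte-level picture concrete, here is a minimal sketch (the variable names are mine, not from the question) that dumps the bytes of a string containing a non-breaking space and shows what the original pattern does to it. It assumes the input really is UTF-8.

```php
<?php
// U+00A0 (non-breaking space) written as its two UTF-8 bytes.
$nbsp = "\xC2\xA0";
$text = "a{$nbsp}s";

// Four bytes in total: 'a', 0xC2, 0xA0, 's'.
$hex = bin2hex($text);

// Without the /u modifier the pattern works on single bytes, so it
// removes only 0xA0 and leaves a stray 0xC2 behind.
$broken = preg_replace('/[\xa0]/', '', $text);
$brokenHex = bin2hex($broken);

// An empty pattern with /u fails to run on invalid UTF-8, which makes
// it a cheap validity check that needs no extra extension.
$stillValid = preg_match('//u', $broken) === 1;

echo $hex, "\n";                           // 61c2a073
echo $brokenHex, "\n";                     // 61c273
echo var_export($stillValid, true), "\n";  // false
```

So the original call does change the string, just not in a useful way: it cuts the character in half and the leftover byte makes the text invalid UTF-8.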

A Bit of Theory

Legacy character encodings used a constant number of bits to encode every character in their set. For example, the original ASCII encoding used 7 bits per character, and extended ASCII used 8 bits.

The UTF-8 encoding is a so-called variable-width character encoding, which means that the number of bits used to represent individual characters varies. In UTF-8, each character is encoded in one to four 8-bit bytes (octets). Characters with lower code points, such as ASCII, have shorter encodings, while characters with higher code points have longer ones. Unlike Huffman coding, the lengths are not derived from how frequently a character is used, but for ASCII-heavy text the effect is similar: the average text stays compact.
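As a quick illustration of the variable width, the sketch below measures the byte length of a few single characters. The sample set is my own choice, and the "\u{...}" escape assumes PHP 7.0 or later.

```php
<?php
// Each value is exactly one character; strlen() counts bytes, not characters.
$samples = [
    'letter' => 'a',          // U+0061, plain ASCII
    'nbsp'   => "\u{00A0}",   // U+00A0, non-breaking space
    'euro'   => "\u{20AC}",   // U+20AC, euro sign
    'emoji'  => "\u{1F600}",  // U+1F600, grinning face
];

// array_map() keeps the string keys when given a single array.
$byteLengths = array_map('strlen', $samples);

foreach ($byteLengths as $name => $length) {
    echo "$name: $length byte(s)\n";
}
```

The non-breaking space lands in the two-byte group, which is why a one-byte pattern cannot match it as a whole.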

Solution

You can replace all occurrences of the UTF-8 non-breaking space in text using a simple (and fast) str_replace or using a more flexible regular expression, depending on your needs:

// faster solution
$regular_spaces = str_replace("\xc2\xa0", ' ', $original_string);

// more flexible solution
$regular_spaces = preg_replace('/\xc2\xa0/', ' ', $original_string);

Notes

Note that in the case of str_replace, you have to use double quotes (") to enclose the search string, because str_replace doesn't understand the textual representation of character codes; it needs those codes to be converted into actual characters first. PHP does that automatically: strings enclosed in double quotes are processed, and special sequences (e.g. the newline character \n, textual representations of character codes, etc.) are replaced by the actual characters (e.g. 0x0A for \n in UTF-8) before the string value is used.

In contrast, the preg_replace function itself understands the textual representation of character codes, so you don't need PHP to convert them into actual characters, and you can use apostrophes (single quotes, ') to enclose the search string in this case.
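The quoting difference is easy to check directly. This is only a small sketch with made-up variable names, comparing the same escape text in single and double quotes:

```php
<?php
$text = "a\xC2\xA0s";

$single = '\xc2\xa0';  // eight literal characters: \ x c 2 \ x a 0
$double = "\xc2\xa0";  // two bytes: 0xC2 0xA0

$singleLength = strlen($single);
$doubleLength = strlen($double);

// str_replace() searches for exactly the bytes it is given.
$unchanged = str_replace($single, ' ', $text);  // nothing to find
$replaced  = str_replace($double, ' ', $text);  // "a s"

// The regex engine decodes \xc2\xa0 itself, so single quotes are fine here.
$viaRegex = preg_replace('/\xc2\xa0/', ' ', $text);

echo $singleLength, ' ', $doubleLength, "\n";
echo bin2hex($unchanged), ' ', $replaced, ' ', $viaRegex, "\n";
```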

David Ferenczy Rogožan

Sanitize every type of whitespace.

preg_replace("/\s+/u", " ", $str);

https://stackoverflow.com/a/40264711/635364

FYI, PHP's sanitization function filter_var() has no filter for these whitespace characters.
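A short sketch of how the pattern above behaves, next to a narrower variant that I am adding for comparison (the \x{00A0} pattern is not part of this answer). It assumes valid UTF-8 input and PHP 7.0 or later for the "\u{...}" escapes.

```php
<?php
// Mixed whitespace: two non-breaking spaces, a space plus a tab, one more nbsp.
$str = "a\u{00A0}\u{00A0}s \t d\u{00A0}d";

// With /u, \s also covers Unicode whitespace such as U+00A0,
// and + collapses each run into a single regular space.
$collapsed = preg_replace('/\s+/u', ' ', $str);

// Narrower variant: target only U+00A0 by code point, leave the tab alone.
$onlyNbsp = preg_replace('/\x{00A0}/u', ' ', "a\u{00A0}s\td");

echo $collapsed, "\n";
echo bin2hex($onlyNbsp), "\n";
```

Keep in mind that \s also matches \r and \n, so if line breaks need to survive for a later split (as in the question's paragraph handling), the narrower pattern is probably the safer starting point.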

Jehong Ahn