How to fix UTF encoding for whitespaces?

Question

In my C# code, I am extracting text from a PDF document. When I do that, I get a string that's in UTF-8 or Unicode encoding (I'm not sure which). When I use Encoding.UTF8.GetBytes(src); to convert it into a byte array, I notice that the whitespace is actually two characters with byte values of 194 and 160.

For example the string "CLE action" looks like

[67, 76, 69, 194, 160, 65, 99, 116, 105, 111, 110]

in a byte array, where the whitespace is the pair 194, 160. Because of this, src.IndexOf("CLE action"); returns -1 when I need it to return 1.

How can I fix the encoding of the string?

score 32 · Accepted Answer · edited Aug 18 '16 at 16:44

194 160 is the UTF-8 encoding of a NO-BREAK SPACE codepoint, U+00A0 (the same codepoint that HTML calls &nbsp;).
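To check the byte values yourself, a minimal sketch along these lines may help. The class and method names (NbspBytes, Utf8Bytes) are made up for illustration; only Encoding.UTF8.GetBytes comes from the question.

```csharp
using System;
using System.Text;

public static class NbspBytes
{
    // Hypothetical helper: returns the UTF-8 bytes of a string
    // as a comma-separated list of decimal values.
    public static string Utf8Bytes(string s) =>
        string.Join(", ", Encoding.UTF8.GetBytes(s));
}
```

NbspBytes.Utf8Bytes("\u00A0") gives "194, 160", while NbspBytes.Utf8Bytes(" ") gives "32", so the two characters are distinct even though they render alike.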

So it's really not a space, even though it looks like one. (You'll see it won't word-wrap, for instance.) A regular expression match for \s would match it, but a plain comparison with a space won't.
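The difference between the two kinds of test can be sketched as follows, assuming the default .NET regex options (not RegexOptions.ECMAScript). The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceCheck
{
    // Plain comparison: true only for the ordinary space U+0020.
    public static bool IsPlainSpace(char c) => c == ' ';

    // Regex test: \s matches Unicode whitespace, which includes U+00A0.
    public static bool MatchesRegexWhitespace(char c) =>
        Regex.IsMatch(c.ToString(), @"\s");
}
```

For '\u00A0', IsPlainSpace returns false and MatchesRegexWhitespace returns true.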

To simply replace NO-BREAK spaces you can do the following:

src = src.Replace('\u00A0', ' ');
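As a rough before-and-after check of that replacement against the search from the question, something like this should do. The wrapper name is hypothetical, and the example string is a stand-in for the extracted PDF text.

```csharp
using System;

public static class NbspFix
{
    // Replaces every NO-BREAK SPACE (U+00A0) with an ordinary space (U+0020).
    public static string ReplaceNoBreakSpaces(string src) =>
        src.Replace('\u00A0', ' ');
}
```

With src = "CLE\u00A0action", src.IndexOf("CLE action", StringComparison.Ordinal) returns -1, and the same call on NbspFix.ReplaceNoBreakSpaces(src) returns 0. StringComparison.Ordinal is used here so that the result does not depend on culture-specific comparison rules.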

edited Aug 18 '16 at 16:44 by Gone Coding · answered Dec 21 '12 at 15:40 by RichieHindle

How can I replace a non-breaking space with an ordinary space? – omega Dec 21 '12 at 16:30

@omega: src = src.Replace('\u00A0', ' '); – RichieHindle Dec 21 '12 at 16:33

score 3 · Answer 2 · answered Dec 21 '12 at 15:45

In UTF-8, the byte sequence C2 A0 (194 160) encodes NO-BREAK SPACE (U+00A0). According to ISO/IEC 8859, this is a space at which a line break may not be inserted. Text-processing software normally assumes that a line break can be inserted at any whitespace character (this is how word wrap is usually implemented). You should be able to fix the problem by simply replacing these characters in your string with a normal space.
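If the extracted text may contain other space-like characters besides U+00A0 (this is an assumption, the question only shows the no-break space), one possible generalization is to replace every character in the Unicode "space separator" category. The names below are hypothetical.

```csharp
using System.Globalization;
using System.Text;

public static class SpaceNormalizer
{
    // Replaces every Unicode space separator (category Zs, e.g. U+00A0,
    // U+2003) with an ordinary space. Tabs and line breaks are left as-is.
    public static string NormalizeSpaces(string src)
    {
        var sb = new StringBuilder(src.Length);
        foreach (char c in src)
        {
            bool isSpaceSeparator =
                char.GetUnicodeCategory(c) == UnicodeCategory.SpaceSeparator;
            sb.Append(isSpaceSeparator ? ' ' : c);
        }
        return sb.ToString();
    }
}
```

For example, SpaceNormalizer.NormalizeSpaces("CLE\u00A0action") returns "CLE action".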

answered Dec 21 '12 at 15:45 by Kevin

How can I write the string replace function? – omega Dec 21 '12 at 16:04

@omega: src = src.Replace('\u00A0', ' '); – RichieHindle Dec 21 '12 at 16:33

score 2 · Answer 3 · answered Dec 21 '12 at 15:40

Interpreting \xC2\xA0 (= 194, 160) as UTF-8 yields U+00A0, which is the Unicode non-breaking space. This is a different character from an ordinary space and therefore doesn't match one. You have to match against the non-breaking space explicitly, or use fuzzy matching against any whitespace.
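One way to sketch the fuzzy-matching option is to turn the search phrase into a regular expression in which each gap between words matches any run of whitespace. The class and method names here are hypothetical, and this is only one of several ways to do it.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class FuzzyWhitespaceSearch
{
    // Returns the index of the first occurrence of phrase in src, where each
    // run of spaces in phrase may match one or more whitespace characters of
    // any kind (including U+00A0). Returns -1 if there is no match.
    public static int IndexOfIgnoringSpaceKind(string src, string phrase)
    {
        var parts = phrase
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(Regex.Escape);          // escape regex metacharacters in each word
        string pattern = string.Join(@"\s+", parts);
        Match m = Regex.Match(src, pattern);
        return m.Success ? m.Index : -1;
    }
}
```

FuzzyWhitespaceSearch.IndexOfIgnoringSpaceKind("CLE\u00A0action", "CLE action") returns 0 without modifying src, which may be preferable when the original text has to be kept intact.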
