How to remove control chars from UTF8 string

Question

I have a VB.NET program that handles the content of documents. The program processes high volumes of documents as a batch (more than 2 million documents; 1 TB in total). Some of these documents may contain control chars or chars like f0e8 (http://www.fileformat.info/info/unicode/char/f0e8/browsertest.htm).

Is there an easy and, above all, fast way to remove those chars (everything except space, newline, tab, ...)? If the answer is regex: does anyone have a complete regex for me?

Thanks!

What's the problem with the control characters? I'm assuming that they are appropriate for the documents themselves. – Lazarus, Dec 21 '10 at 15:30
The program uses different parsers (Word, PDF, ...) and deals with plain text and XML files. Sometimes the extracted "body"/content string still contains annoying chars like "f0e8", so I have to remove them myself. – Mimefilt, Dec 21 '10 at 15:35
http://www.utf8-chartable.de/unicode-utf8-table.pl?start=61568&number=512 says that f0e8 is a UTF-8 char, or am I wrong? – Mimefilt, Dec 21 '10 at 15:48
Yes, the extractor doesn't remove all "design" chars, but I can't change it. – Mimefilt, Dec 21 '10 at 16:47
For future reference see section "Unicode Character Properties" here: http://www.regular-expressions.info/unicode.html – Geoffrey, Dec 21 '10 at 18:06

Tim Pietzcker · Accepted Answer · 2010-12-21T18:20:35.210

17

Try

resultString = Regex.Replace(subjectString, "\p{C}+", "")

This will remove all "other" Unicode characters (control, format, private use, surrogate, and unassigned) from your string.
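Note that tab, CR and LF are themselves control characters (category Cc), so the pattern above removes them too, while a plain space (category Zs) is left alone. Since the question asks to keep newline and tab, here is a minimal sketch of one way to do that with .NET character class subtraction. The names ControlCharDemo, Unwanted and StripUnwanted are made up for illustration, and whether RegexOptions.Compiled actually helps on a batch of this size is an assumption that should be measured.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ControlCharDemo
    ' \p{C} = control, format, private use, surrogate and unassigned chars.
    ' The "-[\t\n\r]" part subtracts tab, LF and CR so they survive.
    ' The Regex is built once and reused for every document in the batch.
    ' Caveat: .NET matches on UTF-16 code units, so \p{C} also matches the
    ' two halves of a surrogate pair. If characters outside the BMP must be
    ' kept, "[\p{Cc}\p{Cf}\p{Co}\p{Cn}-[\t\n\r]]+" may be the safer pattern.
    Private ReadOnly Unwanted As New Regex("[\p{C}-[\t\n\r]]+", RegexOptions.Compiled)

    ' Hypothetical helper: returns the input without the unwanted chars.
    Function StripUnwanted(input As String) As String
        Return Unwanted.Replace(input, String.Empty)
    End Function

    Sub Main()
        ' U+F0E8 (private use) and NUL are dropped; tab and CR/LF are kept.
        Dim sample As String = "a" & ChrW(&HF0E8) & "b" & vbTab & "c" & ChrW(0) & vbCrLf
        Console.WriteLine(StripUnwanted(sample).Length)
    End Sub
End Module
```

Reusing one Regex instance avoids re-parsing the pattern for each of the millions of documents, which is probably where most of the avoidable overhead would be.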

edited Dec 21 '10 at 18:20

answered Dec 21 '10 at 16:04

Tim Pietzcker


Thank you very much :D Works well! I hope it won't slow down the process too much. – Mimefilt Dec 21 '10 at 16:46
Why is the @ not accepted in Visual Basic? I get the "expression expected" error at the @. – Geoffrey Dec 21 '10 at 17:31
Oops. I had overlooked the VB part, and my knee-jerk reaction to the .NET tag was to provide a C# code snippet. Will edit. Thanks! – Tim Pietzcker Dec 21 '10 at 18:20
Is there an overview of which chars are removed by "\p{C}+"? Thanks! – Mimefilt Dec 21 '10 at 20:01
Check out http://www.unicode.org/charts/, scroll down to the bottom and look at the rightmost column. – Tim Pietzcker Dec 21 '10 at 20:22
See here for a C# version: https://stackoverflow.com/a/40568888/430742 – Jpsy Mar 21 '22 at 14:32

score 0 · Answer 2 · answered Dec 21 '10 at 15:51


Here is the POSIX regex for control characters: [:cntrl:], from Regular Expression on Wikipedia.

answered Dec 21 '10 at 15:51

Geoffrey


Posix is quite dead, may it rest in pieces. – Hans Passant Dec 21 '10 at 17:06
