6

I have a VB.NET program that processes the content of documents. The program handles high volumes of documents in batches (more than 2 million documents; about 1 TB in total). Some of these documents may contain control characters or characters such as U+F0E8 (http://www.fileformat.info/info/unicode/char/f0e8/browsertest.htm).

Is there an easy and, above all, fast way to remove those characters (except space, newline, tab, ...)? If the answer is regex: does anyone have a complete regex for me?

Thanks!

Joel Coehoorn
Mimefilt
  • What's the problem with the control characters? I'm assuming that they are appropriate for the documents themselves. – Lazarus Dec 21 '10 at 15:30
  • The program uses different parsers (Word, PDF, ...) and also deals with plain-text and XML files. Sometimes the extracted "body"/content string still contains annoying chars like "f0e8", so I have to remove them myself. – Mimefilt Dec 21 '10 at 15:35
  • http://www.utf8-chartable.de/unicode-utf8-table.pl?start=61568&number=512 says that f0e8 is a UTF-8 char, or am I wrong? – Mimefilt Dec 21 '10 at 15:48
  • Yes, the extractor doesn't remove all "design" chars, but I can't change it. – Mimefilt Dec 21 '10 at 16:47
  • For future reference see section "Unicode Character Properties" here: http://www.regular-expressions.info/unicode.html – Geoffrey Dec 21 '10 at 18:06

2 Answers

17

Try

resultString = Regex.Replace(subjectString, "\p{C}+", "")

This will remove all "other" Unicode characters (control, format, private use, surrogate, and unassigned) from your string.
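Note that tab, carriage return and line feed are themselves control characters (category Cc), so `\p{C}` strips them as well, while the question asks to keep them. A minimal sketch of one way to exclude them, assuming .NET's character class subtraction syntax and a precompiled regex reused across the whole batch; the module and function names are hypothetical:

```vbnet
Imports System.Text.RegularExpressions

Module ControlCharCleaner
    ' Matches characters in Unicode category C (Cc, Cf, Co, Cs, Cn),
    ' minus tab, CR and LF, via .NET character class subtraction.
    ' Built once and reused, since the batch contains millions of documents.
    Private ReadOnly OtherChars As New Regex("[\p{C}-[\t\r\n]]+", RegexOptions.Compiled)

    ' Hypothetical helper name, not taken from the thread.
    Public Function StripOtherChars(input As String) As String
        Return OtherChars.Replace(input, String.Empty)
    End Function
End Module
```

Calling `StripOtherChars(body)` on each extracted string would then drop U+F0E8 (a private-use character) but leave line breaks and tabs alone. Since .NET strings are UTF-16 and category C includes surrogates, this pattern would probably also remove characters outside the Basic Multilingual Plane; if the documents may contain those, the class would need to be narrowed.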

Tim Pietzcker
0

The POSIX character class for control characters is [:cntrl:] (from the Regular expression article on Wikipedia).
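.NET's regex engine does not support POSIX bracket classes such as `[:cntrl:]`; the nearest counterparts there are `\p{Cc}` in a pattern or `Char.IsControl` in code. As a rough sketch of the latter, assuming a plain character loop is acceptable and that tab, CR and LF should be kept (the module and function names are hypothetical):

```vbnet
Imports System.Text

Module CntrlFilter
    ' Hypothetical helper: copies every character except control characters
    ' (Unicode category Cc), while keeping tab, carriage return and line feed.
    Public Function RemoveControlChars(input As String) As String
        Dim sb As New StringBuilder(input.Length)
        For Each c As Char In input
            Dim keep As Boolean = Not Char.IsControl(c) OrElse
                c = ControlChars.Tab OrElse
                c = ControlChars.Cr OrElse
                c = ControlChars.Lf
            If keep Then sb.Append(c)
        Next
        Return sb.ToString()
    End Function
End Module
```

This only covers control characters in the narrow sense, so a private-use character such as U+F0E8 would pass through unchanged. Whether the loop is actually faster than a regex on the real data would have to be measured.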

Geoffrey