1

I have looked at a number of related SO posts on this. I have a malformed string that contains Unicode characters which I want to strip out.

string testString = "\0\u0001\0\0\0����\u0001\0\0\0\0\0\0\0\u0011\u0001\0\0\0\u0004\0\0\0\u0006\u0002\0\0\0\u0005The\u0006\u0003\0\0\0\u0017boy\u0006\u0004\0\0\0\tKicked\u0006\u0005\0\0\0\u0013the Ball\v";

I would like the following output:

The boy kicked the Ball

How can I achieve this?

I have looked at the following, without much success:

  1. How can you strip non-ASCII characters from a string? (in C#)
  2. Converting unicode characters (C#) Testing
  3. How to Remove '\0' from a string in C#?
  4. Removing unwanted character from column (SQL Server related, so not relevant to my question)
Harold_Finch
  • What's the actual source of `testString`? I assume it's not hard-coded like that in your real code. – Enigmativity Jun 26 '20 at 04:29
  • @Enigmativity I got this as a result of decrypting an encrypted byte[] array via RSA asymmetric encryption, i.e. `string testString = Encoding.UTF8.GetString(encryptedByteArray, 0, encryptedByteArray.Length);` gave me what I posted in the question. I only changed the actual strings. – Harold_Finch Jun 26 '20 at 04:38
  • Then I suspect that you need to get your decryption character encoding right. I don't think this is an issue of stripping the Unicode characters. You're doing two wrongs, which don't make a right. Can you please post your decrypted byte array? Then we can probably get your string cleanly without the need to strip anything. – Enigmativity Jun 26 '20 at 05:41

4 Answers

1

testString = Regex.Replace(testString, @"[\u0000-\u0008\u000A-\u001F\u0100-\uFFFF]", "");

or

testString = Regex.Replace(testString, @"[^\t\r\n -~]", "");
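Applied to the string from the question, both patterns delete the junk but also fuse the words together, because the separators between the words are themselves control characters. A minimal sketch, assuming you want a single space wherever a run of junk used to be; the class and method names are hypothetical, and the four `\uFFFD` escapes stand in for the replacement characters in the original string:

```csharp
using System;
using System.Text.RegularExpressions;

public static class ControlCharStripper
{
    // Hypothetical helper: replaces every run of characters outside
    // printable ASCII (space through tilde) with one space, then trims.
    public static string ReplaceJunkWithSpaces(string input)
    {
        return Regex.Replace(input, @"[^ -~]+", " ").Trim();
    }

    public static void Main()
    {
        string testString = "\0\u0001\0\0\0\uFFFD\uFFFD\uFFFD\uFFFD\u0001\0\0\0\0\0\0\0\u0011\u0001\0\0\0\u0004\0\0\0\u0006\u0002\0\0\0\u0005The\u0006\u0003\0\0\0\u0017boy\u0006\u0004\0\0\0\tKicked\u0006\u0005\0\0\0\u0013the Ball\v";

        // Second pattern from the answer: pure removal keeps the tab
        // but leaves nothing between the other words.
        string stripped = Regex.Replace(testString, @"[^\t\r\n -~]", "");
        Console.WriteLine(stripped.Replace("\t", "\\t")); // Theboy\tKickedthe Ball

        // Space-replacement variant.
        Console.WriteLine(ReplaceJunkWithSpaces(testString)); // The boy Kicked the Ball
    }
}
```

Note that the tab is treated as junk in the variant, so it also becomes a space.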

JoelFan
1

I use this regular expression to filter out bad characters in a filename.

Regex.Replace(directory, @"[^a-zA-Z0-9\\:_\- ]", "")
sep7696
0

Try this:

string s = "søme string";
s = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);

Hope it helps.
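Keep in mind that `\u0000-\u007F` is the whole ASCII range, control characters included, so this pattern on its own leaves `\0` and similar characters in place. A short sketch to illustrate (the class and method names are hypothetical):

```csharp
using System;
using System.Text.RegularExpressions;

public static class NonAsciiStripper
{
    // Removes every character above U+007F; ASCII control characters survive.
    public static string RemoveNonAscii(string input)
    {
        return Regex.Replace(input, @"[^\u0000-\u007F]+", string.Empty);
    }

    public static void Main()
    {
        // \u00F8 is the "ø" from the example above.
        Console.WriteLine(RemoveNonAscii("s\u00F8me string")); // sme string

        // Only \uFFFD is removed; \0 and \u0005 are still in the result.
        string result = RemoveNonAscii("\0\u0005The\uFFFD");
        Console.WriteLine(result.Length); // 5
    }
}
```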

Raul Marquez
0

Instead of trying to remove the Unicode characters, why not just extract the runs of printable ASCII characters:

var str = string.Join(" ", new Regex("[ -~]+").Matches(testString).Select(m => m.Value));
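On .NET Framework, `MatchCollection` only implements the non-generic `IEnumerable`, so `Select` needs a `Cast<Match>()` in front of it; on .NET Core 2.0 and later it can be called directly. A sketch of the same idea that should compile on both, run against the string from the question (the class and method names are hypothetical, and `\uFFFD` stands in for the replacement characters):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class AsciiRunExtractor
{
    // Collects each run of printable ASCII (space through tilde)
    // and joins the runs with single spaces.
    public static string JoinPrintableRuns(string input)
    {
        var runs = Regex.Matches(input, "[ -~]+")
                        .Cast<Match>()
                        .Select(m => m.Value);
        return string.Join(" ", runs);
    }

    public static void Main()
    {
        string testString = "\0\u0001\0\0\0\uFFFD\uFFFD\uFFFD\uFFFD\u0001\0\0\0\0\0\0\0\u0011\u0001\0\0\0\u0004\0\0\0\u0006\u0002\0\0\0\u0005The\u0006\u0003\0\0\0\u0017boy\u0006\u0004\0\0\0\tKicked\u0006\u0005\0\0\0\u0013the Ball\v";
        Console.WriteLine(JoinPrintableRuns(testString)); // The boy Kicked the Ball
    }
}
```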
JohanP