I've got a text input from a mobile device. It contains emoji. In C#, I have the text as

Text  text

Simply put, I want the output text to be

Text text

I'm trying to remove all such emoji from the text with a regex, except I'm not sure how to convert an emoji into its Unicode sequence. How do I do that?

edit:

I'm trying to save the user input into MySQL. It looks like MySQL's utf8 character set doesn't really support all Unicode characters, and the right way to fix that would be to change the schema, but I don't think that is an option for me. So I'm trying to just remove all the emoji characters before saving the text to the database.

This is my schema for the relevant column:

(screenshot of the column schema; image not available)

I'm using NHibernate as my ORM, and the insert query it generates looks like this:

Insert into `Content` (ContentTypeId, Comments, DateCreated) 
values (?p0, ?p1, ?p2);
?p0 = 4 [Type: Int32 (0)]. ?p1 = 'Text  text' [Type: String (20)], ?p2 = 19/01/2015 10:38:23 [Type: DateTime (0)]

When I copy this query from the logs and run it on MySQL directly, I get this error:

1 warning(s): 1366 Incorrect string value: '\xF0\x9F\x98\x80 t...' for column 'Comments' at row 1   0.000 sec

Also, I've tried converting it into encoded bytes, and that doesn't really work either.

(screenshot of the encoding attempt; image not available)

LocustHorde
  • It's not really clear what you're trying to achieve - what would you do with the string value after replacing the characters? – Jon Skeet Jan 19 '15 at 11:34
  • @JonSkeet edited the post, thanks. – LocustHorde Jan 19 '15 at 11:38
  • UTF-8 really *should* be fine here. Can you post the details of how you're currently trying to save the data, along with your schema information? – Jon Skeet Jan 19 '15 at 11:41
  • See here: https://gist.github.com/adamlwatson/9623703 – Octopoid Jan 19 '15 at 11:41
  • (Assuming you actually want to remove them, rather than sort your encoding) – Octopoid Jan 19 '15 at 11:42
  • @JonSkeet added the info. – LocustHorde Jan 19 '15 at 11:58
  • @LocustHorde Which version of MySQL are you running on? Seemingly the character set utf8mb4 should make everything tikitiboo... have a read of the answer here http://stackoverflow.com/questions/24253985/mysql-utf-8-and-emoji-characters "It seems that MySQL supports two forms of unicode ucs2 which is 16-bits per character and utf8 up to 3 bytes per character. The bad news is that neither form is going to support plane 1 characters which require at 17 bits. (mainly emoji). It looks like MySQL 5.5.3 and up also support utf8mb4, utf16, and utf32 and supplementary characters (read emoji)" – Paul Zahra Jan 19 '15 at 12:00
  • You haven't actually shown the code you're using. The error message doesn't seem to fit with the UTF-8 encoding for either of those values, which is odd... – Jon Skeet Jan 19 '15 at 12:00
  • @JonSkeet yea, I was testing with a few emojis so the message is for another emoji. Also, not sure what you mean by code? I'm using a regular nhibernate repository that saves the object with `public virtual String Comments { get; set; }` property. The insert query produced is fine, it's just that mysql db can't handle the unicode. – LocustHorde Jan 19 '15 at 12:04
  • @PaulZahra I don't think changing the schema is an option, but I will try to talk to the DBA about it! What I need is something like what Octopoid has mentioned, but in C#. I just can't seem to regex-match the emojis! – LocustHorde Jan 19 '15 at 12:08
  • Something to be aware of from http://stackoverflow.com/questions/10992921/how-to-remove-emoji-code-using-javascript "However, note that there are other characters in the Basic Multilingual Plane that are used as emoji by phones but which long predate emoji. For example U+2665 is the traditional Heart Suit character ♥, but it may be rendered as an emoji graphic on some devices. It's up to you whether you treat this as emoji and try to remove it." – Paul Zahra Jan 19 '15 at 12:32
  • Octopoid's gist doesn't convert them, it *removes* them. If you want to just remove any characters not in the BMP, that's reasonably easy. – Jon Skeet Jan 19 '15 at 12:46
  • @JonSkeet yup - I do want to just remove them! but to remove them I must regex match them and that's where I'm stuck now. – LocustHorde Jan 19 '15 at 13:23
  • "So convert to corresponding \uxxxx characters" is just a red herring? – Jon Skeet Jan 19 '15 at 13:30

1 Answer

Assuming you just want to remove all non-BMP characters, i.e. anything with a Unicode code point of U+10000 and higher, you can use a regex to remove any UTF-16 surrogate code units from the string. For example:

using System;
using System.Text.RegularExpressions;

class Test
{
    static void Main(string[] args)
    {
        // 'x', one non-BMP character (U+1F310, stored as a surrogate pair), 'y'
        string text = "x\U0001F310y";
        Console.WriteLine(text.Length); // 4
        // Remove every UTF-16 surrogate code unit
        string result = Regex.Replace(text, @"\p{Cs}", "");
        Console.WriteLine(result.Length); // 2
    }
}

Here "Cs" is the Unicode category for "surrogate".

It appears that Regex works based on UTF-16 code units rather than Unicode code points, otherwise you'd need a different approach.
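
As a rough illustration of the code-unit behaviour, the sketch below checks that a single non-BMP code point occupies two `char` values and that `\p{Cs}` matches each of them separately. It also shows one possible non-regex alternative built on `char.IsSurrogate`. The class name `SurrogateDemo` and the helper `StripSurrogates` are made up for this example and are not part of any library.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SurrogateDemo
{
    // Hypothetical helper: drops every UTF-16 surrogate code unit without using Regex.
    public static string StripSurrogates(string input) =>
        new string(input.Where(c => !char.IsSurrogate(c)).ToArray());

    public static void Main()
    {
        string globe = "\U0001F310"; // one code point, U+1F310

        Console.WriteLine(globe.Length);                          // 2: a high and a low surrogate
        Console.WriteLine(char.IsHighSurrogate(globe[0]));        // True
        Console.WriteLine(char.IsLowSurrogate(globe[1]));         // True
        Console.WriteLine(Regex.Matches(globe, @"\p{Cs}").Count); // 2: each code unit matches on its own
        Console.WriteLine(StripSurrogates("x" + globe + "y"));    // xy
    }
}
```

Either approach discards both halves of every surrogate pair, so the choice between them is mostly a matter of taste.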

Note that there are non-BMP characters other than emoji, but I suspect you'll find they'll have the same problem when you try to store them.
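
To make that caveat concrete, here is a small sketch under the same assumptions. It wraps the pattern in a hypothetical helper called `RemoveNonBmp` (my name, not an existing API) and runs it on two sample characters: U+1D11E (a musical symbol that is not an emoji but lies outside the BMP) and U+2665 (a heart that some devices render as an emoji but that lies inside the BMP).

```csharp
using System;
using System.Text.RegularExpressions;

public static class NonBmpDemo
{
    // Hypothetical helper wrapping the same pattern as above.
    public static string RemoveNonBmp(string input) =>
        Regex.Replace(input, @"\p{Cs}", "");

    public static void Main()
    {
        // U+1D11E MUSICAL SYMBOL G CLEF: not an emoji, but non-BMP, so it is removed.
        Console.WriteLine(RemoveNonBmp("a\U0001D11Eb"));    // ab

        // U+2665 BLACK HEART SUIT: inside the BMP, so it is kept.
        Console.WriteLine(RemoveNonBmp("a\u2665b").Length); // 3
    }
}
```

So the pattern is really a "non-BMP filter" rather than an "emoji filter": it removes more than emoji in one direction and less in the other.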

Jon Skeet
  • Hi, I made the question to describe what I thought was my problem.. but I tried out your answer and it turns out I don't actually need to convert them.. So I have edited the question now! http://i.imgur.com/NoQfxud.png Thank you! – LocustHorde Jan 19 '15 at 14:48
  • @LocustHorde: So long as you're aware that you're just throwing away bits of the user's input... – Jon Skeet Jan 19 '15 at 14:54
  • Yea! this is a temporary solution (hopefully short term!) – LocustHorde Jan 19 '15 at 15:04
  • Hi @JonSkeet, I'm trying to use your Regex to detect if emojis are included in a string (pretty much the exact same code). For some reason `\p{Cs}` does not catch all emojis. Do you know anything about this by any chance? I've tried about 30 of them and one or two were not detected. I'm assuming they're not in the range of that regex, but I'd like your expert opinion since I know nothing about surrogates and very little about chars in general. – Gil Sand Oct 24 '17 at 07:43
  • @GilSand: Well, did you look at what Unicode categories those characters are in? It's probably best to ask a new question with a complete example, rather than "one or two of them" (leaving us guessing which). We can then look at what's going on much more easily. – Jon Skeet Oct 24 '17 at 07:49
  • @JonSkeet You're right. Here's a link to the new question for you or future travelers : https://stackoverflow.com/questions/46905176/detecting-all-emojis – Gil Sand Oct 24 '17 at 08:02