Finding Emojis in Strings

Question

So I'm trying to find and replace emojis in strings. This is my approach with regexp so far.

const replaceEmojis = function (string) {
    String.prototype.regexIndexOf = function (regex, startpos) {
        const indexOf = this.substring(startpos || 0).search(regex);
        return (indexOf >= 0) ? (indexOf + (startpos || 0)) : indexOf;
    }
    // generate regexp
    let regexp;
    try {
        regexp = new RegExp('\\p{Emoji}', "gu");
    } catch (e) {
        //4 firefox <3
        regexp = new RegExp(`(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])`, 'g');
    }

    // get indices of all emojis
    function getIndicesOf(searchStr, str) {
        let index, indices = [];

        function getIndex(startIndex) {
            index = str.regexIndexOf(searchStr, startIndex);
            if (index === -1) return;
            indices.push(index);
            getIndex(index + 1)
        }

        getIndex(0);

        return indices;
    }

    const emojisAt = getIndicesOf(regexp, string);

    // replace emojis with SVGs
    emojisAt.forEach(index => {
        // got nothing here yet
        // const unicode = staticHTML.charCodeAt(index); //.toString(16);
    })

    return string;
}

The problem with this is that I only get an array of the indices where emojis occur in the string. With only these indices I can't replace them, because I don't know how many UTF-16 code units each one takes up. To replace them I also need to know which emoji I am replacing.

So, is there a way to also get the length of the emoji? Or is there a better (perhaps simpler) way than mine to replace emojis?
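For reference, the index and the length asked about here can both be read from the match objects that `String.prototype.matchAll` yields. The following is a minimal sketch, assuming an engine that supports `matchAll` and Unicode property escapes; `findEmojiSpans` is a hypothetical name, and the pattern is the one from the question, so each match is a single code point.

```javascript
// Hypothetical helper, not part of the original code.
function findEmojiSpans(text) {
    // Pattern taken from the question; it matches one code point at a time.
    const pattern = /\p{Emoji}/gu;
    return Array.from(text.matchAll(pattern), m => ({
        emoji: m[0],         // the matched text itself
        index: m.index,      // position in UTF-16 code units
        length: m[0].length, // length in UTF-16 code units
    }));
}

console.log(findEmojiSpans('ok \u{1F600} go'));
// one entry: index 3, length 2
```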

What exactly do you consider emoji? There are many thousands of these things. What is your goal? — Brad, Apr 14 '20 at 17:21
@Brad I mean all [Emojis in Unicode](http://unicode.org/emoji/charts/full-emoji-list.html) including skin tones, basically everything you select with the `/\p{Emoji}/gu` expression. I want to replace them with SVGs of emojis. The idea is to locate an emoji, get its Unicode code points (not its char codes) and replace it with an SVG. However, a single emoji can be encoded as several UTF-16 code units, so emojis vary in length, and I need to know how "long" an emoji is to replace it. — drinking-code, Apr 14 '20 at 17:56

drinking-code · Answer 1 · 2020-04-15T20:17:47.423

Alright, so it turns out I just had a bit of a mental block.
To find the emojis I don't need the indices, as WolverinDEV mentioned. However, just using string.replace with /\p{Emoji}/gu doesn't work, because it breaks up zero-width-joiner sequences: a base emoji joined to ♂️ is split into the base emoji, the joiner, and ♂. So I tweaked the regexp to account for that: /[\p{Emoji}\u200d]+/gu. Now the emoji is returned in full because zero-width joiners are included.
This is what I got (if anyone cares):

const replaceEmojis = function (string) {
    // match() returns null when nothing matches, so fall back to an empty array
    const emojis = string.match(/[\p{Emoji}\u200d]+/gu) || [];
    // console.log(emojis);

    // replace emojis with SVGs
    emojis.forEach(emoji => {
        // get the unicodes of the emoji
        let unicode = "";

        function getNextChar(pointer) {
            const subUnicode = emoji.codePointAt(pointer);
            if (!subUnicode) return;
            unicode += '-' + subUnicode.toString(16);
            getNextChar(++pointer);
        }

        getNextChar(0);

        unicode = unicode.slice(1); // remove the leading dash '-'
        console.log(unicode.toUpperCase());

        // replace emoji here
        // string = string.replace(emoji, `<svg src='path/to/svg/${unicode}.svg'>`)
    })

    return string;
}

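This still needs work: for example, the output contains low surrogates, because getNextChar advances one UTF-16 code unit at a time and codePointAt returns the lone low surrogate whenever the pointer lands on the second half of a surrogate pair. Fundamentally, though, this works.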

EDIT:

First improvement:
You may not need this, but to get rid of the low surrogates, add a condition in getNextChar():

if (!(subUnicode >= 56320 && subUnicode <= 57343)) unicode += '-' + subUnicode.toString(16);

This appends the code point only if it isn't a low surrogate.
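Another way to avoid the low surrogates, rather than filtering them out afterwards, is to iterate the string by code point instead of by UTF-16 code unit. A minimal sketch, assuming the dash-separated uppercase format is what the SVG file names need; `toCodePointKey` is a hypothetical name:

```javascript
// Hypothetical helper; the name and the dash-separated format are assumptions
// chosen to mirror the `unicode` string built in the code above.
function toCodePointKey(emoji) {
    // Array.from iterates a string by code point, so a surrogate pair is
    // visited once and no lone low surrogate ends up in the output.
    return Array.from(emoji, ch => ch.codePointAt(0).toString(16).toUpperCase()).join('-');
}

// Thumbs up (U+1F44D) followed by a skin tone modifier (U+1F3FD)
console.log(toCodePointKey('\u{1F44D}\u{1F3FD}')); // 1F44D-1F3FD
```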

Second improvement:
Add the variation selector 16 (U+FE0F) to the regexp to select more emojis en bloc:

/[\p{Emoji}\u200d\ufe0f]+/gu

To avoid matching against numbers in the string, use `\p{Extended_Pictographic}` instead (from https://stackoverflow.com/questions/18862256/how-to-detect-emoji-using-javascript). Also a list of test cases for ZWJ emoji sequences: https://unicode.org/emoji/charts/emoji-zwj-sequences.html. — Sentient, Feb 20 '21 at 00:04
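The difference between the two properties is easy to see on a string that contains digits. A small sketch, assuming an engine that supports both `\p{Emoji}` and `\p{Extended_Pictographic}`; the sample text is made up:

```javascript
// Made-up sample text: digits plus one emoji (U+1F600).
const sample = 'Call 911 \u{1F600}';

// Digits carry the Emoji property, so they match as well.
const withEmoji = sample.match(/\p{Emoji}/gu);

// Extended_Pictographic leaves the digits alone.
const withPictographic = sample.match(/\p{Extended_Pictographic}/gu);

console.log(withEmoji.length);        // 4
console.log(withPictographic.length); // 1
```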

score 0 · Answer 2 · answered Apr 14 '20 at 22:50

Well, you already have a working RegExp, so you could use String.replace:

string.replace(regexp, my_emojy => { 
    return "<an emoji was here>";
});

So there's no need to find any indices at all.
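Since the callback receives the matched text, it can also work out which emoji it is replacing. Here is a rough sketch of that idea, not a finished solution: `replaceEmojisWithImages`, the `<img>` markup and the `path/to/svg/` prefix are placeholders, and the pattern is the one from the self-answer, so it still matches digits and merges directly adjacent emojis into one match.

```javascript
// Hypothetical helper: the function name, the <img> markup and the
// 'path/to/svg/' prefix are placeholders, not a real asset layout.
function replaceEmojisWithImages(text) {
    // Pattern from the self-answer: emoji, zero-width joiner, variation selector 16.
    const pattern = /[\p{Emoji}\u200d\ufe0f]+/gu;
    return text.replace(pattern, match => {
        // Build a dash-separated list of code points, e.g. "1F44D-1F3FD".
        const key = Array.from(match, ch => ch.codePointAt(0).toString(16).toUpperCase()).join('-');
        return `<img src="path/to/svg/${key}.svg" alt="">`;
    });
}

console.log(replaceEmojisWithImages('Hi \u{1F44D}\u{1F3FD}!'));
// Hi <img src="path/to/svg/1F44D-1F3FD.svg" alt="">!
```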

WolverinDEV

Yes! Thank you for your answer! I rethought the whole thing and also saw that `string.match(regexp)` is an option to just get the emojis and then replace them inside the `forEach` with `string.replace`. Although your solution is much simpler, I'll do it separately because unfortunately I have to account for all the identifiers (skin tone, gender, connectors, variation selectors, etc.). – drinking-code Apr 14 '20 at 23:13
