check if javascript string is valid UTF-8

Question

A user can paste text into an HTML textarea, and the pasted text sometimes contains invalid UTF-8 characters, for example when it is copied from an RTF file that contains tabs.

How can I check if a string is valid UTF-8?

This may help you: http://stackoverflow.com/questions/20639052/check-if-the-bytes-sequence-is-valid-utf-8-sequence-in-javascript — Hadi J, Mar 30 '16 at 17:02
This looks similar to [Validating user's UTF-8 name in Javascript](http://stackoverflow.com/questions/6381752/validating-users-utf-8-name-in-javascript) — Abhijit, Mar 30 '16 at 17:03

score 6 · Answer 1 · edited May 23 '17 at 10:32

I think you misunderstand what "UTF-8 characters" means. UTF-8 is an encoding of Unicode, which can represent pretty much every character and glyph that has ever existed in recorded human history, so to that extent there are no "invalid" UTF-8 characters.

RTF is a formatting system which works independently of the underlying encoding system - you can use RTF with ASCII, UTF-8, UTF-16 and others. Textboxes in HTML only respect plain text, so any RTF formatting will be automatically stripped (unless you're using a "rich-edit" component, which I assume you're not).

But the things you describe, like whitespace characters (such as tabs: \t), are represented in Unicode (and so in UTF-8). A string containing those characters is still "valid UTF-8"; it's just invalid as far as your business requirements are concerned.

I suggest just stripping out unwanted characters using a regular expression that matches non-printable characters (from here: Match non printable/non ascii characters and remove from text):

textBoxContent = textBoxContent.replace(/[^\x20-\x7E]+/g, '');

The expression [^\x20-\x7E] matches any character NOT in the codepoint range 0x20 (32, a normal space character ' ') to 0x7E (126, the tilde '~' character); every matching character is removed.
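As a rough illustration of what that expression does in practice, the sketch below runs it on a sample string and compares it with a narrower pattern that removes only control characters. The function names `stripNonPrintableAscii` and `stripControlChars` are hypothetical, and the `\p{Cc}` variant is an alternative not taken from the answer; it assumes an engine with Unicode property escapes (ES2018).

```javascript
// Hypothetical helper: the answer's expression, which keeps only printable ASCII.
function stripNonPrintableAscii(text) {
  return text.replace(/[^\x20-\x7E]+/g, '');
}

// Hypothetical alternative: remove only control characters (tabs, CR, LF, ...)
// and leave non-ASCII letters alone. Requires the `u` flag.
function stripControlChars(text) {
  return text.replace(/\p{Cc}+/gu, '');
}

const sample = 'na\u00EFve\tcaf\u00E9\r\n'; // "naïve<TAB>café<CR><LF>"

console.log(stripNonPrintableAscii(sample)); // "navecaf"
console.log(stripControlChars(sample));      // "naïvecafé"
```

Note that the ASCII-only expression also drops accented letters such as "ï" and "é", which may or may not be what the business requirements call for.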

Unicode's first 128 codepoints (0 to 127) are identical to ASCII and can be seen here: http://www.asciitable.com/

To correct some misconceptions in this answer, too: there is no such thing as UTF8 "characters"; as an encoding scheme there are "UTF8 byte sequences", encoding Unicode code points, and these byte sequences can *absolutely* suffer from illegal values in the byte sequence. Similarly, Unicode as the formal mapping of "orthographic constructs" to numerical codes *also* has certain numbers that may not be used. Encountering a UTF8 byte stream with an illegal byte sequence, or a decoded Unicode sequence containing illegal numbers, is entirely possible, so: yes, there are "invalid UTF-8 characters". — Mike 'Pomax' Kamermans, Apr 14 '16 at 00:47
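The two kinds of invalidity described in this comment can be checked separately. The following is a minimal sketch, not a definitive implementation: `isValidUtf8Bytes` and `hasLoneSurrogate` are hypothetical names, the first assumes the bytes are available as a `Uint8Array` and that the environment provides the standard `TextDecoder`, and the second only looks for unpaired surrogates in a JavaScript (UTF-16) string.

```javascript
// Hypothetical helper: true if the bytes form a well-formed UTF-8 sequence.
// With { fatal: true }, TextDecoder throws instead of substituting U+FFFD.
function isValidUtf8Bytes(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch (e) {
    return false;
  }
}

// Hypothetical helper: true if the string contains a surrogate code unit
// that is not part of a proper high/low pair.
function hasLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i);
    if (code >= 0xD800 && code <= 0xDBFF) {
      // High surrogate: must be followed by a low surrogate.
      const next = str.charCodeAt(i + 1); // NaN at end of string
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low half of the valid pair
        continue;
      }
      return true;
    }
    if (code >= 0xDC00 && code <= 0xDFFF) {
      return true; // low surrogate without a preceding high surrogate
    }
  }
  return false;
}

console.log(isValidUtf8Bytes(new Uint8Array([0xC3, 0xA9]))); // true  (bytes of "é")
console.log(isValidUtf8Bytes(new Uint8Array([0xC3, 0x28]))); // false (bad continuation byte)
console.log(hasLoneSurrogate('ok \uD83D\uDE00'));            // false (valid pair)
console.log(hasLoneSurrogate('bad \uD83D'));                 // true
```

A string pasted into a textarea has already been decoded by the browser, so in that situation the byte-level check only applies if the raw bytes are obtained some other way.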

score 2 · Answer 2 · answered Jan 04 '18 at 12:33

Just an idea:

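function checkUTF8(text) {
    var utf8Text = text;
    try {
        // Treat each character as one byte and try to decode those bytes as UTF-8.
        // Note: escape() is deprecated.
        utf8Text = decodeURIComponent(escape(text));
        // If the conversion succeeds, text held raw UTF-8 bytes (or plain ASCII)
        // and utf8Text is now the decoded string.
    } catch (e) {
        // console.log(e.message); // URI malformed
        // This exception means text was not a raw UTF-8 byte sequence,
        // so it is returned unchanged.
    }
    return utf8Text;
}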

Daniel Rodriguez

`escape` is deprecated and should not be used (because it can't handle Unicode properly) – Quentin Jan 04 '18 at 12:37
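One possible way to get similar behaviour without `escape` is sketched below. It is an assumption-laden rewrite, not the answerer's code: the name `decodeUtf8BinaryString` is hypothetical, it assumes the standard `TextDecoder` is available, and it treats the input as a "binary string" in which each character stands for one byte.

```javascript
// Hypothetical helper: if `text` is a string whose characters are raw UTF-8
// bytes (each code unit <= 0xFF), return the decoded string; otherwise
// return `text` unchanged.
function decodeUtf8BinaryString(text) {
  const bytes = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code > 0xFF) {
      return text; // not a byte string, so nothing to decode
    }
    bytes[i] = code;
  }
  try {
    // fatal: throw on malformed input; ignoreBOM: keep a leading U+FEFF
    // in the output instead of silently dropping it.
    return new TextDecoder('utf-8', { fatal: true, ignoreBOM: true }).decode(bytes);
  } catch (e) {
    return text; // bytes were not valid UTF-8
  }
}

console.log(decodeUtf8BinaryString('\u00C3\u00A9')); // "é" (bytes C3 A9 decoded)
console.log(decodeUtf8BinaryString('hello'));        // "hello"
console.log(decodeUtf8BinaryString('\u00E9'));       // "é" (lone byte E9 is invalid, returned unchanged)
```

Like the original function, this cannot tell a genuinely intended string such as "Ã©" apart from mis-decoded bytes, so it is best applied only where byte strings are actually expected.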
