Facebook JSON badly encoded

Question

I downloaded my Facebook Messenger data (in your Facebook account, go to Settings, then Your Facebook Information, then Download Your Information, and create a file with at least the Messages box checked) to do some cool statistics.

However, there is a small problem with the encoding. I'm not sure, but it looks like Facebook used a bad encoding for this data. When I open it in a text editor I see something like this: Rados\u00c5\u0082aw. When I open it with Python (UTF-8) I get RadosÅ\x82aw. However, I should get: Radosław.

My python script:

text = open(os.path.join(subdir, file), encoding='utf-8')
conversations.append(json.load(text))

I tried a few of the most common encodings. Example data:

{
  "sender_name": "Rados\u00c5\u0082aw",
  "timestamp": 1524558089,
  "content": "No to trzeba ostatnie treningi zrobi\u00c4\u0087 xD",
  "type": "Generic"
}

Why do you assume that the data is UTF-8? If you don't know its encoding, have you tried other reasonable possibilities, e.g. Windows-1250 or ISO 8859-2? — Peteris, Apr 24 '18 at 18:21
I tried a few of them. None worked. I found this question asked earlier: https://stackoverflow.com/questions/19161501/reading-json-what-encoding-is-u00c5-u0082-how-do-i-get-it-to-a-unicode-obje however, I have no idea how to make it work for me — Jakub Jendryka, Apr 24 '18 at 18:28
No idea if it helps, but emoji encoding seems to be funky in Facebook's API: https://stackoverflow.com/questions/20045268/how-does-facebook-encode-emoji-in-the-json-graph-api — Patrick Artner, Apr 24 '18 at 18:44
@JakubJendryka: right, I'm not familiar with that system and perhaps there is indeed a mojibake in there; UTF-8 data being decoded as Latin-1 and then encoded as JSON. — Martijn Pieters, Apr 24 '18 at 18:50
@Patrick: that’s pretty much ancient history by now. We no longer use that encoding (and that only applies to Emoji). — Martijn Pieters, Apr 24 '18 at 23:39
This one did it for me: https://stackoverflow.com/a/5396742/2297366 — Dylan Vander Berg, Jan 18 '20 at 04:29
[For those using .NET C# solution](https://stackoverflow.com/a/50803989/396337) — Zyo, Feb 23 '21 at 13:24

Martijn Pieters · Accepted Answer · 2018-04-24T23:29:09.613

I can indeed confirm that the Facebook download data is incorrectly encoded; a Mojibake. The original data is UTF-8 encoded but was decoded as Latin-1 instead. I’ll make sure to file a bug report.
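
To see how such text comes about, here is a minimal sketch that reproduces the damage in Python. It assumes the exporter did exactly what is described above (UTF-8 bytes read as Latin-1, then serialized as JSON); Facebook's real pipeline is not public, so this is only an illustration.

```python
import json

original = 'Radosław'

# Step 1: the correct UTF-8 bytes (ł becomes the two bytes C5 82).
utf8_bytes = original.encode('utf-8')

# Step 2: the assumed mistake, reading those bytes as Latin-1.
mojibake = utf8_bytes.decode('latin-1')

# Step 3: JSON serialization escapes every non-ASCII character.
broken_json = json.dumps(mojibake)
print(broken_json)  # "Rados\u00c5\u0082aw"
```

Each damaged character therefore stands for exactly one byte of the original UTF-8 text, which is why reversing the steps recovers it.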

In the meantime, you can repair the damage in two ways:

Decode the data as JSON, then re-encode any strings as Latin-1, decode again as UTF-8:

>>> import json
>>> data = r'"Rados\u00c5\u0082aw"'
>>> json.loads(data).encode('latin1').decode('utf8')
'Radosław'

Load the data as binary, replace all \u00hh sequences with the byte the last two hex digits represent, decode as UTF-8 and then decode as JSON:

import json
import os
import re
from functools import partial

# Turn each \u00hh escape into the single byte hh.
fix_mojibake_escapes = partial(
     re.compile(rb'\\u00([\da-f]{2})').sub,
     lambda m: bytes.fromhex(m.group(1).decode()))

# subdir and file are the variables from the question's script.
with open(os.path.join(subdir, file), 'rb') as binary_data:
    repaired = fix_mojibake_escapes(binary_data.read())
data = json.loads(repaired.decode('utf8'))

From your sample data this produces:

{'content': 'No to trzeba ostatnie treningi zrobić xD',
 'sender_name': 'Radosław',
 'timestamp': 1524558089,
 'type': 'Generic'}
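
One caveat with the second approach: the pattern also rewrites escapes below \u0080, so an escaped double quote (\u0022) would turn into a raw quote and break the JSON. The sketch below is a hedged variant that only touches escapes for bytes 0x80 to 0xff. The names HIGH_BYTE_ESCAPE and repair_facebook_json are made up for illustration, and the code still assumes the file never contains an escaped backslash directly in front of u00hh.

```python
import json
import re

# Match only \u0080 .. \u00ff; ASCII escapes such as \u0022 stay escaped.
HIGH_BYTE_ESCAPE = re.compile(rb'\\u00([89a-f][0-9a-f])', re.IGNORECASE)


def repair_facebook_json(raw: bytes):
    """Replace high-byte escapes with raw bytes, then parse as UTF-8 JSON."""
    repaired = HIGH_BYTE_ESCAPE.sub(
        lambda m: bytes.fromhex(m.group(1).decode('ascii')), raw)
    return json.loads(repaired.decode('utf-8'))


raw = rb'{"sender_name": "Rados\u00c5\u0082aw", "content": "say \u0022hi\u0022"}'
result = repair_facebook_json(raw)
print(result)  # {'sender_name': 'Radosław', 'content': 'say "hi"'}
```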

@Alper: without a sample of data that causes that error, I don't think I can help there, sorry. — Martijn Pieters, Feb 12 '19 at 20:12

score 10 · Answer 2 · answered Sep 06 '20 at 20:16

Here is a command-line solution with jq and iconv. Tested on Linux.

cat message_1.json | jq . | iconv -f utf8 -t latin1 > m1.json

luksan

Why do you need the `jq .`? It will just pretty-print the original file – jjmerelo Sep 08 '21 at 05:52
@jjmerelo You need it to convert escaped characters to their raw b0rked forms. – che Sep 14 '21 at 22:47

Geekmoss · Answer 3 · 2019-02-26T09:37:04.747

My solution uses the object_hook callback of the json.load/json.loads functions to repair each parsed object:

import json


def parse_obj(dct):
    # Called by json for every decoded object; assumes all values are strings.
    for key in dct:
        dct[key] = dct[key].encode('latin_1').decode('utf-8')
    return dct


# Raw string, so the \u escapes reach the JSON parser unchanged.
data = r'{"msg": "Ahoj sv\u00c4\u009bte"}'

# String
json.loads(data)
# Out: {'msg': 'Ahoj svÄ\x9bte'}
json.loads(data, object_hook=parse_obj)
# Out: {'msg': 'Ahoj světe'}

# File
with open('/path/to/file.json') as f:
    json.load(f, object_hook=parse_obj)
    # Out: {'msg': 'Ahoj světe'}

Update:

The solution above fails when a value is not a string, for example a list of strings. Here is an updated solution:

import json


def parse_obj(obj):
    for key in obj:
        if isinstance(obj[key], str):
            obj[key] = obj[key].encode('latin_1').decode('utf-8')
        elif isinstance(obj[key], list):
            # Repair strings inside lists; leave other items untouched.
            obj[key] = [x.encode('latin_1').decode('utf-8') if isinstance(x, str) else x
                        for x in obj[key]]
    return obj
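
A detail that helps when reasoning about the object_hook approach: the json module calls the hook once for every decoded object, innermost first, including objects that sit inside lists. The small sketch below only records the order of the calls; the record function and the sample document are made up for illustration.

```python
import json

seen = []


def record(dct):
    # Remember the keys of each object the decoder hands to the hook.
    seen.append(sorted(dct))
    return dct


json.loads('{"outer": {"inner": 1}, "items": [{"a": 1}]}', object_hook=record)
print(seen)  # [['inner'], ['a'], ['items', 'outer']]
```

So nested objects are visited without explicit recursion, while strings in nested lists (a list inside a list) are not, because the hook only inspects one level of each object's values.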

Thank you so much! This issue has been driving me insane and your solution worked perfectly — laurencevs, Jul 14 '20 at 02:00

hotigeftas · Answer 4 · 2020-06-03T07:51:52.857

I would like to extend @Geekmoss' answer with the following recursive code snippet, which I used to decode my Facebook data.

import json

def parse_obj(obj):
    if isinstance(obj, str):
        return obj.encode('latin_1').decode('utf-8')

    if isinstance(obj, list):
        return [parse_obj(o) for o in obj]

    if isinstance(obj, dict):
        return {key: parse_obj(item) for key, item in obj.items()}

    return obj

decoded_data = parse_obj(json.loads(file))

I noticed this works better, because the Facebook data you download might contain lists of dicts, in which case those dicts would just be returned as-is by the identity branch of the lambda.
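
If some strings in an export are not mojibake, the Latin-1/UTF-8 round trip raises UnicodeEncodeError or UnicodeDecodeError. The following is a hedged sketch of a tolerant variant of the recursive idea. The names fix_text and fix_obj are hypothetical, and returning such strings unchanged is an assumption about what you want, since it can also hide real problems. Dictionary keys are left untouched.

```python
import json


def fix_text(text):
    """Undo the mojibake, or return the text unchanged if it does not apply."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def fix_obj(obj):
    """Walk strings, lists and dicts recursively and repair every string value."""
    if isinstance(obj, str):
        return fix_text(obj)
    if isinstance(obj, list):
        return [fix_obj(item) for item in obj]
    if isinstance(obj, dict):
        return {key: fix_obj(value) for key, value in obj.items()}
    return obj


sample = json.loads(
    r'{"participants": [{"name": "Rados\u00c5\u0082aw"}],'
    r' "title": "caf\u00e9", "timestamp": 1524558089}')
fixed = fix_obj(sample)
print(fixed)
# {'participants': [{'name': 'Radosław'}], 'title': 'café', 'timestamp': 1524558089}
```

Here "café" is already correct text, so the round trip fails on it and the string is kept as it is.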

Ondrej Sotolar · Answer 5 · 2019-09-10T10:20:02.057

Based on @Martijn Pieters' solution, I wrote something similar in Java.

public String getMessengerJson(Path path) throws IOException {
    String badlyEncoded = Files.readString(path, StandardCharsets.UTF_8);
    String unescaped = unescapeMessenger(badlyEncoded);
    byte[] bytes = unescaped.getBytes(StandardCharsets.ISO_8859_1);
    String fixed = new String(bytes, StandardCharsets.UTF_8);
    return fixed;
}

The unescape method is inspired by the org.apache.commons.lang.StringEscapeUtils.

private String unescapeMessenger(String str) {
    if (str == null) {
        return null;
    }
    try {
        StringWriter writer = new StringWriter(str.length());
        unescapeMessenger(writer, str);
        return writer.toString();
    } catch (IOException ioe) {
        // this should never ever happen while writing to a StringWriter
        throw new UnhandledException(ioe);
    }
}

private void unescapeMessenger(Writer out, String str) throws IOException {
    if (out == null) {
        throw new IllegalArgumentException("The Writer must not be null");
    }
    if (str == null) {
        return;
    }
    int sz = str.length();
    StrBuilder unicode = new StrBuilder(4);
    boolean hadSlash = false;
    boolean inUnicode = false;
    for (int i = 0; i < sz; i++) {
        char ch = str.charAt(i);
        if (inUnicode) {
            unicode.append(ch);
            if (unicode.length() == 4) {
                // unicode now contains the four hex digits
                // which represents our unicode character
                try {
                    int value = Integer.parseInt(unicode.toString(), 16);
                    out.write((char) value);
                    unicode.setLength(0);
                    inUnicode = false;
                    hadSlash = false;
                } catch (NumberFormatException nfe) {
                    throw new NestableRuntimeException("Unable to parse unicode value: " + unicode, nfe);
                }
            }
            continue;
        }
        if (hadSlash) {
            hadSlash = false;
            if (ch == 'u') {
                inUnicode = true;
            } else {
                out.write("\\");
                out.write(ch);
            }
            continue;
        } else if (ch == '\\') {
            hadSlash = true;
            continue;
        }
        out.write(ch);
    }
    if (hadSlash) {
        // then we're in the weird case of a \ at the end of the
        // string, let's output it anyway.
        out.write('\\');
    }
}

So I've spent some time trying out your Java solution, only needing to debug and learning that in the larger unescapeMessenger routine, at the top of the for loop, you have an *if (inUnicode)*, which you set to false right before the loop starts ... so nothing gets processed ... what's up with that? — Michael Sims, Apr 30 '20 at 18:14
But the for loop block doesn't end with the first conditional block. The inUnicode variable is set to true in the second conditional block if we are on the 'u' character of the '\u' prefix. — Ondrej Sotolar, May 06 '20 at 10:16
Well, it never worked for me, I parsed out the string a different way that was crude, but effective. — Michael Sims, May 13 '20 at 16:15

score 1 · Answer 6 · answered Feb 15 '21 at 11:55

Facebook programmers seem to have mixed up the concepts of Unicode encoding and escape sequences, probably while implementing their own ad-hoc serializer. Further details in Invalid Unicode encodings in Facebook data exports.

Try this:

import json
import io

class FacebookIO(io.FileIO):
    def read(self, size: int = -1) -> bytes:
        data: bytes = super(FacebookIO, self).readall()
        new_data: bytes = b''
        i: int = 0
        while i < len(data):
            # \u00c4\u0085
            # 0123456789ab
            if data[i:].startswith(b'\\u00'):
                u: int = 0
                new_char: bytes = b''
                while data[i+u:].startswith(b'\\u00'):
                    hex = int(bytes([data[i+u+4], data[i+u+5]]), 16)
                    new_char = b''.join([new_char, bytes([hex])])
                    u += 6

                char : str = new_char.decode('utf-8')
                new_chars: bytes = bytes(json.dumps(char).strip('"'), 'ascii')
                new_data += new_chars
                i += u
            else:
                new_data = b''.join([new_data, bytes([data[i]])])
                i += 1

        return new_data

if __name__ == '__main__':
    f = FacebookIO('data.json','rb')
    d = json.load(f)
    print(d)

score 0 · Answer 7 · answered Sep 23 '21 at 18:21

This is @Geekmoss' answer, but adapted for Python 3:

import json
from pathlib import Path


def parse_facebook_json(json_file_path):
    def parse_obj(obj):
        for key in obj:
            if isinstance(obj[key], str):
                obj[key] = obj[key].encode('latin_1').decode('utf-8')
            elif isinstance(obj[key], list):
                obj[key] = list(map(lambda x: x if type(x) != str else x.encode('latin_1').decode('utf-8'), obj[key]))
        return obj
    # json.load accepts a binary file and detects the encoding itself.
    with json_file_path.open('rb') as json_file:
        return json.load(json_file, object_hook=parse_obj)

# Usage
parse_facebook_json(Path("/.../message_1.json"))

score 0 · Answer 8 · answered Nov 11 '21 at 09:59

This extends Martijn's solution #1, which I can see leading towards recursive object processing (it certainly led me there initially):

You can apply the fix to the whole serialized JSON string instead, provided you pass ensure_ascii=False:

json.dumps(obj, ensure_ascii=False, indent=2).encode('latin-1').decode('utf-8')

Then write the result to a file or process it further.
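
As a minimal end-to-end sketch of this whole-document approach (the sample object is made up for illustration), the repaired text can be parsed again to check the result:

```python
import json

# Parsed export data that still contains mojibake.
obj = json.loads(r'{"msg": "Ahoj sv\u00c4\u009bte", "timestamp": 1524558089}')

# ensure_ascii=False keeps the damaged characters as real characters,
# so one Latin-1/UTF-8 round trip can repair the whole document.
text = json.dumps(obj, ensure_ascii=False, indent=2)
repaired_text = text.encode('latin-1').decode('utf-8')

repaired = json.loads(repaired_text)
print(repaired)  # {'msg': 'Ahoj světe', 'timestamp': 1524558089}
```

Quotes and backslashes are ASCII and stay escaped through the round trip, so the repaired text is still valid JSON. Like solution #1, this raises an error if any string in the document is not mojibake.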

PS: This should be comment on @Martijn answer: https://stackoverflow.com/a/50011987/1309932 (but I can't add comments)

score 0 · Answer 9 · answered Nov 27 '21 at 09:09

This is my approach for Node 17.0.1, based on @hotigeftas recursive code, using the iconv-lite package.

import fs from 'fs';
import iconv from 'iconv-lite';

function parseObject(object) {
  if (typeof object == 'string') {
    return iconv.decode(iconv.encode(object, 'latin1'), 'utf8');
  }

  if (typeof object == 'object') {
    for (let key in object) {
      object[key] = parseObject(object[key]);
    }
    return object;
  }

  return object;
}

//usage
let file = JSON.parse(fs.readFileSync(fileName));
file = parseObject(file);

Your answer could be improved by adding more information on what the code does and how it helps the OP. — Tyler2P, Nov 27 '21 at 10:41
