48

I downloaded my Facebook Messenger data (in your Facebook account, go to Settings, then Your Facebook information, then Download your information, and create a file with at least the Messages box checked) to do some cool statistics.

However, there is a small problem with the encoding. I'm not sure, but it looks like Facebook used a bad encoding for this data. When I open it in a text editor I see something like this: Rados\u00c5\u0082aw. When I open it with Python (as UTF-8) I get RadosÅ\x82aw. However, I should get: Radosław.

My python script:

with open(os.path.join(subdir, file), encoding='utf-8') as text:
    conversations.append(json.load(text))

I tried a few of the most common encodings. Example data:

{
  "sender_name": "Rados\u00c5\u0082aw",
  "timestamp": 1524558089,
  "content": "No to trzeba ostatnie treningi zrobi\u00c4\u0087 xD",
  "type": "Generic"
}
Martijn Pieters
Jakub Jendryka
  • Why do you assume that the data is UTF-8 ? If you don't know its encoding, have you tried other reasonable possibilities e.g. windows 1250 or ISO 8859-2? – Peteris Apr 24 '18 at 18:21
  • I tried a few of them. None worked. I came across this question asked earlier: https://stackoverflow.com/questions/19161501/reading-json-what-encoding-is-u00c5-u0082-how-do-i-get-it-to-a-unicode-obje however I have no idea how to make it work for me – Jakub Jendryka Apr 24 '18 at 18:28
  • No idea if it helps, but emoji encoding seems to be funky in Facebook's API: https://stackoverflow.com/questions/20045268/how-does-facebook-encode-emoji-in-the-json-graph-api – Patrick Artner Apr 24 '18 at 18:44
  • @JakubJendryka: right, I'm not familiar with that system and perhaps there is indeed a mojibake in there; UTF-8 data being decoded as Latin-1 and then encoded as JSON. – Martijn Pieters Apr 24 '18 at 18:50
  • @Patrick: that’s pretty much ancient history by now. We no longer use that encoding (and that only applies to Emoji). – Martijn Pieters Apr 24 '18 at 23:39
  • This one did it for me: https://stackoverflow.com/a/5396742/2297366 – Dylan Vander Berg Jan 18 '20 at 04:29
  • [For those using .NET C# solution](https://stackoverflow.com/a/50803989/396337) – Zyo Feb 23 '21 at 13:24

9 Answers

60

I can indeed confirm that the Facebook download data is incorrectly encoded; a Mojibake. The original data is UTF-8 encoded but was decoded as Latin-1 instead. I’ll make sure to file a bug report.
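
To make the failure mode concrete, here is a minimal sketch that reproduces the damage in Python. It assumes the export took the original UTF-8 bytes, misread them as Latin-1, and then JSON-escaped the result; that is inferred from the observed output, not from anything Facebook documents. The function name `make_mojibake` is hypothetical.

```python
import json

def make_mojibake(text: str) -> str:
    """Simulate the assumed bug: UTF-8 bytes misread as Latin-1, then JSON-escaped."""
    misdecoded = text.encode('utf-8').decode('latin-1')
    # json.dumps defaults to ensure_ascii=True, so every non-ASCII
    # character is written as a \uXXXX escape.
    return json.dumps(misdecoded)

broken = make_mojibake('Radosław')
print(broken)  # "Rados\u00c5\u0082aw"
```

The single character ł is two bytes in UTF-8 (0xC5 0x82), which is why it shows up as the two escapes \u00c5 and \u0082 in the export.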

In the meantime, you can repair the damage in two ways:

  1. Decode the data as JSON, then re-encode any strings as Latin-1, decode again as UTF-8:

    >>> import json
    >>> data = r'"Rados\u00c5\u0082aw"'
    >>> json.loads(data).encode('latin1').decode('utf8')
    'Radosław'
    
  2. Load the data as binary, replace all \u00hh sequences with the byte the last two hex digits represent, decode as UTF-8 and then decode as JSON:

    import json
    import os
    import re
    from functools import partial
    
    fix_mojibake_escapes = partial(
         re.compile(rb'\\u00([\da-f]{2})').sub,
         lambda m: bytes.fromhex(m.group(1).decode()))
    
    with open(os.path.join(subdir, file), 'rb') as binary_data:
        repaired = fix_mojibake_escapes(binary_data.read())
    data = json.loads(repaired.decode('utf8'))
    

    From your sample data this produces:

    {'content': 'No to trzeba ostatnie treningi zrobić xD',
     'sender_name': 'Radosław',
     'timestamp': 1524558089,
     'type': 'Generic'}
    
Martijn Pieters
10

Here is a command-line solution with jq and iconv. Tested on Linux.

cat message_1.json | jq . | iconv -f utf8 -t latin1 > m1.json

luksan
6

My solution for parsing objects uses the object_hook callback of the load/loads functions:

import json


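def parse_obj(dct):
    # Called for every decoded JSON object; assumes all values are strings.
    for key in dct:
        dct[key] = dct[key].encode('latin_1').decode('utf-8')
    return dct


# Raw string, so the \u escapes reach the JSON parser untouched.
data = r'{"msg": "Ahoj sv\u00c4\u009bte"}'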

# String
json.loads(data)  
# Out: {'msg': 'Ahoj svÄ\x9bte'}
json.loads(data, object_hook=parse_obj)  
# Out: {'msg': 'Ahoj světe'}

# File
with open('/path/to/file.json', encoding='utf-8') as f:
    json.load(f, object_hook=parse_obj)
    # Out: {'msg': 'Ahoj světe'}

Update:

The solution above does not work for lists of strings (or for any non-string value), so here is an updated version:

import json


def parse_obj(obj):
    for key in obj:
        if isinstance(obj[key], str):
            obj[key] = obj[key].encode('latin_1').decode('utf-8')
        elif isinstance(obj[key], list):
            obj[key] = list(map(lambda x: x if type(x) != str else x.encode('latin_1').decode('utf-8'), obj[key]))
    return obj
Geekmoss
6

I would like to extend @Geekmoss' answer with the following recursive code snippet, which I used to decode my Facebook data.

import json

def parse_obj(obj):
    if isinstance(obj, str):
        return obj.encode('latin_1').decode('utf-8')

    if isinstance(obj, list):
        return [parse_obj(o) for o in obj]

    if isinstance(obj, dict):
        return {key: parse_obj(item) for key, item in obj.items()}

    return obj

# `file` must hold the JSON text itself (str or bytes), not a file path.
decoded_data = parse_obj(json.loads(file))

I noticed this works better because the Facebook data you download might contain lists of dicts, in which case those dicts would just be returned as-is by the lambda identity function.
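
One gap in the recursive approach is that `.encode('latin_1')` raises as soon as a string contains a character outside Latin-1, for example if part of the data was already correct. The following is a sketch of a more tolerant variant, not the answer's own code: the helper names `fix_text` and `repair` are hypothetical, and treating dict keys as possibly damaged is an assumption.

```python
import json

def fix_text(value: str) -> str:
    """Undo the Latin-1/UTF-8 mix-up; return the input unchanged if it does not apply."""
    try:
        return value.encode('latin-1').decode('utf-8')
    except UnicodeError:
        # Raised when the string has characters outside Latin-1, or when the
        # resulting bytes are not valid UTF-8, so it was probably not mojibake.
        return value

def repair(obj):
    """Recursively repair strings in dicts (keys and values) and lists."""
    if isinstance(obj, str):
        return fix_text(obj)
    if isinstance(obj, list):
        return [repair(item) for item in obj]
    if isinstance(obj, dict):
        return {fix_text(key): repair(value) for key, value in obj.items()}
    return obj

sample = json.loads(
    r'{"sender_name": "Rados\u00c5\u0082aw", "timestamp": 1524558089, '
    r'"tags": ["zrobi\u00c4\u0087", "already fixed: ł"]}'
)
print(repair(sample))
# {'sender_name': 'Radosław', 'timestamp': 1524558089, 'tags': ['zrobić', 'already fixed: ł']}
```

The fallback is a trade-off: a string that fails the round trip is left untouched rather than aborting the whole load, so it is worth spot-checking the output for leftovers.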

hotigeftas
1

Based on @Martijn Pieters' solution, I wrote something similar in Java.

public String getMessengerJson(Path path) throws IOException {
    String badlyEncoded = Files.readString(path, StandardCharsets.UTF_8);
    String unescaped = unescapeMessenger(badlyEncoded);
    byte[] bytes = unescaped.getBytes(StandardCharsets.ISO_8859_1);
    String fixed = new String(bytes, StandardCharsets.UTF_8);
    return fixed;
}

The unescape method is inspired by the org.apache.commons.lang.StringEscapeUtils.

private String unescapeMessenger(String str) {
    if (str == null) {
        return null;
    }
    try {
        StringWriter writer = new StringWriter(str.length());
        unescapeMessenger(writer, str);
        return writer.toString();
    } catch (IOException ioe) {
        // this should never ever happen while writing to a StringWriter
        throw new UnhandledException(ioe);
    }
}

private void unescapeMessenger(Writer out, String str) throws IOException {
    if (out == null) {
        throw new IllegalArgumentException("The Writer must not be null");
    }
    if (str == null) {
        return;
    }
    int sz = str.length();
    StrBuilder unicode = new StrBuilder(4);
    boolean hadSlash = false;
    boolean inUnicode = false;
    for (int i = 0; i < sz; i++) {
        char ch = str.charAt(i);
        if (inUnicode) {
            unicode.append(ch);
            if (unicode.length() == 4) {
                // unicode now contains the four hex digits
                // which represents our unicode character
                try {
                    int value = Integer.parseInt(unicode.toString(), 16);
                    out.write((char) value);
                    unicode.setLength(0);
                    inUnicode = false;
                    hadSlash = false;
                } catch (NumberFormatException nfe) {
                    throw new NestableRuntimeException("Unable to parse unicode value: " + unicode, nfe);
                }
            }
            continue;
        }
        if (hadSlash) {
            hadSlash = false;
            if (ch == 'u') {
                inUnicode = true;
            } else {
                out.write("\\");
                out.write(ch);
            }
            continue;
        } else if (ch == '\\') {
            hadSlash = true;
            continue;
        }
        out.write(ch);
    }
    if (hadSlash) {
        // then we're in the weird case of a \ at the end of the
        // string, let's output it anyway.
        out.write('\\');
    }
}
Ondrej Sotolar
  • So I've spent some time trying out your Java solution, only needing to debug and learning that in the larger unescapeMessenger routine, at the top of the for loop, you have an *if (inUnicode)*, which you set to false right before the loop starts ... so nothing gets processed ... what's up with that? – Michael Sims Apr 30 '20 at 18:14
  • But the for loop block doesn't end with the first conditional block. The inUnicode variable is set to true in the second conditional block if we are on the 'u' character of the '\u' prefix. – Ondrej Sotolar May 06 '20 at 10:16
  • Well, it never worked for me, I parsed out the string a different way that was crude, but effective. – Michael Sims May 13 '20 at 16:15
1

Facebook programmers seem to have mixed up the concepts of Unicode encoding and escape sequences, probably while implementing their own ad-hoc serializer. Further details are in "Invalid Unicode encodings in Facebook data exports".

Try this:

import json
import io

class FacebookIO(io.FileIO):
    def read(self, size: int = -1) -> bytes:
        # Always reads the whole file; the size argument is ignored.
        data: bytes = super().readall()
        new_data = bytearray()
        i: int = 0
        while i < len(data):
            # \u00c4\u0085
            # 0123456789ab
            if data.startswith(b'\\u00', i):
                u: int = 0
                new_char = bytearray()
                # Collect a run of consecutive \u00hh escapes as raw bytes.
                while data.startswith(b'\\u00', i + u):
                    new_char.append(int(data[i + u + 4:i + u + 6], 16))
                    u += 6

                char: str = new_char.decode('utf-8')
                # Re-escape as valid JSON and drop only the surrounding quotes.
                new_data += json.dumps(char)[1:-1].encode('ascii')
                i += u
            else:
                new_data.append(data[i])
                i += 1

        return bytes(new_data)

if __name__ == '__main__':
    with FacebookIO('data.json', 'rb') as f:
        d = json.load(f)
    print(d)
kravietz
0

This is @Geekmoss' answer, but adapted for Python 3:

import json
from pathlib import Path


def parse_facebook_json(json_file_path):
    def parse_obj(obj):
        for key in obj:
            if isinstance(obj[key], str):
                obj[key] = obj[key].encode('latin_1').decode('utf-8')
            elif isinstance(obj[key], list):
                obj[key] = list(map(lambda x: x if type(x) != str else x.encode('latin_1').decode('utf-8'), obj[key]))
        return obj
    with json_file_path.open('rb') as json_file:
        return json.load(json_file, object_hook=parse_obj)

# Usage
parse_facebook_json(Path("/.../message_1.json"))
nicbou
0

This extends Martijn's solution #1, which, as I see it, tends to lead towards recursive object processing (it certainly led me there at first):

You can instead apply the fix to the whole serialized JSON string, as long as you pass ensure_ascii=False:

json.dumps(obj, ensure_ascii=False, indent=2).encode('latin-1').decode('utf-8')

then write it to file or something.
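
As a sketch of what "write it to file" could look like end to end: the function name `repair_export` and both file paths are hypothetical, and the approach assumes every string in the export is damaged the same way, so that all code points are at most U+00FF.

```python
import json
from pathlib import Path

def repair_export(src: Path, dst: Path) -> None:
    """Parse a damaged export and write a repaired UTF-8 JSON file."""
    parsed = json.loads(src.read_text(encoding='utf-8'))
    # ensure_ascii=False keeps the mojibake characters as literal code points
    # instead of turning them back into \uXXXX escapes.
    text = json.dumps(parsed, ensure_ascii=False, indent=2)
    # Latin-1 maps each code point <= U+00FF back to the original byte,
    # so the encoded bytes are the real UTF-8 document.
    dst.write_bytes(text.encode('latin-1'))

# Usage (paths are placeholders):
# repair_export(Path('message_1.json'), Path('message_1.fixed.json'))
```

If any string contains a character above U+00FF, `encode('latin-1')` raises UnicodeEncodeError for the whole document, so this shortcut is all-or-nothing compared with the per-string approaches.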

PS: This should be a comment on @Martijn's answer: https://stackoverflow.com/a/50011987/1309932 (but I can't add comments)

danbaragan
0

This is my approach for Node 17.0.1, based on @hotigeftas' recursive code, using the iconv-lite package.

import fs from 'fs';
import iconv from 'iconv-lite';

function parseObject(object) {
  if (typeof object == 'string') {
    return iconv.decode(iconv.encode(object, 'latin1'), 'utf8');
  }

  if (typeof object == 'object') {
    for (let key in object) {
      object[key] = parseObject(object[key]);
    }
    return object;
  }

  return object;
}

//usage
let file = JSON.parse(fs.readFileSync(fileName));
file = parseObject(file);
  • Your answer could be improved by adding more information on what the code does and how it helps the OP. – Tyler2P Nov 27 '21 at 10:41