
I am writing a Python object with json.dump.

But I only want to write objects that would not exceed a 10 KB file size.

How can I estimate the size of an object before writing it?


3 Answers


Here's my take on it.

We start with the following sample (taken from here, with a little extra added to make it interesting):

sample_orig = """{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    },
    "a little extra" : "∫ßåøπœ®†"
}"""

Next, we define a test function to perform the encoding and output the size:

import json
from io import FileIO

def encode_sample(sample: str):
    # Encode the sample in each encoding, write it to a file, and report the size.
    for encoding in ('ascii', 'utf8', 'utf16'):
        filename = f'{encoding}.json'
        encoded_sample = sample.encode(encoding=encoding, errors='replace')
        with FileIO(filename, mode='wb') as f:
            f.write(encoded_sample)
            # f.tell() is the number of bytes written so far.
            assert len(encoded_sample) == f.tell()
            print(f'{encoding}: {f.tell()} bytes')

The assert demonstrates that the size reported by len() matches the number of bytes written to the file, as long as we measure the encoded bytes (not the str). If they differed, the call would raise an AssertionError.

We'll encode the original sample first:

encode_sample(sample_orig)

Output:

ascii: 617 bytes
utf8: 627 bytes
utf16: 1236 bytes

Next, we run it through json.loads() and json.dumps() to "optimise" the size (i.e. remove unnecessary whitespace):

sample_reduced = json.dumps(json.loads(sample_orig))
encode_sample(sample_reduced)

Output:

ascii: 455 bytes
utf8: 455 bytes
utf16: 912 bytes
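
If you need it even smaller, json.dumps() also accepts a separators argument; passing (',', ':') drops the space that the default separators add after each comma and colon:

sample_compact = json.dumps(json.loads(sample_orig), separators=(',', ':'))
encode_sample(sample_compact)

Each comma and colon then costs one byte instead of two in the ascii/utf8 output.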

Remarks:

  • The OP asked "[…] writing a python object with json.dump", so the "optimisation" by removing whitespace doesn't really matter, but I left it in as it might benefit others.

  • Encoding matters. ascii and utf8 (the default) will result in the same file size if the output only contains ASCII characters. Because I added a little extra non-ASCII at the end of the JSON, the file sizes for those two encodings differ. And utf16 will, of course, be the largest of the three.

  • As stated before, you can use len() to get the size of the object, as long as you encode it first (see the sketch below).
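
To tie this back to the 10 KB limit in the question, here is a minimal sketch; the helper name, the MAX_BYTES constant, and the utf8 default are my own choices, so adjust them to your setup:

import json

MAX_BYTES = 10 * 1024  # the 10 KB limit from the question

def write_if_small_enough(obj, filename, encoding='utf8'):
    # Serialise first, encode, and check the byte count before touching disk.
    encoded = json.dumps(obj).encode(encoding)
    if len(encoded) > MAX_BYTES:
        return False
    with open(filename, 'wb') as f:
        f.write(encoded)
    return True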

DocZerø

Convert the JSON to a string, then use sys.getsizeof(). It returns the in-memory size of the object in bytes, so you can divide by 1024 if you want to compare it to a threshold value in kilobytes.

sys.getsizeof(json.dumps(obj))

Sample usage:

import json
import sys
x = '{"name":"John", "age":30, "car":null}'
y = json.loads(x)
print(sys.getsizeof(json.dumps(y))) # 89

Edit:
As mentioned in this thread, Python string objects carry extra overhead in memory (sys.getsizeof("") is 49 bytes on CPython 3). So subtract the size of an empty string to get a better estimate.

print(sys.getsizeof(json.dumps(y)) - sys.getsizeof(""))
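
To see what that subtraction corrects for, compare sys.getsizeof() with len() on the same string (a quick illustration; the exact overhead varies between CPython versions):

import json
import sys

s = json.dumps({"name": "John", "age": 30, "car": None})
print(sys.getsizeof(s))       # in-memory size, including the str object header
print(len(s))                 # number of characters in the JSON text
print(len(s.encode('utf8')))  # number of bytes the file would actually contain

For estimating file size, the encoded length (the last number) is the relevant one.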
Abhinav Mathur

After testing it, len(json.dumps(obj)) is the answer.

I used the code below on an existing JSON file, which is 69961 bytes in size according to both my file browser and the code.

import json
import os

# Measure the serialised string for an existing JSON file...
with open("somejson.json") as infile:
    data = json.load(infile)
print(len(json.dumps(data)))

# ...then write it back out and measure the resulting file.
with open("somejson.json", "w") as outfile:
    json.dump(data, outfile)
print(os.path.getsize("somejson.json"))

Output:

69961
69961
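
One caveat: len() counts characters, not bytes, so this matches the file size only while every character encodes to a single byte. With the default ensure_ascii=True that holds, since json.dumps() escapes non-ASCII characters; with ensure_ascii=False, encode first. A minimal illustration:

import json

data = {"a little extra": "∫ßåøπœ®†"}
text = json.dumps(data, ensure_ascii=False)
print(len(text))                 # characters in the JSON text
print(len(text.encode('utf8')))  # bytes a utf8 file would actually contain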
Edo Akse
  • @martineau Also, comparing the size before and after does not make sense, because json.load() and json.dump() can change the formatting of the actual JSON representation. – rasjani Apr 05 '22 at 09:15
  • @rasjani, I'm not following that part, unless you add `indent=x` to the `json.dump()`. Could you please explain? – Edo Akse Apr 05 '22 at 09:22
  • The comparison is just to show that `len(json.dumps(data))` gives the exact same result as the final file size, as the OP states that he uses `json.dump()` in his code... – Edo Akse Apr 05 '22 at 09:26
  • @EdoAkse Your example shows "somejson.json". It could have indentation *before* you load the file, and it might also use Windows-style line endings; both of these can be stripped during a deserialise/serialise cycle, resulting in *different* pre/post states for `somejson.json`. The comment I left was not about `len(json.dumps())` being incorrect, just that comparing the file size before and after like that can be problematic. – rasjani Apr 05 '22 at 10:25
  • I'm just using `somejson.json` as sample input. Indentation doesn't matter, as I am not looking at the file before loading it with `json.load()`. I am comparing the size of the `data` object to the resulting file size. I admit, I should have just loaded a string, as both Abhinav and DocZerø did, to avoid confusion. – Edo Akse Apr 05 '22 at 10:40
  • Downvoted for opening the file in text mode with the system default encoding, writing it with the default encoding, and assuming all characters are 1 byte. While this may work for this specific example, it will not always work, and it is a bad practice that will cause problems for someone sometime. – wovano Apr 27 '22 at 18:16