Processing large JSON files in Python without running out of memory
If you need to process a large JSON file in Python, it’s very easy to run out of memory. Even if the raw data fits in memory, the Python representation can increase memory usage even more.
And that means either slow processing, as your program swaps to disk, or crashing when you run out of memory.
One common solution is streaming parsing, aka lazy parsing, iterative parsing, or chunked processing. Let’s see how you can apply this technique to JSON processing.
The problem: Python’s memory-inefficient JSON loading
For illustrative purposes, we’ll be using this JSON file, large enough at 24MB that it has a noticeable memory impact when loaded. It encodes a list of JSON objects (i.e. dictionaries), which look to be GitHub events, users doing things to repositories:
[{"id":"2489651045","type":"CreateEvent","actor":{"id":665991,"login":"petroav","gravatar_id":"","url":"https://api.github.com/users/petroav","avatar_url":"https://avatars.githubusercontent.com/u/665991?"},"repo":{"id":28688495,"name":"petroav/6.828","url":"https://api.github.com/repos/petroav/6.828"},"payload":{"ref":"master","ref_type":"branch","master_branch":"master","description":"Solution to homework and assignments from MIT's 6.828 (Operating Systems Engineering). Done in my spare time.","pusher_type":"user"},"public":true,"created_at":"2015-01-01T15:00:00Z"},
...
]
Our goal is to figure out which repositories a given user interacted with. Here’s a simple Python program that does so:
import json

with open("large-file.json", "r") as f:
    data = json.load(f)

user_to_repos = {}
for record in data:
    user = record["actor"]["login"]
    repo = record["repo"]["name"]
    if user not in user_to_repos:
        user_to_repos[user] = set()
    user_to_repos[user].add(repo)
The result is a dictionary mapping usernames to sets of repository names.
When we run this with the Fil memory profiler and look at peak memory usage, we see two main sources of allocation:
- Reading the file.
- Decoding the resulting bytes into Unicode strings.
And if we look at the implementation of the json module in Python, we can see that json.load() just loads the whole file into memory before parsing!
def load(fp, *, cls=None, object_hook=None, parse_float=None,
        parse_int=None, parse_constant=None, object_pairs_hook=None, **kw):
    """Deserialize ``fp`` (a ``.read()``-supporting file-like object containing
    a JSON document) to a Python object.
    ...
    """
    return loads(fp.read(), ...)
So that’s one problem: just loading the file will take a lot of memory. In addition, there should be some memory usage from creating the Python objects. However, in this case those allocations don’t show up at all, probably because peak memory is dominated by loading the file and decoding it from bytes to Unicode. That’s why actual profiling is so helpful in reducing memory usage and speeding up your software: the real bottlenecks might not be obvious.
Even if loading the file is the bottleneck, that still raises some questions. The original file we loaded is 24MB. Once we load it into memory and decode it into a text (Unicode) Python string, it takes far more than 24MB. Why is that?
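To see the difference yourself, a quick sketch like the following compares the file’s size on disk with the in-memory size of the decoded string (using the same filename as above):

import os
import sys

# Size of the raw file on disk, in bytes:
print(os.path.getsize("large-file.json"))

# In-memory size of the decoded text; expect this to be noticeably larger:
with open("large-file.json", "r") as f:
    text = f.read()
print(sys.getsizeof(text))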
A brief digression: Python’s string memory representation
Python’s string representation is optimized to use less memory, depending on what the string contents are.
First, every string has a fixed overhead.
Then, if the string can be represented as ASCII, only one byte of memory is used per character.
If the string uses more extended characters, it might end up using as many as 4 bytes per character.
We can see how much memory an object needs using sys.getsizeof():
>>> import sys
>>> s = "a" * 1000
>>> len(s)
1000
>>> sys.getsizeof(s)
1049
>>> s2 = "❄" + "a" * 999
>>> len(s2)
1000
>>> sys.getsizeof(s2)
2074
>>> s3 = "💵" + "a" * 999
>>> len(s3)
1000
>>> sys.getsizeof(s3)
4076
Notice how all 3 strings are 1000 characters long, but they use different amounts of memory depending on which characters they contain.
If you look at our large JSON file, it contains characters that don’t fit in ASCII. Because it’s loaded as one giant string, that whole giant string uses a less efficient memory representation.
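To check whether your own data triggers the wider representation, you can ask the decoded text directly; note that this diagnostic sketch still reads the whole file into memory:

# Does the decoded file contain any non-ASCII characters?
# If isascii() returns False, Python cannot use the compact
# one-byte-per-character layout for this string.
with open("large-file.json", "r", encoding="utf-8") as f:
    text = f.read()
print(text.isascii())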
A streaming solution
It’s clear that loading the whole JSON file into memory is a waste of memory. With a larger file, it would be impossible to load at all.
Given a JSON file that’s structured as a list of objects, we could in theory parse it one chunk at a time instead of all at once. The resulting API would probably allow processing the objects one at a time. And if we look at the algorithm we want to run, that’s just fine; the algorithm does not require all the data be loaded into memory at once. We can process the records one at a time.
Whatever term you want to describe this approach—streaming, iterative parsing, chunking, or reading on-demand—it means we can reduce memory usage to:
- The in-progress data, which should typically be fixed in size.
- The result data structure, which in our case shouldn’t be too large.
There are a number of Python libraries that support this style of JSON parsing; in the following example, I used the ijson library.
import ijson

user_to_repos = {}

with open("large-file.json", "rb") as f:
    for record in ijson.items(f, "item"):
        user = record["actor"]["login"]
        repo = record["repo"]["name"]
        if user not in user_to_repos:
            user_to_repos[user] = set()
        user_to_repos[user].add(repo)
In the previous version, using the standard library, once the data is loaded we no longer need to keep the file open. With this API the file has to stay open, because the JSON parser is reading from it on demand as we iterate over the records.
The items() API takes a query string that tells it which part of the document to return. In this case, "item" just means “each item in the top-level list we’re iterating over”; see the ijson documentation for more details.
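As a sketch of a more specific query, the prefix below asks ijson for just the nested actor logins instead of whole records; the "item.actor.login" prefix is an assumption based on this file’s layout (top-level list, then "actor", then "login"):

import ijson

# Yield only each record's actor login, without materializing the full record.
with open("large-file.json", "rb") as f:
    logins = set(ijson.items(f, "item.actor.login"))

print(len(logins), "distinct users")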
Running the streaming version under the Fil memory profiler shows a far lower peak. When it comes to memory usage, problem solved!
And as far as runtime performance goes, the streaming/chunked solution with ijson actually runs slightly faster, though this won’t necessarily be the case for other datasets or algorithms.
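If you want to check the runtime comparison on your own data, a rough wall-clock sketch might look like this (a real benchmark would use multiple runs and warm-up):

import json
import time

import ijson

def timed(label, fn):
    # Crude single-run wall-clock timing.
    start = time.perf_counter()
    fn()
    print(label, time.perf_counter() - start, "seconds")

def stdlib_version():
    with open("large-file.json", "r") as f:
        for record in json.load(f):
            _ = (record["actor"]["login"], record["repo"]["name"])

def streaming_version():
    with open("large-file.json", "rb") as f:
        for record in ijson.items(f, "item"):
            _ = (record["actor"]["login"], record["repo"]["name"])

timed("json: ", stdlib_version)
timed("ijson:", streaming_version)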
Note: Whether or not any particular tool or technique will help depends on where the actual memory bottlenecks are in your software.
Other approaches
As always, there are other solutions you can try:
- Pandas: Pandas can read JSON and, in theory, it could do so in a more memory-efficient way for certain JSON layouts. In practice, for this example at least, peak memory was much worse, at 287MB, not including the overhead of importing Pandas.
- SQLite: The SQLite database can parse JSON, store JSON in columns, and query JSON (see the documentation). One could therefore load the JSON into a disk-backed database file and run queries against it to extract only the relevant subset of the data. I haven’t measured this approach, but if you need to run multiple queries against the same JSON file, this might be a good path going forward; you can add indexes, too. Hedged sketches of both approaches are shown below.
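Here are hedged sketches of both ideas, using the same large-file.json. For Pandas, read_json() still materializes the whole DataFrame at once, which matches the worse peak-memory measurement above:

import pandas as pd

# One row per event; the "actor" and "repo" columns hold nested dicts.
df = pd.read_json("large-file.json")
users = df["actor"].apply(lambda a: a["login"])
repos = df["repo"].apply(lambda r: r["name"])
user_to_repos = repos.groupby(users).apply(set).to_dict()

For SQLite (assuming a build that includes the JSON1 functions), you pay the cost of loading the text once, and afterwards queries run against the disk-backed events.db file:

import sqlite3

conn = sqlite3.connect("events.db")  # disk-backed database file
conn.execute("CREATE TABLE IF NOT EXISTS raw (doc TEXT)")

# Loading still reads the JSON text into memory once; later queries do not.
with open("large-file.json", "r") as f:
    conn.execute("INSERT INTO raw (doc) VALUES (?)", (f.read(),))
conn.commit()

# json_each() expands the top-level list into rows; json_extract() pulls out fields.
query = """
    SELECT json_extract(value, '$.actor.login'),
           json_extract(value, '$.repo.name')
    FROM raw, json_each(raw.doc)
"""
user_to_repos = {}
for user, repo in conn.execute(query):
    user_to_repos.setdefault(user, set()).add(repo)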
Finally, if you have control over the output format, there are ways to reduce the memory usage of JSON processing by switching to a more efficient representation. For example, switching from a single giant JSON list of objects to one JSON record per line (the “JSON Lines” format) means every decoded record only uses a small amount of memory at a time.
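If the data were written as JSON Lines, for instance, you could process it with nothing but the standard library, one record at a time (the "large-file.jsonl" filename here is hypothetical):

import json

user_to_repos = {}

with open("large-file.jsonl", "r") as f:
    for line in f:
        # Each line is one small, self-contained JSON document.
        record = json.loads(line)
        user = record["actor"]["login"]
        repo = record["repo"]["name"]
        user_to_repos.setdefault(user, set()).add(repo)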