Faster, more memory-efficient Python JSON parsing with msgspec
If you need to process a large JSON file in Python, you want to:
- Make sure you don’t use too much memory, so you don’t crash halfway through.
- Parse it as quickly as possible.
- Ideally, make sure the data is actually valid up-front, with the right structure, so you don’t blow up halfway through your analysis.
You can put together solutions with multiple libraries, of course.
Or, you can use msgspec, a new library that offers schemas, fast parsing, and some neat tricks to reduce memory usage, all in a single library.
A starting point: built-in json and orjson
Let’s start by looking at two other libraries: the built-in json module in Python, and the speedy orjson library.
We’ll revisit the example from my article on streaming JSON parsing.
Specifically, we’re going to be parsing a ~25MB file that encodes a list of JSON objects (i.e. dictionaries), which look to be GitHub events, users doing things to repositories:
[{"id":"2489651045","type":"CreateEvent","actor":{"id":665991,"login":"petroav","gravatar_id":"","url":"https://api.github.com/users/petroav","avatar_url":"https://avatars.githubusercontent.com/u/665991?"},"repo":{"id":28688495,"name":"petroav/6.828","url":"https://api.github.com/repos/petroav/6.828"},"payload":{"ref":"master","ref_type":"branch","master_branch":"master","description":"Solution to homework and assignments from MIT's 6.828 (Operating Systems Engineering). Done in my spare time.","pusher_type":"user"},"public":true,"created_at":"2015-01-01T15:00:00Z"},
...
]
Our goal is to figure out which repositories a given user interacted with.
Here’s how you’d do it with the Python standard library’s built-in json module:
import json

with open("large.json", "r") as f:
    data = json.load(f)

user_to_repos = {}
for record in data:
    user = record["actor"]["login"]
    repo = record["repo"]["name"]
    if user not in user_to_repos:
        user_to_repos[user] = set()
    user_to_repos[user].add(repo)

print(len(user_to_repos), "records")
And here is how you’d do it with orjson, a two-line change:
import orjson

with open("large.json", "rb") as f:
    data = orjson.loads(f.read())

user_to_repos = {}
for record in data:
    # ... same as stdlib code ...
Here’s how much memory and time these two options take:
$ /usr/bin/time -f "RAM: %M KB, Elapsed: %E" python stdlib.py
5250 records
RAM: 136464 KB, Elapsed: 0:00.42
$ /usr/bin/time -f "RAM: %M KB, Elapsed: %E" python with_orjson.py
5250 records
RAM: 113676 KB, Elapsed: 0:00.28
Memory usage is similar, but orjson is faster, at 280ms instead of 420ms.
Next, let’s consider msgspec.
msgspec: schema-based decoding and encoding for JSON
Here’s the corresponding code using msgspec; as you can see, it’s somewhat different in its approach to parsing:
from msgspec.json import decode
from msgspec import Struct

class Repo(Struct):
    name: str

class Actor(Struct):
    login: str

class Interaction(Struct):
    actor: Actor
    repo: Repo

with open("large.json", "rb") as f:
    data = decode(f.read(), type=list[Interaction])

user_to_repos = {}
for record in data:
    user = record.actor.login
    repo = record.repo.name
    if user not in user_to_repos:
        user_to_repos[user] = set()
    user_to_repos[user].add(repo)

print(len(user_to_repos), "records")
This code is longer and more verbose, because msgspec allows you to define schemas for the records you’re parsing.
Quite usefully, you don’t have to have a schema for all the fields.
While the JSON records have plenty of fields (look at the example earlier to see all the data), we only tell msgspec about the fields we actually care about.
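As a small illustration (a hypothetical snippet, not part of the benchmark above), decoding a record with extra keys into a Struct that only declares login works fine; by default, msgspec simply skips any fields it wasn’t told about:

from msgspec import Struct
from msgspec.json import decode

class Actor(Struct):
    login: str

# "id" and "url" are not declared on Actor, so they are ignored during decoding.
raw = b'{"login": "petroav", "id": 665991, "url": "https://api.github.com/users/petroav"}'
actor = decode(raw, type=Actor)
print(actor.login)  # petroav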
Here’s the result of parsing with msgspec:
$ /usr/bin/time -f "RAM: %M KB, Elapsed: %E" python with_msgspec.py
5250 records
RAM: 38612 KB, Elapsed: 0:00.09
Much faster, and much less memory.
To summarize the three options we’ve seen, as well as a streaming ijson-based solution:
| Package | Time | RAM | Fixed memory use | Schema |
|---|---|---|---|---|
| Stdlib json | 420ms | 136MB | ❌ | ❌ |
| orjson | 280ms | 114MB | ❌ | ❌ |
| ijson | 300ms | 14MB | ✓ | ❌ |
| msgspec | 90ms | 39MB | ❌ | ✓ |
The streaming solution only ever uses a fixed amount of memory for the parsing; all the other solutions have memory usage that scales with the size of the input.
But of those three, msgspec has significantly lower memory usage, and it is by far the fastest solution.
Note: Whether or not any particular tool or technique will speed things up depends on where the bottlenecks are in your software.
The pros and cons of schema-based parsing
Because msgspec allows you to specify the schema, we were able to create Python objects for only those fields that we actually cared about.
That meant lower RAM usage and faster decoding; no need to waste time or memory creating thousands of Python objects we were never going to look at.
We also got schema validation for free. If one of the records somehow was missing a field, or if the value was the wrong type, like an integer instead of a string, the parser would have complained. With standard JSON libraries, schema validation has to happen separately.
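For example, here is a hedged sketch of what that looks like: feeding in a record where login is an integer instead of a string makes the decoder raise an error at parse time (in recent msgspec versions this is msgspec.ValidationError; older releases raised msgspec.DecodeError instead):

import msgspec
from msgspec import Struct
from msgspec.json import decode

class Actor(Struct):
    login: str

class Interaction(Struct):
    actor: Actor

# "login" should be a string, not an integer:
bad = b'[{"actor": {"login": 12345}, "repo": {"name": "petroav/6.828"}}]'
try:
    decode(bad, type=list[Interaction])
except msgspec.ValidationError as e:  # msgspec.DecodeError on older versions
    print("Invalid record:", e)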
On the other hand:
- Memory usage when decoding still scales with the size of the input file. Streaming JSON parsers like ijson still offer the benefit of fixed memory usage during parsing, no matter how large the input file is; see the sketch after this list.
- Specifying the schema involves more coding, and less flexibility to deal with imperfect data.
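For comparison, here is roughly what the fixed-memory approach looks like with ijson; this is a sketch based on its documented items() API, not the exact code from the earlier streaming article:

import ijson

user_to_repos = {}
with open("large.json", "rb") as f:
    # ijson.items() yields one array element at a time, so the whole
    # list is never held in memory at once.
    for record in ijson.items(f, "item"):
        user = record["actor"]["login"]
        repo = record["repo"]["name"]
        user_to_repos.setdefault(user, set()).add(repo)

print(len(user_to_repos), "records")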
Learning more about msgspec
msgspec has additional features, like encoding, MessagePack support (a faster alternative format to JSON), and more.
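For instance, the same Struct definitions can be reused for encoding, and for MessagePack via the msgspec.msgpack module. A brief sketch of that API, reusing the Repo struct defined earlier:

import msgspec
from msgspec import Struct

class Repo(Struct):
    name: str

repo = Repo(name="petroav/6.828")

# JSON encoding: returns bytes.
json_bytes = msgspec.json.encode(repo)

# MessagePack encoding/decoding uses the same schema machinery.
msgpack_bytes = msgspec.msgpack.encode(repo)
assert msgspec.msgpack.decode(msgpack_bytes, type=Repo) == repo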
If you’re parsing JSON files on a regular basis, and you’re hitting performance or memory issues, or you just want built-in schemas, consider giving it a try.