Faster, more memory-efficient Python JSON parsing with msgspec

If you need to process a large JSON file in Python, you want to:

  1. Make sure you don’t use too much memory, so you don’t crash half-way through.
  2. Parse it as quickly as possible.
  3. Ideally, make sure the data is actually valid up-front, with the right structure, so you don’t blow up half-way through your analysis.

You can put together solutions with multiple libraries, of course. Or you can use msgspec, a new library that offers schemas, fast parsing, and some neat tricks to reduce memory usage, all in a single library.

A starting point: built-in json and orjson

Let’s start by looking at two other libraries: the built-in json module in Python, and the speedy orjson library. We’ll revisit the example from my article on streaming JSON parsing. Specifically, we’re going to be parsing a ~25MB file that encodes a list of JSON objects (i.e. dictionaries). They look to be GitHub events, each one recording a user doing something to a repository:

[{"id":"2489651045","type":"CreateEvent","actor":{"id":665991,"login":"petroav","gravatar_id":"","url":"https://api.github.com/users/petroav","avatar_url":"https://avatars.githubusercontent.com/u/665991?"},"repo":{"id":28688495,"name":"petroav/6.828","url":"https://api.github.com/repos/petroav/6.828"},"payload":{"ref":"master","ref_type":"branch","master_branch":"master","description":"Solution to homework and assignments from MIT's 6.828 (Operating Systems Engineering). Done in my spare time.","pusher_type":"user"},"public":true,"created_at":"2015-01-01T15:00:00Z"},
...
]

Our goal is to figure out which repositories a given user interacted with.

Here’s how you’d do it with the standard library’s built-in json module:

import json

with open("large.json", "r") as f:
    data = json.load(f)

user_to_repos = {}
for record in data:
    user = record["actor"]["login"]
    repo = record["repo"]["name"]
    if user not in user_to_repos:
        user_to_repos[user] = set()
    user_to_repos[user].add(repo)
print(len(user_to_repos), "records")
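To make the grouping step easier to see on its own, here is a minimal sketch that runs the same logic on a tiny inline sample. The three records and the group_repos_by_user helper are made up for illustration, and collections.defaultdict is just a more compact way to write the "create the set if missing" check.

```python
import json
from collections import defaultdict

# Hypothetical three-event sample with the same shape as the real file,
# trimmed to the two fields the analysis actually reads.
SAMPLE = """
[
  {"actor": {"login": "alice"}, "repo": {"name": "alice/project"}},
  {"actor": {"login": "bob"}, "repo": {"name": "alice/project"}},
  {"actor": {"login": "alice"}, "repo": {"name": "alice/other"}}
]
"""


def group_repos_by_user(records):
    """Map each user login to the set of repository names they touched."""
    user_to_repos = defaultdict(set)
    for record in records:
        user_to_repos[record["actor"]["login"]].add(record["repo"]["name"])
    return dict(user_to_repos)


user_to_repos = group_repos_by_user(json.loads(SAMPLE))
print(len(user_to_repos), "users")
```

Using sets means a user who touches the same repository twice is only counted once for that repository.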

And here is how you’d do it with orjson, a two-line change:

import orjson

with open("large.json", "rb") as f:
    data = orjson.loads(f.read())

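# Same grouping logic as the stdlib version.
user_to_repos = {}
for record in data:
    user = record["actor"]["login"]
    repo = record["repo"]["name"]
    if user not in user_to_repos:
        user_to_repos[user] = set()
    user_to_repos[user].add(repo)
print(len(user_to_repos), "records")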

Here’s how much memory and time these two options take:

$ /usr/bin/time -f "RAM: %M KB, Elapsed: %E" python stdlib.py 
5250 records
RAM: 136464 KB, Elapsed: 0:00.42
$ /usr/bin/time -f "RAM: %M KB, Elapsed: %E" python with_orjson.py 
5250 records
RAM: 113676 KB, Elapsed: 0:00.28

Memory usage is similar (114MB versus 136MB), but orjson is faster, at 280ms instead of 420ms.
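If you would rather take a similar measurement from inside Python, here is a rough sketch using only the standard library. It assumes a Unix-like system (the resource module isn’t available on Windows), the measure helper and the generated payload are hypothetical stand-ins, and the numbers won’t match /usr/bin/time exactly, since that also includes interpreter startup time.

```python
import json
import resource
import time


def measure(func, *args):
    """Run func(*args); return (result, elapsed seconds, peak RSS so far)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    # ru_maxrss is the high-water mark for the whole process so far:
    # reported in kilobytes on Linux, in bytes on macOS.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return result, elapsed, peak


# Hypothetical stand-in payload; swap in the contents of your own file.
payload = json.dumps([{"id": i, "name": f"repo-{i}"} for i in range(10_000)])

data, elapsed, peak = measure(json.loads, payload)
print(f"{len(data)} records, {elapsed * 1000:.1f}ms, peak RSS {peak}")
```

Because the peak is a process-wide maximum, it only makes sense to compare parsers by running each one in a fresh process, which is what the command-line measurements above do.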

Next, let’s consider msgspec.

msgspec: schema-based decoding and encoding for JSON

Here’s the corresponding code using msgspec; as you can see, it’s somewhat different in its approach to parsing:

from msgspec.json import decode
from msgspec import Struct

class Repo(Struct):
    name: str

class Actor(Struct):
    login: str

class Interaction(Struct):
    actor: Actor
    repo: Repo

with open("large.json", "rb") as f:
    data = decode(f.read(), type=list[Interaction])

user_to_repos = {}
for record in data:
    user = record.actor.login
    repo = record.repo.name
    if user not in user_to_repos:
        user_to_repos[user] = set()
    user_to_repos[user].add(repo)
print(len(user_to_repos), "records")

This code is longer and more verbose, because msgspec has you define schemas for the records you’re parsing.

Quite usefully, you don’t have to have a schema for all the fields. While the JSON records have plenty of fields (look at the example earlier to see all the data), we only tell msgspec about the fields we actually care about.

Here’s the result of parsing with msgspec:

$ /usr/bin/time -f "RAM: %M KB, Elapsed: %E" python with_msgspec.py 
5250 records
RAM: 38612 KB, Elapsed: 0:00.09

At 90ms and 39MB, msgspec is much faster and uses much less memory than either json or orjson.

To summarize the three options we’ve seen, as well as a streaming ijson-based solution:

Package      Time   RAM    Fixed memory use  Schema
Stdlib json  420ms  136MB  no                no
orjson       280ms  114MB  no                no
ijson        300ms  14MB   yes               no
msgspec      90ms   39MB   no                yes

The streaming solution only ever uses a fixed amount of memory for the parsing; all the other solutions have memory usage that scales with the size of the input. But of those three non-streaming solutions, msgspec has significantly lower memory usage, and it is by far the fastest of all four.

The pros and cons of schema-based parsing

Because msgspec allows you to specify the schema, we were able to create Python objects for only those fields that we actually cared about. That meant lower RAM usage and faster decoding; no need to waste time or memory creating thousands of Python objects we were never going to look at.

We also got schema validation for free. If one of the records somehow was missing a field, or if the value was the wrong type, like an integer instead of a string, the parser would have complained. With standard JSON libraries, schema validation has to happen separately.

On the other hand:

  • Memory usage when decoding still scales with the input file. Streaming JSON parsers like ijson still offer the benefit of fixed memory usage during parsing, no matter how large the input file.
  • Specifying the schema involves more coding, and less flexibility to deal with imperfect data.

Learning more about msgspec

msgspec has additional features, like encoding, MessagePack support (a faster alternative format to JSON), and more. If you’re parsing JSON files on a regular basis, and you’re hitting performance or memory issues, or you just want built-in schemas, consider giving it a try.