Faster, more memory-efficient Python JSON parsing with msgspec

by Itamar Turner-Trauring
Last updated 06 Jan 2023, originally created 12 May 2022

If you need to process a large JSON file in Python, you want to:

Make sure you don’t use too much memory, so you don’t crash half-way through.
Parse it as quickly as possible.
Ideally, make sure the data is actually valid up-front, with the right structure, so you don’t blow up half-way through your analysis.

You can put together solutions with multiple libraries, of course. Or, you can use msgspec, a new library that offers schemas, fast parsing, and some neat tricks to reduce memory usage, all in one package.

A starting point: built-in `json` and `orjson`

Let’s start by looking at two other libraries: the built-in json module in Python, and the speedy orjson library. We’ll revisit the example from my article on streaming JSON parsing. Specifically, we’re going to be parsing a ~25MB file that encodes a list of JSON objects (i.e. dictionaries), which look to be GitHub events, users doing things to repositories:

[{"id":"2489651045","type":"CreateEvent","actor":{"id":665991,"login":"petroav","gravatar_id":"","url":"https://api.github.com/users/petroav","avatar_url":"https://avatars.githubusercontent.com/u/665991?"},"repo":{"id":28688495,"name":"petroav/6.828","url":"https://api.github.com/repos/petroav/6.828"},"payload":{"ref":"master","ref_type":"branch","master_branch":"master","description":"Solution to homework and assignments from MIT's 6.828 (Operating Systems Engineering). Done in my spare time.","pusher_type":"user"},"public":true,"created_at":"2015-01-01T15:00:00Z"},
...
]

Our goal is to figure out which repositories a given user interacted with.

Here’s how you’d do it with the json module built in to the Python standard library:

import json

with open("large.json", "r") as f:
    data = json.load(f)

user_to_repos = {}
for record in data:
    user = record["actor"]["login"]
    repo = record["repo"]["name"]
    if user not in user_to_repos:
        user_to_repos[user] = set()
    user_to_repos[user].add(repo)
print(len(user_to_repos), "records")
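If you want to try the grouping logic without the 25MB file, here is a minimal sketch that runs on a tiny inline sample. The sample records are made up for illustration (only the first repository name comes from the example above), and it uses `collections.defaultdict` as a shorter alternative to the explicit membership check; the behavior should be the same.

```python
import json
from collections import defaultdict

# Hypothetical miniature stand-in for large.json; only the fields we read.
sample = """[
  {"actor": {"login": "petroav"}, "repo": {"name": "petroav/6.828"}},
  {"actor": {"login": "petroav"}, "repo": {"name": "petroav/notes"}},
  {"actor": {"login": "someuser"}, "repo": {"name": "someuser/project"}}
]"""

# Map each user to the set of repositories they interacted with.
user_to_repos = defaultdict(set)
for record in json.loads(sample):
    user_to_repos[record["actor"]["login"]].add(record["repo"]["name"])

print(len(user_to_repos), "users")
```

Note that the dictionary is keyed by user, so its length is the number of distinct users, not the number of events.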

And here is how you’d do it with orjson, a two-line change:

import orjson

with open("large.json", "rb") as f:
    data = orjson.loads(f.read())

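user_to_repos = {}
for record in data:
    # From here on, identical to the stdlib version.
    user = record["actor"]["login"]
    repo = record["repo"]["name"]
    if user not in user_to_repos:
        user_to_repos[user] = set()
    user_to_repos[user].add(repo)
print(len(user_to_repos), "records")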

Here’s how much memory and time these two options take:

$ /usr/bin/time -f "RAM: %M KB, Elapsed: %E" python stdlib.py 
5250 records
RAM: 136464 KB, Elapsed: 0:00.42
$ /usr/bin/time -f "RAM: %M KB, Elapsed: %E" python with_orjson.py 
5250 records
RAM: 113676 KB, Elapsed: 0:00.28

Memory usage is similar (136MB vs. 114MB), but orjson is faster, at 280ms instead of 420ms.
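If you would rather take roughly the same measurements from inside Python, here is a sketch using the standard library's `resource` module (Unix only). The helper names `peak_rss_kb` and `measure` are made up for illustration, and a small in-memory document stands in for large.json. One caveat I'm sure of: `ru_maxrss` is reported in kilobytes on Linux but in bytes on macOS, so the helper normalizes it.

```python
import json
import resource
import sys
import time


def peak_rss_kb() -> int:
    """Peak resident memory of this process so far, in KB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports kilobytes, macOS reports bytes.
    return peak // 1024 if sys.platform == "darwin" else peak


def measure(func, *args):
    """Run func(*args); return (result, elapsed seconds, peak RSS in KB)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed, peak_rss_kb()


# Small stand-in document; swap in the contents of your real file.
sample = json.dumps([{"id": i} for i in range(1000)])
parsed, elapsed, peak_kb = measure(json.loads, sample)
print(f"RAM: {peak_kb} KB, Elapsed: {elapsed:.3f}s")
```

Unlike `/usr/bin/time`, the elapsed time here covers only the parsing call, not interpreter startup, so the numbers won't match exactly.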

Next, let’s consider msgspec.

`msgspec`: schema-based decoding and encoding for JSON

Here’s the corresponding code using msgspec; as you can see, it’s somewhat different in its approach to parsing:

from msgspec.json import decode
from msgspec import Struct

class Repo(Struct):
    name: str

class Actor(Struct):
    login: str

class Interaction(Struct):
    actor: Actor
    repo: Repo

with open("large.json", "rb") as f:
    data = decode(f.read(), type=list[Interaction])

user_to_repos = {}
for record in data:
    user = record.actor.login
    repo = record.repo.name
    if user not in user_to_repos:
        user_to_repos[user] = set()
    user_to_repos[user].add(repo)
print(len(user_to_repos), "records")

This code is longer because msgspec allows you to define schemas for the records you’re parsing.

Quite usefully, you don’t have to have a schema for all the fields. While the JSON records have plenty of fields (look at the example earlier to see all the data), we only tell msgspec about the fields we actually care about.

Here’s the result of parsing with msgspec:

$ /usr/bin/time -f "RAM: %M KB, Elapsed: %E" python with_msgspec.py 
5250 records
RAM: 38612 KB, Elapsed: 0:00.09

Much faster, and much less memory: 90ms and 39MB.

To summarize the three options we’ve seen, as well as a streaming ijson-based solution:

Package	Time	RAM	Fixed memory use	Schema
Stdlib `json`	420ms	136MB	❌	❌
`orjson`	280ms	114MB	❌	❌
`ijson`	300ms	14MB	✓	❌
`msgspec`	90ms	39MB	❌	✓

The streaming solution only ever uses a fixed amount of memory for the parsing; all the other solutions have memory usage that scales with the size of the input. But of those other three, msgspec has significantly lower memory usage, and it is by far the fastest solution.

Note: Whether or not any particular tool or technique will speed things up depends on where the bottlenecks are in your software.

The pros and cons of schema-based parsing

Because msgspec allows you to specify the schema, we were able to create Python objects for only those fields that we actually cared about. That meant lower RAM usage and faster decoding; no need to waste time or memory creating thousands of Python objects we were never going to look at.

We also got schema validation for free. If one of the records somehow was missing a field, or if the value was the wrong type, like an integer instead of a string, the parser would have complained. With standard JSON libraries, schema validation has to happen separately.

On the other hand:

Memory usage when decoding still scales with the input file. Streaming JSON parsers like ijson still offer the benefit of fixed memory usage during parsing, no matter how large the input file.
Specifying the schema involves more coding, and less flexibility to deal with imperfect data.

Learning more about `msgspec`

msgspec has additional features, like encoding, MessagePack support (a faster alternative format to JSON), and more. If you’re parsing JSON files on a regular basis, and you’re hitting performance or memory issues, or you just want built-in schemas, consider giving it a try.

