Choosing a faster JSON library for Python

by Itamar Turner-Trauring
Last updated 06 Jan 2023, originally created 26 Apr 2019

The more you use JSON, the more likely you are to encounter JSON encoding or decoding as a bottleneck. Python’s built-in library isn’t bad, but there are multiple faster JSON libraries available: how do you choose which one to use?

The truth is there’s no one correct answer, no one fastest JSON library to rule them all:

A “fast JSON library” means different things to different people, because their usage patterns are different.
Speed isn’t everything—there are other things you may care about, like security and customization.

So to help you choose the fastest JSON library for your needs, I’d like to share the process I went through to choose a fast JSON library for Python. You can use this process to pick the library that best fits your particular needs:

1. Make sure there really is a problem.
2. Define the benchmark.
3. Filter based on additional requirements.
4. Benchmark the remaining candidates.

Step #1: Do you actually need a new JSON library?

Just because you use JSON doesn’t mean it’s a relevant bottleneck. Before you spend any time thinking about which JSON library to use, you need some evidence suggesting Python’s built-in JSON library really is a problem in your particular application.
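
One way to gather that evidence is to profile a representative workload and see what share of the time goes to `json.dumps`. Here is a minimal sketch using the standard library's `cProfile`. The `generate_messages` workload is a hypothetical stand-in for your own code, and the lookup relies on the `stats` dictionary of `pstats.Stats`, so treat it as an illustration rather than a polished tool. Profiling adds its own overhead, so read the resulting percentage as a rough indication only.

```python
import cProfile
import json
import pstats


def generate_messages(count):
    """Hypothetical workload: build and encode `count` small messages."""
    for i in range(count):
        message = {"index": i, "status": "started", "tags": ["a", "b"]}
        json.dumps(message)


profiler = cProfile.Profile()
profiler.enable()
generate_messages(50_000)
profiler.disable()

stats = pstats.Stats(profiler)
# stats.stats maps (filename, line, function name) to
# (primitive calls, total calls, own time, cumulative time, callers).
json_time = sum(
    cumulative
    for (filename, _line, name), (_pc, _calls, _own, cumulative, _callers)
    in stats.stats.items()
    if name == "dumps" and "json" in filename
)
json_share = json_time / stats.total_tt
print(f"json.dumps accounts for {json_share:.0%} of profiled time")
```

In this toy workload nearly all the time is JSON encoding, because the loop does nothing else. In a real application the share will be lower, and that number is what tells you whether switching libraries is worth the effort.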

In my case, I learned this from a benchmark for my causal logging library Eliot, which suggested that JSON encoding took up something like 25% of the CPU time used generating messages. The largest speedup I could get is running 33% faster (if JSON encoding time went to zero, the remaining 75% of the work would finish in 0.75 of the time, and 1/0.75 ≈ 1.33), but that’s a big enough chunk of time that sooner or later it would make it to the top of the list.
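
The arithmetic behind the 25% and 33% figures is a special case of Amdahl's law, and it is worth redoing for your own numbers before committing to a library switch. The helper name `max_speedup` below is made up for this sketch.

```python
def max_speedup(fraction):
    """Upper bound on overall speedup if `fraction` of the runtime drops to zero."""
    if not 0 <= fraction < 1:
        raise ValueError("fraction must be in the range [0, 1)")
    return 1 / (1 - fraction)


json_fraction = 0.25  # share of CPU time spent encoding JSON
print(f"At most {max_speedup(json_fraction):.2f}x faster")  # At most 1.33x faster
```

No real library takes encoding time all the way to zero, so the gain you actually see will be smaller than this bound.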

Step #2: Define the benchmark

If you look at the benchmark pages for various JSON libraries, they will talk about how they do on a variety of different messages. Those messages don’t necessarily correspond to your usage, however. Quite often they’re measuring very large messages, and in my case at least I care about small messages.

So you want to come up with some measure that matches your particular usage patterns:

Do you care about encoding, decoding, or both?
Are you using small or large messages?
What do typical messages look like?

In my case I mostly care about encoding small messages, the particular structure of log messages generated by Eliot. I came up with the following sample message, based on some real logs:

{
    "timestamp": 1556283673.1523004,
    "task_uuid": "0ed1a1c3-050c-4fb9-9426-a7e72d0acfc7",
    "task_level": [1, 2, 1],
    "action_status": "started",
    "action_type": "main",
    "key": "value",
    "another_key": 123,
    "and_another": ["a", "b"]
}
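
To answer the questions above for your own data, it can help to measure a sample message directly. The following sketch uses only the standard library to report the encoded size plus rough encoding and decoding times. The iteration count is an arbitrary choice, and `MESSAGE` is simply the sample message from above.

```python
import json
import timeit

MESSAGE = {
    "timestamp": 1556283673.1523004,
    "task_uuid": "0ed1a1c3-050c-4fb9-9426-a7e72d0acfc7",
    "task_level": [1, 2, 1],
    "action_status": "started",
    "action_type": "main",
    "key": "value",
    "another_key": 123,
    "and_another": ["a", "b"],
}

encoded = json.dumps(MESSAGE)
size_in_bytes = len(encoded.encode("utf-8"))
print(f"Encoded size: {size_in_bytes} bytes")

# Time encoding and decoding separately, since you may only care about one.
iterations = 50_000
encode_seconds = timeit.timeit(lambda: json.dumps(MESSAGE), number=iterations)
decode_seconds = timeit.timeit(lambda: json.loads(encoded), number=iterations)
print(f"Encoding: {encode_seconds / iterations * 1e6:.2f} microseconds per message")
print(f"Decoding: {decode_seconds / iterations * 1e6:.2f} microseconds per message")
```

If your real messages are megabytes rather than a few hundred bytes, swap in a representative large document, because the relative ranking of libraries can change with message size.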

Step #3: Filter based on additional requirements

Performance isn’t everything—there are other things you might care about. In my case:

Security/crash resistance: log messages can contain data that comes from untrusted sources. If the JSON encoder crashes on bad data, that is bad for both reliability and security.
Custom encoding: Eliot supports customization of JSON encoding, so you can serialize additional kinds of Python objects. Some JSON libraries support this, others do not.
Cross-platform: runs on Linux, macOS, Windows.
Maintained: I don’t want to rely on a library that isn’t being actively supported.
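
To make the custom encoding requirement concrete, here is a sketch of what such a hook looks like with the standard library's `json.dumps`, which accepts a `default` callable for objects it can't serialize. The `encode_extra` function and the example message are made up for illustration. As far as I know, orjson and rapidjson each accept a `default` argument as well, but check their documentation for the exact semantics before relying on it.

```python
import json
from datetime import datetime, timezone
from pathlib import PurePosixPath


def encode_extra(obj):
    """Fallback called by json.dumps for objects it can't serialize itself."""
    if isinstance(obj, datetime):
        return obj.isoformat()
    if isinstance(obj, PurePosixPath):
        return str(obj)
    # Raising TypeError tells the encoder this object really is unsupported.
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


message = {
    "when": datetime(2019, 4, 26, tzinfo=timezone.utc),
    "path": PurePosixPath("/var/log/app.log"),
}
encoded = json.dumps(message, default=encode_extra)
print(encoded)
```

A library with no equivalent hook would force you to convert every message up front, which may cancel out some of the speed you were hoping to gain.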

Libraries I considered were orjson, rapidjson, ujson, and hyperjson.

I filtered out some of these based on the criteria above:

At the time I originally wrote this article, ujson had a number of bugs filed regarding crashes, and no release since 2016. It looks like it’s being maintained again, but I haven’t gone back and revisited it.
hyperjson only had packages for macOS, and in general seemed pretty immature. These days they just recommend using orjson.
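
Reading a bug tracker is one way to judge crash resistance. Another is to throw awkward inputs at a candidate and see what happens. The sketch below is a small smoke test of my own devising, not an exhaustive fuzzer. The input list and the `smoke_test` name are assumptions, and it is demonstrated here with the standard library only. A Python exception is an acceptable outcome. A segfault is not.

```python
import json

# Inputs that have historically tripped up JSON encoders in various languages.
AWKWARD_INPUTS = {
    "lone surrogate": {"text": "\ud800"},
    "NaN": {"value": float("nan")},
    "huge integer": {"value": 2**100},
    "tuple key": {(1, 2): "value"},
    "unsupported object": {"value": object()},
    "bytes": {"value": b"\xff\xfe"},
}


def smoke_test(dumps):
    """Return a mapping of input label to "ok" or the exception type raised."""
    outcomes = {}
    for label, value in AWKWARD_INPUTS.items():
        try:
            dumps(value)
        except Exception as error:
            outcomes[label] = type(error).__name__
        else:
            outcomes[label] = "ok"
    return outcomes


for label, outcome in smoke_test(json.dumps).items():
    print(f"{label}: {outcome}")
```

To test another library, pass its `dumps` function instead. Since a genuine crash would take down the whole interpreter, it's safer to run this in a separate process when trying an unfamiliar library.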

Step #4: Benchmarking

The two final contenders were rapidjson and orjson. I ran the following benchmark:

import time
import json
import orjson
import rapidjson

m = {
    "timestamp": 1556283673.1523004,
    "task_uuid": "0ed1a1c3-050c-4fb9-9426-a7e72d0acfc7",
    "task_level": [1, 2, 1],
    "action_status": "started",
    "action_type": "main",
    "key": "value",
    "another_key": 123,
    "and_another": ["a", "b"],
}

def benchmark(name, dumps):
    # perf_counter() is a monotonic, high-resolution clock meant for timing.
    start = time.perf_counter()
    for _ in range(1_000_000):
        dumps(m)
    print(name, time.perf_counter() - start)

benchmark("Python", json.dumps)
# orjson only outputs bytes, but often we need unicode:
benchmark("orjson", lambda s: str(orjson.dumps(s), "utf-8"))
benchmark("rapidjson", rapidjson.dumps)

And the results:

$ python jsonperf.py 
Python 4.829106330871582
orjson 1.0466396808624268
rapidjson 2.1441543102264404

Even with the need for additional Unicode decoding, orjson is fastest (for this particular benchmark!): roughly 4.6× faster than the built-in library and about 2× faster than rapidjson.
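
A single timed loop is sensitive to whatever else the machine happens to be doing. If you want somewhat more stable numbers, one option is to repeat the measurement and keep the best run. The sketch below is a variation on the benchmark above using the standard library's `timeit.repeat`. The function names, iteration count, and repeat count are my own choices, and it skips any library that isn't installed.

```python
import json
import timeit

MESSAGE = {
    "timestamp": 1556283673.1523004,
    "task_uuid": "0ed1a1c3-050c-4fb9-9426-a7e72d0acfc7",
    "task_level": [1, 2, 1],
    "action_status": "started",
    "action_type": "main",
    "key": "value",
    "another_key": 123,
    "and_another": ["a", "b"],
}


def load_candidates():
    """Collect the dumps() functions of whichever libraries are installed."""
    candidates = {"json": json.dumps}
    try:
        import orjson
        # orjson returns bytes, so decode to get a str like the others.
        candidates["orjson"] = lambda obj: orjson.dumps(obj).decode("utf-8")
    except ImportError:
        pass
    try:
        import rapidjson
        candidates["rapidjson"] = rapidjson.dumps
    except ImportError:
        pass
    return candidates


def benchmark(dumps, number=20_000, repeat=3):
    """Return the best observed time per call, in seconds."""
    timings = timeit.repeat(lambda: dumps(MESSAGE), number=number, repeat=repeat)
    return min(timings) / number


results = {name: benchmark(dumps) for name, dumps in load_candidates().items()}
for name, seconds in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds * 1e6:.2f} microseconds per call")
```

Taking the minimum rather than the mean filters out runs that were slowed down by unrelated background activity, though it still only tells you about this one message shape.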

As always, there are tradeoffs. orjson has fewer users than rapidjson (compare orjson PyPI stats to rapidjson PyPI stats), and there are no Conda packages, so I’d have to package it for Conda-forge myself. But it’s definitely a lot faster.

Note: Whether or not any particular tool or technique will speed things up depends on where the bottlenecks are in your software.


Your use case, your choice

Should you use orjson? Not necessarily. You might have different requirements, and your benchmarks might be different—maybe you need to decode large files, for example.

The key takeaway is the process: figure out your particular requirements, performance and otherwise, and choose the library that best meets your needs.
