Choosing a faster JSON library for Python
The more you use JSON, the more likely you are to encounter JSON encoding or decoding as a bottleneck. Python’s built-in library isn’t bad, but there are multiple faster JSON libraries available: how do you choose which one to use?
The truth is there’s no one correct answer, no one fastest JSON library to rule them all:
- A “fast JSON library” means different things to different people, because their usage patterns are different.
- Speed isn’t everything—there are other things you may care about, like security and customization.
So to help you choose the fastest JSON library for your needs, I’d like to share the process I went through to choose a fast JSON library for Python. You can use this process to pick the library that best fits your particular needs:
- Make sure there really is a problem.
- Define the benchmark.
- Filter based on additional requirements.
- Benchmark the remaining candidates.
Step #1: Do you actually need a new JSON library?
Just because you use JSON doesn’t mean it’s a relevant bottleneck. Before you spend any time thinking about which JSON library to use, you need some evidence suggesting Python’s built-in JSON library really is a problem in your particular application.
In my case, I learned this from a benchmark for my causal logging library Eliot, which suggested that JSON encoding took up something like 25% of the CPU time used to generate messages. At best, if JSON encoding time dropped to zero, message generation would be about 33% faster, but that’s a big enough chunk of time that sooner or later it would make it to the top of the list.
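If you don’t have that kind of evidence yet, profiling is the easiest way to get it. Here’s a minimal sketch using the standard library’s cProfile; generate_messages() is a hypothetical stand-in for your application’s real hot path, so swap in your own code and see how much of the time ends up attributed to json.dumps():
import cProfile
import json
import pstats

def generate_messages():
    # Hypothetical stand-in for your real workload:
    for i in range(100_000):
        json.dumps({"message": "hello", "count": i})

# Profile the workload and print the 10 biggest cumulative-time functions:
cProfile.run("generate_messages()", "profile.stats")
pstats.Stats("profile.stats").sort_stats("cumulative").print_stats(10)
If json.dumps() (or json.loads()) barely shows up, you can stop here and spend your optimization time elsewhere.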
Step #2: Define the benchmark
If you look at the benchmark pages for various JSON libraries, they will talk about how they do on a variety of different messages. Those messages don’t necessarily correspond to your usage, however. Quite often they’re measuring very large messages, and in my case at least I care about small messages.
So you want to come up with some measure that matches your particular usage patterns:
- Do you care about encoding, decoding, or both?
- Are you using small or large messages?
- What do typical messages look like?
In my case I mostly care about encoding small messages, the particular structure of log messages generated by Eliot. I came up with the following sample message, based on some real logs:
{
    "timestamp": 1556283673.1523004,
    "task_uuid": "0ed1a1c3-050c-4fb9-9426-a7e72d0acfc7",
    "task_level": [1, 2, 1],
    "action_status": "started",
    "action_type": "main",
    "key": "value",
    "another_key": 123,
    "and_another": ["a", "b"]
}
Step #3: Filter based on additional requirements
Performance isn’t everything—there are other things you might care about. In my case:
- Security/crash resistance: log messages can contain data that comes from untrusted sources. If the JSON encoder crashes on bad data, that is not good either for reliability or security.
- Custom encoding: Eliot supports customization of JSON encoding, so you can serialize additional kinds of Python objects (there’s a sketch of what that looks like after this list). Some JSON libraries support this, others do not.
- Cross-platform: runs on Linux, macOS, Windows.
- Maintained: I don’t want to rely on a library that isn’t being actively supported.
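To make the custom encoding requirement concrete, here’s a minimal sketch of the kind of hook involved, using the standard library’s default= parameter; orjson and rapidjson accept a similar default= argument, but not every JSON library has an equivalent. The UUID handling below is just an illustrative example, not Eliot’s actual encoder:
import json
import uuid

def encode_extra(obj):
    # Called for any object json.dumps() doesn't know how to serialize:
    if isinstance(obj, uuid.UUID):
        return str(obj)
    raise TypeError(f"Cannot serialize {obj!r}")

print(json.dumps({"task_uuid": uuid.uuid4()}, default=encode_extra))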
Libraries I considered were orjson, rapidjson, ujson, and hyperjson.
I filtered out some of these based on the criteria above:
- At the time I originally wrote this article, ujson had a number of bugs filed regarding crashes, and no release since 2016. It looks like it’s being maintained again, but I haven’t gone back and revisited it.
- hyperjson only had packages for macOS, and in general seemed pretty immature. These days its maintainers just recommend using orjson instead.
Step #4: Benchmarking
The two final contenders were rapidjson and orjson.
I ran the following benchmark:
import time
import json
import orjson
import rapidjson
m = {
    "timestamp": 1556283673.1523004,
    "task_uuid": "0ed1a1c3-050c-4fb9-9426-a7e72d0acfc7",
    "task_level": [1, 2, 1],
    "action_status": "started",
    "action_type": "main",
    "key": "value",
    "another_key": 123,
    "and_another": ["a", "b"],
}

def benchmark(name, dumps):
    start = time.time()
    for i in range(1000000):
        dumps(m)
    print(name, time.time() - start)
benchmark("Python", json.dumps)
# orjson only outputs bytes, but often we need unicode:
benchmark("orjson", lambda s: str(orjson.dumps(s), "utf-8"))
benchmark("rapidjson", rapidjson.dumps)
And the results:
$ python jsonperf.py
Python 4.829106330871582
orjson 1.0466396808624268
rapidjson 2.1441543102264404
Even with the need for additional Unicode decoding, orjson is fastest (for this particular benchmark!).
As always, there are tradeoffs. orjson has fewer users than rapidjson (compare orjson PyPI stats to rapidjson PyPI stats), and there are no Conda packages, so I’d have to package it for Conda-forge myself. But it’s definitely a lot faster.
Your use case, your choice
Should you use orjson? Not necessarily.
You might have different requirements, and your benchmarks might be different—maybe you need to decode large files, for example.
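For example, if decoding large documents is what matters to you, the same benchmark structure works with loads() and a bigger input. This is just a sketch with a made-up document rather than a measurement I’ve run; swap in orjson.loads or rapidjson.loads the same way to compare candidates against your own data:
import time
import json

# A made-up "large file": many small records serialized into one document.
big_doc = json.dumps([{"id": i, "name": "user%d" % i} for i in range(100000)])

def benchmark(name, loads):
    start = time.time()
    for _ in range(10):
        loads(big_doc)
    print(name, time.time() - start)

benchmark("Python", json.loads)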
The key takeaway is the process: figure out your particular requirements, performance and otherwise, and choose the library that best meets your needs.