Choosing a faster JSON library for Python

The more you use JSON, the more likely you are to encounter JSON encoding or decoding as a bottleneck. Python’s built-in library isn’t bad, but there are multiple faster JSON libraries available: how do you choose which one to use?

The truth is there’s no one correct answer, no one fastest JSON library to rule them all:

  1. A “fast JSON library” means different things to different people, because their usage patterns are different.
  2. Speed isn’t everything—there are other things you may care about, like security and customization.

So to help you choose the fastest JSON library for your needs, I’d like to share the process I went through to choose a fast JSON library for Python. You can use this process to pick the library that best fits your particular needs:

  1. Make sure there really is a problem.
  2. Define the benchmark.
  3. Filter based on additional requirements.
  4. Benchmark the remaining candidates.

Step #1: Do you actually need a new JSON library?

Just because you use JSON doesn’t mean it’s a relevant bottleneck. Before you spend any time thinking about which JSON library to use, you need some evidence suggesting Python’s built-in JSON library really is a problem in your particular application.
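
One way to gather that evidence is to profile the code path you suspect. The following is a minimal sketch using the standard library's cProfile; `generate_messages` is a hypothetical stand-in for your application's hot path, not code from Eliot.

```python
import cProfile
import io
import json
import pstats


def generate_messages(count):
    """Hypothetical stand-in for the part of your application that emits JSON."""
    out = []
    for i in range(count):
        message = {"index": i, "tags": ["a", "b"], "payload": "x" * 20}
        out.append(json.dumps(message))
    return out


def profile_report(func, *args, limit=10):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


report = profile_report(generate_messages, 10_000)
print(report)
```

If the lines for `dumps` and the encoder account for only a small share of the cumulative time, a faster JSON library probably won't help much.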

In my case, I learned this from a benchmark for my causal logging library Eliot, which suggested that JSON encoding took up something like 25% of the CPU time used generating messages. The best speedup I could hope for is running about 33% faster (if JSON encoding time went to zero, the same work would take 75% of the original time), but that’s a big enough chunk of time that sooner or later it would make it to the top of the list.
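
The 33% figure is simple arithmetic: if a fraction of the runtime disappears, the rest of the work still takes the remaining time. A small sketch of that calculation follows; the function name `max_speedup` is mine, chosen for illustration.

```python
def max_speedup(fraction):
    """Upper bound on overall speedup if `fraction` of the runtime drops to zero."""
    if not 0 <= fraction < 1:
        raise ValueError("fraction must be in [0, 1)")
    return 1.0 / (1.0 - fraction)


# 25% of the time spent in JSON encoding leaves 75% of the runtime at best.
speedup = max_speedup(0.25)
print(f"{(speedup - 1) * 100:.0f}% faster")  # prints "33% faster"
```

Plugging in your own measured fraction tells you whether the best case is worth chasing at all.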

Step #2: Define the benchmark

If you look at the benchmark pages for various JSON libraries, they will talk about how they do on a variety of different messages. Those messages don’t necessarily correspond to your usage, however. Quite often they’re measuring very large messages, and in my case at least I care about small messages.

So you want to come up with some measure that matches your particular usage patterns:

  1. Do you care about encoding, decoding, or both?
  2. Are you using small or large messages?
  3. What do typical messages look like?
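
If you aren't sure how big your typical messages are, you can measure a handful of real ones. This is a rough sketch using only the standard library; `describe_messages` and the sample data are hypothetical, so substitute messages captured from your own application.

```python
import json
import statistics


def describe_messages(messages):
    """Summarize the encoded size, in bytes, of a list of JSON-compatible objects."""
    sizes = [len(json.dumps(m).encode("utf-8")) for m in messages]
    return {
        "count": len(sizes),
        "median_bytes": statistics.median(sizes),
        "max_bytes": max(sizes),
    }


# Hypothetical samples; in practice, load these from real logs or requests.
samples = [
    {"action_type": "main", "action_status": "started", "task_level": [1, 2, 1]},
    {"action_type": "main", "action_status": "succeeded", "task_level": [1, 3]},
]
summary = describe_messages(samples)
print(summary)
```

A median in the low hundreds of bytes suggests that per-call overhead matters more than raw throughput on big documents.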

In my case I mostly care about encoding small messages, specifically the structure of log messages generated by Eliot. I came up with the following sample message, based on some real logs:

{
    "timestamp": 1556283673.1523004,
    "task_uuid": "0ed1a1c3-050c-4fb9-9426-a7e72d0acfc7",
    "task_level": [1, 2, 1],
    "action_status": "started",
    "action_type": "main",
    "key": "value",
    "another_key": 123,
    "and_another": ["a", "b"]
}

Step #3: Filter based on additional requirements

Performance isn’t everything—there are other things you might care about. In my case:

  1. Security/crash resistance: log messages can contain data that comes from untrusted sources. If the JSON encoder crashes on bad data, that is bad for both reliability and security.
  2. Custom encoding: Eliot supports customization of JSON encoding, so you can serialize additional kinds of Python objects. Some JSON libraries support this, others do not.
  3. Cross-platform: runs on Linux, macOS, Windows.
  4. Maintained: I don’t want to rely on a library that isn’t being actively supported.
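
The first two requirements can be partly checked with a quick script. Here is a sketch, shown against the standard library's `json.dumps`: `probe` feeds a candidate encoder some awkward inputs and records what happens, and `encode_extra` shows custom encoding via the `default` hook. The input list and both function names are my own illustrations, and other libraries may spell their customization hook differently. Note that a hard crash (a segfault) can't be caught this way; for that you would need to run the probe in a subprocess.

```python
import json
from datetime import datetime, timezone

# Inputs that have tripped up encoders in general; purely illustrative.
AWKWARD_INPUTS = {
    "lone surrogate": "\ud800",
    "huge integer": 2 ** 100,
    "not-a-number float": float("nan"),
    "non-string key": {1: "one"},
    "unsupported type": {1, 2, 3},
}


def probe(dumps):
    """Call dumps on each awkward input; record "ok" or the exception name."""
    results = {}
    for label, value in AWKWARD_INPUTS.items():
        try:
            dumps(value)
            results[label] = "ok"
        except Exception as exc:  # a clean Python exception is acceptable
            results[label] = type(exc).__name__
    return results


def encode_extra(obj):
    """Custom encoding hook: serialize datetimes as ISO 8601 strings."""
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"cannot serialize {type(obj).__name__}")


results = probe(json.dumps)
print(results)

when = datetime(2019, 4, 26, tzinfo=timezone.utc)
custom = json.dumps({"when": when}, default=encode_extra)
print(custom)
```

An encoder that raises a normal exception on bad input is fine; one that takes down the interpreter is not.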

Libraries I considered were orjson, rapidjson, ujson, and hyperjson.

I filtered out some of these based on the criteria above:

  • ujson has a number of bugs filed regarding crashes, and even those crashes that have been fixed aren’t always available because there hasn’t been a release since 2016.
  • hyperjson only has packages for macOS, and in general seems pretty immature.

Step #4: Benchmarking

The two final contenders were rapidjson and orjson. I ran the following benchmark:

import time
import json
import orjson
import rapidjson

m = {
    "timestamp": 1556283673.1523004,
    "task_uuid": "0ed1a1c3-050c-4fb9-9426-a7e72d0acfc7",
    "task_level": [1, 2, 1],
    "action_status": "started",
    "action_type": "main",
    "key": "value",
    "another_key": 123,
    "and_another": ["a", "b"],
}

def benchmark(name, dumps):
    # perf_counter() is a monotonic, high-resolution clock meant for timing.
    start = time.perf_counter()
    for _ in range(1000000):
        dumps(m)
    print(name, time.perf_counter() - start)

benchmark("Python", json.dumps)
# orjson only outputs bytes, but often we need unicode:
benchmark("orjson", lambda s: str(orjson.dumps(s), "utf-8"))
benchmark("rapidjson", rapidjson.dumps)

And the results:

$ python jsonperf.py 
Python 4.829106330871582
orjson 1.0466396808624268
rapidjson 2.1441543102264404

Even with the need for additional Unicode decoding, orjson is fastest (for this particular benchmark!).
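
A single timed run can be noisy. If you want more stable numbers, one option is the standard library's timeit module, taking the best of several repeats. This is a sketch rather than the benchmark I actually ran: the `best_time` helper is my own naming, and orjson and rapidjson are only included if they happen to be installed.

```python
import json
import timeit

MESSAGE = {
    "timestamp": 1556283673.1523004,
    "task_uuid": "0ed1a1c3-050c-4fb9-9426-a7e72d0acfc7",
    "task_level": [1, 2, 1],
    "action_status": "started",
    "action_type": "main",
    "key": "value",
    "another_key": 123,
    "and_another": ["a", "b"],
}


def best_time(dumps, number=20_000, repeat=5):
    """Fastest of `repeat` runs, each calling dumps(MESSAGE) `number` times."""
    return min(timeit.repeat(lambda: dumps(MESSAGE), number=number, repeat=repeat))


candidates = {"Python": json.dumps}
try:
    import orjson
    # orjson returns bytes, so decode to get a str like the other libraries.
    candidates["orjson"] = lambda m: orjson.dumps(m).decode("utf-8")
except ImportError:
    pass
try:
    import rapidjson
    candidates["rapidjson"] = rapidjson.dumps
except ImportError:
    pass

results = {name: best_time(dumps) for name, dumps in candidates.items()}
for name, seconds in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.3f}s")
```

Taking the minimum rather than the mean reduces the influence of other processes competing for the CPU during the run.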

As always, there are tradeoffs. orjson has fewer users than rapidjson (compare orjson PyPI stats to rapidjson PyPI stats), and there are no Conda packages, so I’d have to package it for Conda-forge myself. But it’s definitely a lot faster.

Your use case, your choice

Should you use orjson? Not necessarily. You might have different requirements, and your benchmarks might be different—maybe you need to decode large files, for example.
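
For instance, if decoding large documents is your use case, the benchmark would look quite different from mine. Here is a rough sketch using only the standard library; the generated records are a hypothetical placeholder for a real file, and you would pass each candidate library's `loads` function to `best_decode_time` (a helper name I made up for this example).

```python
import json
import timeit

# Hypothetical large document: 50,000 small records in a single list.
records = [{"id": i, "name": f"item-{i}", "tags": ["a", "b"]} for i in range(50_000)]
encoded = json.dumps(records)


def best_decode_time(loads, text, repeat=3):
    """Fastest of `repeat` single decodes of `text` using `loads`."""
    return min(timeit.repeat(lambda: loads(text), number=1, repeat=repeat))


seconds = best_decode_time(json.loads, encoded)
print(f"json.loads: {len(encoded) / 1e6:.1f} MB in {seconds:.3f}s")
```

The ranking of libraries on a benchmark like this may not match their ranking on small-message encoding.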

The key takeaway is the process: figure out your particular requirements, performance and otherwise, and choose the library that best meets your needs.



