Loading Pydantic models from JSON without running out of memory

You have a large JSON file, and you want to load the data into Pydantic. Unfortunately, this uses a lot of memory, to the point where loading a large JSON file may not be possible at all. What to do?

Assuming you’re stuck with JSON, in this article we’ll cover:

  • The high memory usage you get with Pydantic’s default JSON loading.
  • How to reduce memory usage by switching to another JSON library.
  • Going further by switching to dataclasses with slots.

The problem: 20× memory multiplier

We’re going to start with a 100MB JSON file, and load it into Pydantic (v2.11.4). Here’s what our model looks like:

from pydantic import BaseModel, RootModel

class Name(BaseModel):
    first: str | None
    last: str | None

class Customer(BaseModel):
    id: str
    name: Name
    notes: str

# Map id to corresponding Customer:
CustomerDirectory = RootModel[dict[str, Customer]]

The JSON we’re loading looks more or less like this:

{
    "123": {
        "id": "123",
        "name": {
            "first": "Itamar",
            "last": "Turner-Trauring"
        },
        "notes": "Some notes about Itamar"
    },
    # ... etc ...
}
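If you want to reproduce these measurements, you’ll need a similarly-shaped file. Here’s a sketch of a generator script; the record count and field contents are invented for illustration, only the structure matches the example above, and the exact file size will vary:

import json

# Hypothetical script to generate a roughly 100MB customers.json;
# tweak num_customers to change the size:
num_customers = 250_000
customers = {}
for i in range(num_customers):
    cid = str(i)
    customers[cid] = {
        "id": cid,
        "name": {"first": f"First{i}", "last": f"Last{i}"},
        # Padding to give each record a realistic size:
        "notes": f"Some notes about customer {i}. " * 10,
    }

with open("customers.json", "w") as f:
    json.dump(customers, f)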

Pydantic has built-in support for loading JSON, though sadly it doesn’t support reading from a file. So we read the whole file into memory and then parse it:

with open("customers.json", "rb") as f:
    raw_json = f.read()
    directory = CustomerDirectory.model_validate_json(
        raw_json
    )

This is very straightforward.

But there’s a problem: if we measure peak memory usage, it turns out to be very high:

$ /usr/bin/time -v python v1.py
...
Maximum resident set size (kbytes): 2071620
...

That’s around 2000MB of memory, 20× the size of the JSON file. If our JSON file were 10GB, memory usage would be 200GB, and we’d probably run out of memory.
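As an aside, /usr/bin/time -v is Linux-specific. On Unix-like systems you can also read the peak resident set size from inside the process with the standard library’s resource module; a minimal sketch (note that ru_maxrss is reported in kilobytes on Linux, but in bytes on macOS):

import resource

# Peak resident set size of this process so far;
# kilobytes on Linux, bytes on macOS:
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"Peak memory usage: {peak_kb / 1024:.0f} MB")  # assuming Linux

Can we do better?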

Reducing memory usage

There are two fundamental sources of peak memory usage when parsing JSON:

  1. The memory used during parsing; many JSON parsers aren’t careful about memory usage, and use more than necessary.
  2. The memory used by the final representation, the objects we’re creating.

We’ll try to reduce memory usage in each.
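To get a rough sense of how these two sources split, here’s a sketch using the standard library’s json and tracemalloc modules. It only counts allocations made through Python’s memory allocator, so it won’t see everything a parser like Pydantic’s Rust core allocates, but it illustrates the distinction:

import json
import tracemalloc

with open("customers.json", "rb") as f:
    raw_json = f.read()

tracemalloc.start()
data = json.loads(raw_json)
retained, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Memory still allocated after parsing approximates source 2,
# the final representation; the rest of the peak approximates
# source 1, transient parsing overhead:
print(f"final representation: ~{retained / 1e6:.0f} MB")
print(f"parsing overhead: ~{(peak - retained) / 1e6:.0f} MB")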

1. Memory-efficient JSON parsing

We’ll use ijson, an incremental JSON parser that lets us stream the JSON document we’re parsing. Instead of loading the whole document into memory, we’ll load it one key/value pair at a time. The result is that most of the memory usage will now come from the in-memory representation of the resulting objects, rather than parsing:

import ijson

with open("customers.json", "rb") as f:
    # We'll create the root dictionary ourselves:
    data = {}
    # The empty string is part of ijson's query language,
    # in this case it means "iterate over top-level", and
    # since we're using kvitems() that means top-level
    # key-value pairs in the root JSON object/dict:
    for cid, cust_dict in ijson.kvitems(f, ""):
        # Create a Customer for the value dict:
        customer = Customer.model_validate(cust_dict)
        # Store it in the root dict using the key:
        data[cid] = customer
    # And now create the root object:
    directory = CustomerDirectory.model_validate(data)

While parsing this way is significantly slower (around 5×), it cuts peak memory usage to just 1200MB.

It also requires us to do a bit more of the parsing work ourselves, but anything below the top-level JSON object or list can still be handled by Pydantic.
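For example, if your file were a top-level JSON array of customer objects rather than an object keyed by id, you could use ijson.items() instead, and still let Pydantic validate each item. A sketch, assuming a hypothetical customers_list.json:

import ijson

with open("customers_list.json", "rb") as f:
    customers = []
    # "item" is ijson's prefix for each element of a
    # top-level JSON array:
    for cust_dict in ijson.items(f, "item"):
        customers.append(Customer.model_validate(cust_dict))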

2. Memory-efficient representation

We’re creating a lot of Python objects, and one way to save memory on Python objects is to use “slots”. Essentially, slots give a class a more efficient in-memory representation, where the list of possible attributes is fixed. This saves memory at the cost of disallowing the addition of extra attributes to instances; in practice that’s rarely needed, so it’s often a good tradeoff.
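To see what that restriction looks like, here’s a minimal plain-Python example, with no Pydantic involved:

class SlottedName:
    # The attribute list is fixed; instances don't carry a
    # per-object __dict__, which is where the savings come from:
    __slots__ = ("first", "last")

    def __init__(self, first, last):
        self.first = first
        self.last = last

name = SlottedName("Itamar", "Turner-Trauring")
name.first = "I."      # fine: "first" is a declared slot
name.nickname = "IT"   # AttributeError: no such slot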

Unfortunately, pydantic.BaseModel doesn’t seem to support that at the moment, so I switched to Pydantic’s dataclass support, which does. Here’s our new model:

from pydantic import RootModel
from pydantic.dataclasses import dataclass

# Create a class using slots; this means you can't
# add additional attributes, but it will use less memory:
@dataclass(slots=True)
class Name:
    first: str | None
    last: str | None

@dataclass(slots=True)
class Customer:
    id: str
    name: Name
    notes: str

# Map id to corresponding Customer:
CustomerDirectory = RootModel[dict[str, Customer]]

And we also need to tweak our parsing code slightly:

import ijson

with open("customers.json", "rb") as f:
    data = {}
    for cust_id, cust_dict in ijson.kvitems(f, ""):
        customer = Customer(**cust_dict)
        data[cust_id] = customer
    directory = CustomerDirectory.model_validate(data)

With this version of the code, memory usage has shrunk to 450MB.
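Whichever representation you choose, the parsed data is used the same way, through the RootModel’s root attribute. For example, with the id from the sample JSON above:

customer = directory.root["123"]
print(customer.name.first)  # Itamar
print(customer.notes)       # Some notes about Itamar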

Final thoughts

Here’s a summary of peak memory usage when parsing a 100MB JSON file with the three techniques we covered:

Implementation                    Peak memory usage (MB)
Model.model_validate_json()       2000
ijson                             1200
ijson + @dataclass(slots=True)    450

This particular use case, of loading a large number of objects, may not be something Pydantic developers care about, or have the time to prioritize. But it would certainly be possible for Pydantic to internally work more like ijson, and to add the option for using __slots__ to BaseModel. The end result would use far less memory, while still benefiting from Pydantic’s faster JSON parser.

Until then, you have options you can implement yourself.

Learn even more techniques for reducing memory usage—read the rest of the Larger-than-memory datasets guide for Python.