Choosing a good file format for Pandas

Before you can process your data with Pandas, you need to load it (from disk or remote storage). Pandas supports plenty of data formats, from CSV to JSON to Parquet, and many others besides.

Which should you use?

  • You don’t want loading the data to be slow, or use lots of memory: that’s pure overhead. Ideally you’d want a file format that’s fast, efficient, small, and broadly supported.
  • You also want to make sure the loaded data has all the right types: numeric types, datetimes, and so on. Some data formats do a better job at this than others.

While there is no one true answer that works for everyone, this article will try to help you narrow down the field and make an informed decision.

“Best” is situation-specific

Different use cases imply different requirements.

Sharing data with outside organizations

If you need to share data with other organizations, or even other teams within your organization, you need to limit yourself to data formats you know they will be able to process. That is very situation-specific, so it’s difficult to give a universal answer.

Processing incoming data

If someone is handing you a file, they control the format. And if you will only ever process it once, changing the file format may not be worth the trouble.

Streaming records vs. full-file processing

If you are streaming data over the network and want to process it row-by-row as it arrives, this implies a very different data format: you want something that makes for easy row-based parsing. CSV is actually pretty good at this, even though as we’ll see it’s otherwise an annoying format to work with.
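
For illustration, here's a minimal sketch of that style of row-by-row processing using Python's built-in csv module; the file name and column name are made up:

import csv

# Stream rows one at a time instead of loading the whole file into memory:
with open("incoming_data.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Each row arrives as a dict of strings, parsed as soon as it's read:
        print(row["route_id"])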

If you’re using Pandas, you’re less likely to be doing this sort of processing.

The situation we’re considering: internal datasets

To help limit the scope of discussion, we’ll assume you’re using large datasets you write to disk in one internal process, then read the data later in one or more additional internal processes.

We can think about multiple criteria we’d want for a data format:

  • Types: Some data is numeric, some data is composed of strings, other data might be time-based; it’s useful to be able to distinguish between these different types.
  • Efficient disk format: Minimize how much space is used on disk.
  • Efficient reading and writing: Writing to disk and loading from disk should be fast. Memory usage should be low when loading.

Bonus requirement: interoperability

You might not be using Pandas forever, or you might want to use other libraries in certain situations. So as a stand-in for future flexibility, we’ll also add the requirement that you should be able to easily process your data with Polars, an up-and-coming Pandas replacement. That rules out many data formats that Polars doesn’t support yet, and formats that are too Pandas-specific like pickling a Pandas DataFrame.
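
As a quick illustration, a format like Parquet (discussed below) can be loaded by both libraries; this is just a sketch with a made-up file name, assuming Polars is installed:

import polars as pl

# Polars can read the same Parquet files that Pandas writes:
df = pl.read_parquet("dataset.parquet")
print(df.schema)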

Reviewing the candidate data formats

Given the above criteria, we’re going to consider three formats: CSV, JSON, and Parquet.

Candidate #1: CSV

Types: CSV has no inherent notion of types; sometimes strings will be in quotes, but even that isn’t guaranteed. There have been some efforts at coming up with a schema language for CSV, e.g. the badly named “CSV Schema”. The name is very hard to search for, and I don’t see any significant Python library that implements this or any other standard (please let me know if I’m wrong!).
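
To make the problem concrete, here's a small sketch (with made-up column names) showing that a date column comes back as plain strings unless you explicitly ask Pandas to parse it:

import io
import pandas as pd

csv_data = "name,installed\nBroadway,2024-03-01\n"

# Without hints, the date column is loaded as strings (dtype "object"):
df = pd.read_csv(io.StringIO(csv_data))
print(df["installed"].dtype)  # object

# With parse_dates, Pandas converts it to a proper datetime column:
df = pd.read_csv(io.StringIO(csv_data), parse_dates=["installed"])
print(df["installed"].dtype)  # datetime64[ns]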

Efficient disk format: A column storing a year would work fine as a 16-bit integer, requiring two bytes to record, but in a CSV it’s always written out as four text characters (four bytes), plus the overhead of the comma separator. The result is an inefficient data representation, which can be somewhat improved by compression.
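
As a small aside, compression is easy to get in Pandas: it's inferred from the file extension by default, so writing to a name ending in .gz produces a gzipped CSV. A sketch with made-up data:

import pandas as pd

df = pd.DataFrame({"installation_year": [2017, 2023, 2024]})

# compression="infer" is the default, so the ".gz" suffix gives a gzipped CSV:
df.to_csv("years.csv.gz", index=False)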

Efficient reading and writing: CSV is row-oriented, which does make streaming parsing easier but also means you can’t just load a particular column. Because everything is represented as text, many data types (from numbers to dates) will require parsing during loading.

Candidate #2: JSON

JSON data can be structured in multiple different ways. Some are superficially column-oriented, for example:

{
    "name": ["Cambridge St.", "Hampshire St.", "Broadway"],
    "installation_year": [2017, 2023, 2024],
}

In theory, a parser could take advantage of that and parse only the columns you are interested in, but not all JSON parsers support this. In practice, Pandas doesn’t expose this option at all.

And of course data can be formatted many other ways, like this:

[
    {"name": "Cambridge St.", "installation_year": 2017},
    {"name": "Hampshire St.", "installation_year": 2023},
    {"name": "Broadway", "installation_year": 2024},
]

Or like this:

[
    ["name", "installation_year"],
    ["Cambridge St.", 2017],
    ["Hampshire St.", 2023],
    ["Broadway", 2024],
]

Or you can use the JSON-document-per-line format, also known as “JSON lines”, where the file contains multiple JSON documents, one per line:

{"name": "Cambridge St.", "installation_year": 2017}
{"name": "Hampshire St.", "installation_year": 2023}
{"name": "Broadway", "installation_year": 2024}

Or perhaps a variation closer to CSV:

["name", "installation_year"]
["Cambridge St.", 2017]
["Hampshire St.", 2023]
["Broadway", 2024]

Moving on to our criteria:

Types: JSON does have types like strings and 64-bit floating point numbers, but it’s still far more limited than the datatypes supported by Pandas. So in the end you may need to encode and decode data manually from strings. In addition, JSON does not enforce consistent types across rows or within columns (depending on which structure you’re using; see the examples above).

You can use an external schema system like JSON Schema to validate data. Alternatively, Pandas has built-in support for Table Schema, which is designed for exactly this kind of tabular data and also lets Pandas parse the data back with the correct types automatically.
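
For example, writing with orient="table" embeds a Table Schema alongside the data, so a round trip preserves column types; a minimal sketch with made-up data:

import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Cambridge St.", "Broadway"],
        "installed": pd.to_datetime(["2017-06-01", "2024-03-01"]),
    }
)

# orient="table" writes the values plus a Table Schema describing their types:
df.to_json("streets.json", orient="table")

# When reading, the embedded schema lets Pandas restore the datetime column:
df2 = pd.read_json("streets.json", orient="table")
print(df2.dtypes)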

Efficient disk format: Depending on the structure, the files can be more or less inefficient, but none of them are great: the installation_year column would work fine as a 16-bit integer, requiring two bytes to record, but it’s always written out as four text characters (four bytes), plus varying amounts of structural overhead. Data size can be somewhat improved by compression.

Efficient reading and writing: When reading data, if you use JSON-document-per-line, Pandas can use chunking to read a few lines at a time. Otherwise, it will read the whole JSON document into memory before parsing it, which uses a lot of extra memory unnecessarily.
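
A sketch of what chunked reading looks like (file name made up): with lines=True you can also pass chunksize, and Pandas returns an iterator of smaller DataFrames instead of one big one.

import pandas as pd

# chunksize only works together with lines=True; each chunk is a DataFrame
# of at most 10,000 rows, so the whole file is never parsed in one go:
reader = pd.read_json("records.jsonl", lines=True, chunksize=10_000)
for chunk in reader:
    print(len(chunk))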

When writing data, Pandas’ implementation is quite inefficient memory-wise: it will render the data to a string in memory, and then write it to disk.

Candidate #3: Parquet

Parquet is a data format designed specifically for the kind of data that Pandas processes.

Types: Parquet supports a variety of integer and floating point numbers, dates, categoricals, and much more.

Efficient disk format: Parquet uses compact representation of data, so a 16-bit integer will take two bytes. It additionally supports compression.

Efficient reading and writing: Data is stored column by column, so you can load only the columns you need, and it’s broken into chunks so you don’t always have to load all the rows. In addition, the more machine-friendly data representation means far less parsing is necessary to load data.
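
As a sketch of what that looks like in practice (with made-up names, and assuming a Parquet engine like pyarrow or fastparquet is installed), you can write a DataFrame with rich types, then load back only the columns you need; categoricals, datetimes, and small integer types survive the round trip:

import pandas as pd

df = pd.DataFrame(
    {
        "route_id": pd.Series(["1", "66", "1"], dtype="category"),
        "scheduled": pd.to_datetime(
            ["2024-03-01 10:00", "2024-03-01 10:05", "2024-03-01 10:10"]
        ),
        "delay_seconds": pd.Series([30, 95, 12], dtype="int16"),
    }
)

# Parquet stores the types in the file and compresses by default (Snappy):
df.to_parquet("bus_times.parquet")

# Columnar storage means you can load just the columns you care about:
delays = pd.read_parquet("bus_times.parquet", columns=["route_id", "delay_seconds"])
print(delays.dtypes)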

For all these criteria, Parquet is superior.

Benchmarking the alternatives

To give a sense of real-world performance, we’ll run a test of loading some data generated by the local transit authority: bus timing data. It includes numbers, categoricals, and datetimes, so it lets us see the cost of loading a variety of data types.

We’ll use the following CSV loading code:

import sys
import pandas as pd


df = pd.read_csv(
    sys.argv[1],
    dtype={
        "route_id": "category",
        "direction_id": "category",
        "point_type": "category",
        "standard_type": "category",
    },
    parse_dates=["service_date", "scheduled", "actual"],
)

For JSON we’ll use the table-oriented format that includes a schema, created with DataFrame.to_json(orient='table'):

import sys
import pandas as pd

df = pd.read_json(sys.argv[1], orient="table")

We’ll also use line-oriented format created with DataFrame.to_json(orient="records", lines=True). Pandas supports two different implementations or “engines” for this format: "ujson" (the default), and "pyarrow".

import sys
import pandas as pd

df = pd.read_json(
    sys.argv[1],
    orient="records",
    lines=True,
    dtype={
        "route_id": "category",
        "direction_id": "category",
        "point_type": "category",
        "standard_type": "category",
    },
    convert_dates=["service_date", "scheduled", "actual"],
    engine=sys.argv[2],
)

For Parquet loading we’ll use fastparquet, since it seems more memory efficient:

import sys
import pandas as pd

df = pd.read_parquet(sys.argv[1], engine="fastparquet")


The results

Here are the results for time and memory to load the data; for all numbers lower is better:

Format                         File size   Clock seconds   CPU seconds   Peak resident memory
gzipped CSV                    30MB        2.8             3.6           787MB
gzipped JSON table             36MB        10.0            10.8          6,796MB
gzipped JSON lines (ujson)     31MB        11.2            12.0          8,056MB
gzipped JSON lines (pyarrow)   31MB        1.0             4.4           1,330MB
Parquet (fastparquet)          20MB        0.4             1.2           297MB

Takeaways from the performance tests

  • Parsing JSON with Pandas is expensive; some custom processing with a streaming JSON parser might be better (see the sketch after this list).
  • If you are going to use JSON, use the JSON lines format, with the "pyarrow" engine.
  • If you have a choice of data format, Parquet is fastest to parse and uses the least memory.
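
To illustrate the first takeaway, here's a rough sketch of the kind of custom processing you might do instead, using the standard library's json module on a JSON-lines file as a simple stand-in for a true streaming parser (the file name and fields are made up); whether it's actually faster will depend on your data:

import json
import pandas as pd

rows = []
# Parse one JSON document per line, keeping only the fields we care about:
with open("records.jsonl") as f:
    for line in f:
        record = json.loads(line)
        rows.append((record["route_id"], record["actual"]))

df = pd.DataFrame(rows, columns=["route_id", "actual"])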

Choosing a data format for your situation

Technical solutions are context-specific, and will vary based on your particular situation. Nonetheless, Parquet seems like a good data format for many situations:

  • It supports a broad range of types.
  • It’s designed for efficient storage and efficient loading.
  • Empirically, it uses less CPU and memory to load than other data formats.
  • It’s supported by many libraries: Pandas, but also alternatives like Polars and DuckDB.