The fastest way to read a CSV in Pandas
You have a large CSV that you’re going to read into Pandas, but every time you load it, you have to wait for the CSV to parse. That slows down your development feedback loop, and might meaningfully slow down your production processing.
But it’s possible to read the data in faster. Let’s see how.
In this article we’ll cover:
- Pandas’ default CSV reading.
- The faster, more parallel CSV reader introduced in v1.4.
- A different approach that can make things even faster.
Reading a CSV, the default way
I happened to have an 850MB CSV lying around with the local transit authority’s bus delay data, as one does. Here’s the default way of loading it with Pandas:
import pandas as pd
df = pd.read_csv("large.csv")
Here’s how long it takes, by running our program using the time utility:
$ time python default.py
real 0m13.245s
user 0m11.808s
sys 0m1.378s
If you’re not familiar with the time utility’s output, I recommend reading my article on the topic, but basically “real” is the elapsed time on a wallclock, and the other two measures are CPU time broken down by time running in application code (“user”) and time running in the Linux kernel (“sys”).
The Pandas CSV reader has multiple backends; this is the "c" one, written in C. If we use the "python" backend it runs much slower, but I won’t bother demonstrating that because it’s, well, slow.
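If you want to try the slow parser yourself, switching is just a matter of passing read_csv’s engine argument; this is standard Pandas API, shown here as a small sketch rather than a timed benchmark:

import pandas as pd

# The pure-Python parser: more flexible than the C engine, but much slower.
df = pd.read_csv("large.csv", engine="python")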
Reading a CSV with PyArrow
In Pandas 1.4, released in January 2022, there is a new backend for CSV reading, relying on the Arrow library’s CSV parser. It’s still marked as experimental, and it doesn’t support all the features of the default parser—but it is faster.
Here’s how we use it:
import pandas as pd
df = pd.read_csv("large.csv", engine="pyarrow")
And when we run it:
$ time python arrow.py
real 0m2.707s
user 0m4.945s
sys 0m1.527s
Let’s compare the two implementations:
CSV parser | Elapsed time | CPU time (user+sys) |
---|---|---|
Default C | 13.2 seconds | 13.2 seconds |
PyArrow | 2.7 seconds | 6.5 seconds |
Focusing first on the amount of CPU time, the PyArrow implementation uses half as much CPU. So that’s a good improvement.
Second, the improvement in elapsed time is even bigger: the elapsed time is actually much lower than the CPU time. That’s because it’s using parallelism: unlike the default backend, it’s taking advantage of the fact that my computer has multiple cores.
Now, parallelism may or may not be a benefit, depending on how you ran your code. If you previously only ran it on a single core, that’s a free performance improvement. But if you were already manually utilizing multiple cores, for example by loading multiple CSV files in parallel, adding parallelism here won’t speed things up, and potentially could slow things down a little.
However, given that the PyArrow backend is also inherently faster, seeing as the total CPU time was cut in half, it’s likely to provide meaningful speedups even if you already had parallelism.
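If you do want to manage the parallelism yourself, here’s a minimal sketch of the multiple-CSVs-in-parallel approach using the standard library’s process pool; the file names are made up for illustration:

import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def load_csv(path):
    # Each worker process parses one CSV with the default engine.
    return pd.read_csv(path)

if __name__ == "__main__":
    # Hypothetical file names; substitute your own.
    paths = ["part1.csv", "part2.csv", "part3.csv"]
    with ProcessPoolExecutor() as pool:
        dataframes = list(pool.map(load_csv, paths))

Keep in mind that each resulting DataFrame has to be serialized and sent back to the parent process, so this approach has overhead of its own.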
Note: Whether or not any particular tool or technique will speed things up depends on where the bottlenecks are in your software.
Rethinking the problem
Loading a CSV is fundamentally a lot of work:
- You need to split the file into lines.
- You need to split each line on commas.
- You need to deal with string quoting.
- You need to guess(!) the types of columns, unless you explicitly pass them to Pandas (see the sketch just after this list).
- You need to convert strings into integers, dates, and other non-string types.
All of this takes CPU time.
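To make the type-guessing point concrete, here’s a hedged sketch of passing explicit column types to read_csv; the column names and dtypes below are hypothetical, not the real schema of the bus delay file:

import pandas as pd

# Hypothetical column names and types; substitute your file's actual schema.
dtypes = {"route_id": "int32", "stop_name": "string", "delay_minutes": "float64"}
df = pd.read_csv("large.csv", dtype=dtypes)

With explicit types Pandas can skip the guessing, though it still has to do all the splitting, quoting, and string-to-value conversion.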
And if you’re getting a CSV from a third party, and you’re only processing it once, there’s not much you can do about this. But what if you’re loading the same CSV multiple times? Or, alternatively, what if you are the one generating the input file in some other part of your data processing pipeline?
Instead of reading in a CSV, you could read in some other file format that is faster to process. Let’s see an example, using the Parquet data format. Parquet files are designed to be read quickly: you don’t have to do as much parsing as you would with CSV. And unlike CSV, where the column type is not encoded in the file, in Parquet the columns have types stored in the actual file.
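As a small illustration of that last point, here’s a self-contained sketch, using a made-up two-row DataFrame rather than the bus delay data, showing that Parquet round-trips column types while CSV loses them:

import pandas as pd

df = pd.DataFrame({
    "delay_minutes": [1.5, 3.0],
    "scheduled": pd.to_datetime(["2022-01-01 08:00", "2022-01-01 08:10"]),
})
df.to_csv("example.csv", index=False)
df.to_parquet("example.parquet")

# The datetime column comes back as plain strings from the CSV,
# but keeps its datetime64 dtype when read back from Parquet.
print(pd.read_csv("example.csv").dtypes)
print(pd.read_parquet("example.parquet").dtypes)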
First, we’ll convert the CSV file to a Parquet file; we disable compression so we’re doing a more apples-to-apples comparison with the CSV. Of course, if you’re the one generating the file in the first place, you don’t need a conversion step, you can just write your data straight to Parquet.
import pandas as pd

# One-time conversion: read the CSV, then write it back out as Parquet.
df = pd.read_csv("large.csv")
df.to_parquet("large.parquet", compression=None)
We run this once:
$ time python convert.py
real 0m18.403s
user 0m15.695s
sys 0m2.107s
We can now read the Parquet file; the fastparquet engine seems to be the faster of the two options on my computer, but you can also try the pyarrow backend.
import pandas as pd
df = pd.read_parquet("large.parquet", engine="fastparquet")
If we run this:
$ time python parquet.py
real 0m2.441s
user 0m1.990s
sys 0m0.575s
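If you want to check which Parquet engine is faster on your machine, swapping them is a one-line change; this sketch just re-reads the same file with the other engine:

import pandas as pd

# Same file, but using the pyarrow engine instead of fastparquet.
df = pd.read_parquet("large.parquet", engine="pyarrow")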
To compare:
Parser | Elapsed time | CPU time (user+sys) |
---|---|---|
Default CSV | 13.2 seconds | 13.2 seconds |
PyArrow CSV | 2.7 seconds | 6.5 seconds |
fastparquet | 2.4 seconds | 2.6 seconds |
Measured purely by CPU, fastparquet is by far the fastest.
Whether it gives you an elapsed time improvement will depend on whether you have existing parallelism or not, your particular computer, and so on.
And different CSV files will presumably have different parsing costs; this is just one example.
But clearly reading the Parquet format is much more efficient.
The best CSV is no CSV
CSV is a bad format. Besides the inefficiency of parsing it, the lack of type data means parsing is always going to be more error-prone and ambiguous than a structured file format with actual column types. So if you can, avoid using CSV and use a better format, for example Parquet.
If you are stuck with CSV, consider using the new PyArrow CSV parser in Pandas 1.4; you’ll get a nice speed-up, especially if your program is not currently taking advantage of multiple CPUs.