Fast subsets of large datasets with Pandas and SQLite
Let’s say you have a large amount of data, too large to fit in memory, and you want to load part of it into Pandas. If you’re only going to load one specific subset a single time, you can use chunking.
But what if you need to load different subsets of the data at different times? Loading the whole file each time will be slow.
What you need is a searchable index, and one easy (and fast!) way to do that is with SQLite.
A first attempt: chunking
Let’s consider a concrete example: you’re running a political campaign, and you have a CSV with details of every registered voter in your city. You’re sending out people to knock on doors, in different neighborhoods on different days, and so you want to load all the registered voters for a particular street.
Now, Pandas dataframes do have an index, but only after they’ve been loaded into memory. And this CSV is too big to load into memory, so you only want to load the records you actually care about.
Here’s a first approach, using chunking:
import pandas as pd

def get_voters_on_street(name):
    return pd.concat(
        df[df["street"] == name] for df in
        pd.read_csv("voters.csv", chunksize=1000)
    )
We load the CSV in chunks (a series of small DataFrames), filter each chunk by the street name, and then concatenate the filtered rows.
We’re loading and parsing every single row even though we only care about a small subset, so there’s a lot of overhead. For example, for a voter database with only 70,000 voters the function above took 574ms on my computer. As of 2018 NYC had 4.6 million registered voters, so every street lookup could take on the order of 30 seconds.
If we only do this once, that’s fine. If we need to do it over and over again with different streets, that performance may not be acceptable.
You need an index
An index is a summary, a way of saying “if you care about this, you can find the bulk of the data here”. In our example, we want to index by the street name, so we can quickly load only those voters who live on a particular street.
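To make the idea concrete, here is a conceptual sketch of an index built by hand, using a toy DataFrame small enough to fit in memory (the data and column values are made up for illustration):

import pandas as pd
from collections import defaultdict

# A toy DataFrame standing in for data that does fit in memory:
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "street": ["Main St", "Oak Ave", "Main St"],
})

# Build the "index": a mapping from street name to row positions.
street_index = defaultdict(list)
for position, street in enumerate(df["street"]):
    street_index[street].append(position)

# A lookup now touches only the matching rows:
main_st_voters = df.iloc[street_index["Main St"]]

Real databases build and store structures like this on disk for you, which is exactly what we want for data that doesn’t fit in memory.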
If you want to index data that doesn’t fit in memory, databases support this out of the box. A database like PostgreSQL or MySQL can add a lot of operational overhead, though: you don’t necessarily want to install and maintain a whole server process. And that’s where SQLite comes in.
SQLite is a fully-featured relational database that runs as a library, not a server; Python ships with built-in support for it. And SQLite stores its data in a single file: instead of having to manage one CSV file, you have to manage one SQLite database file.
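As a quick sanity check (separate from the voter example), the sketch below shows that there is no server to install or start; connecting to a database file, or even to a throwaway in-memory database, is a single call:

import sqlite3

# ":memory:" creates a temporary in-memory database, handy for testing.
conn = sqlite3.connect(":memory:")
print(conn.execute("SELECT sqlite_version()").fetchone())
conn.close()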
Using SQLite as data storage for Pandas
Let’s see how you can use SQLite from Pandas in two easy steps:
1. Load the data into SQLite, and create an index
SQLite databases can store multiple tables.
The first thing we’re going to do is load the data from voters.csv into a new file, voters.sqlite, where we will create a new table called voters. We’ll also want to tell SQLite to create an index on the street column.
We only need to do this once:
import sqlite3

# Create a new database file:
db = sqlite3.connect("voters.sqlite")

# Load the CSV in chunks:
for c in pd.read_csv("voters.csv", chunksize=1000):
    # Append all rows to a new database table, which
    # we name 'voters':
    c.to_sql("voters", db, if_exists="append")

# Add an index on the 'street' column:
db.execute("CREATE INDEX street ON voters(street)")
db.close()
While we’re only creating a single index here, we could also create additional indexes on other columns, or on multiple columns at once, allowing us to quickly search the database using those columns as well.
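For example, here’s a hedged sketch of adding a second index that covers two columns. It assumes the voters table also has a last_name column, which is hypothetical and not part of the example schema above:

import sqlite3

db = sqlite3.connect("voters.sqlite")
# Hypothetical: assumes a 'last_name' column exists in the voters table.
# An index on (street, last_name) speeds up queries that filter on street
# alone, or on street and last name together.
db.execute("CREATE INDEX street_and_name ON voters(street, last_name)")
db.close()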
2. Rewrite our query function
Now that all the data is loaded into SQLite, we can retrieve voters by street:
def get_voters_for_street(street_name):
    conn = sqlite3.connect("voters.sqlite")
    q = "SELECT * FROM voters WHERE street = ?"
    values = (street_name,)
    return pd.read_sql_query(q, conn, params=values)
When you run this function, SQLite will load only those rows that match the query, and pass them to Pandas to turn into a DataFrame.
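Using it is then just a normal function call; for example, with a made-up street name:

# "Main Street" is a placeholder; use a street that exists in your data.
voters = get_voters_for_street("Main Street")
print(len(voters), "voters found")
print(voters.head())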
50x faster lookups
You’ll recall that with a CSV of 70,000 rows, our original approach took 574ms to look up the voters on a particular street. The re-implemented version based on SQLite takes only 10ms.
That’s a 50× speed-up! That’s the benefit of loading only the rows we actually care about, instead of parsing and filtering every single row in the original CSV.
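If you want to check the speed-up on your own data, here is a minimal timing sketch. It assumes both functions defined above are available, "Main Street" is again a placeholder, and your numbers will depend on your data and hardware:

import time

# Time the chunked-CSV version:
start = time.perf_counter()
get_voters_on_street("Main Street")
print("CSV chunking:", time.perf_counter() - start, "seconds")

# Time the SQLite-backed version:
start = time.perf_counter()
get_voters_for_street("Main Street")
print("SQLite index:", time.perf_counter() - start, "seconds")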