Fast subsets of large datasets with Pandas and SQLite
Let’s say you have a large amount of data, too large to fit in memory, and you want to load part of it into Pandas. If you’re only going to load one specific subset a single time, you can use chunking.
But what if you need to load different subsets of the data at different times? Loading the whole file each time will be slow.
What you need is a searchable index, and one easy (and fast!) way to do that is with SQLite.
A first attempt: chunking
Let’s consider a concrete example: you’re running a political campaign, and you have a CSV with details of every registered voter in your city. You’re sending out people to knock on doors, in different neighborhoods on different days, and so you want to load all the registered voters for a particular street.
Now, Pandas dataframes do have an index, but only after they’ve been loaded into memory. And this CSV is too big to load into memory, so you only want to load the records you actually care about.
Here’s a first approach, using chunking:
import pandas as pd def get_voters_on_street(name): return pd.concat( df[df["street"] == name] for df in pd.read_csv("voters.csv", chunksize=1000) )
We load the CSV in chunks (a series of small
DataFrames), filter each chunk by the street name, and then concatenate the filtered rows.
We’re loading every single row, but only care about a small subset, and so we have a lot of overhead. For example, for a voter database with only 70K voters the above function took 574ms on my computer. As of 2018 NYC had 4.6 million voters, so every street lookup might take on the order of 30 seconds.
If we do this once, that’s fine. If we need to do this over and over again with different streets, this may not be acceptable performance.
You need an index
An index is a summary, a way of saying “if you care about this, you can find the bulk of the data here”. In our example, we want to index by the street name, so we can quickly load only those voters who live on a particular street.
If you want to index data that doesn’t fit in memory, databases support this out of the box. A database like PostgreSQL or MySQL can add a lot of operational overhead, though: you don’t necessarily want to install and maintain whole server process. And that’s where SQLite comes in.
SQLite is a fully-featured relational database that runs as library, not a server; Python ships with built-in support. And SQLite stores its data in a single file. Instead of having to manage one CSV file, you have to manage one SQLite database file.
Using SQLite as data storage for Pandas
Let’s see how you can use SQLite from Pandas with two easy steps:
1. Load the data into SQLite, and create an index
SQLite databases can store multiple tables.
The first thing we’re going to do is load the data from
voters.csv into a new file,
voters.sqlite, where we will create a new table called
We’ll also want to tell SQLite to create an index on the
We only need to do this once:
import sqlite3 # Create a new database file: db = sqlite3.connect("voters.sqlite") # Load the CSV in chunks: for c in pd.read_csv("voters.csv", chunksize=1000): # Append all rows to a new database table, which # we name 'voters': c.to_sql("voters", db, if_exists="append") # Add an index on the 'street' column: db.execute("CREATE INDEX street ON voters(street)") db.close()
While we’re only creating a single index, we could also create additional indexes on other columns, or multiple columns, allowing us to quickly search the database using those columns.
2. Rewrite our query function
Now that all the data is loaded into SQLite, we can retrieve voters by street:
def get_voters_for_street(street_name): conn = sqlite3.connect("voters.sqlite") q = "SELECT * FROM voters WHERE street = ?" values = (street_name,) return pd.read_sql_query(q, conn, values)
When you run this function, SQLite will load only those rows that match the query, and pass them to Pandas to turn into a
50x faster lookups
You’ll recall that with a CSV with 70,000 rows, our original approach took 574ms to lookup the voters for a particular street. Using our re-implemented version based on SQLite, it only takes 10ms.
That’s a 50× speed-up! That’s the benefit of only having to load only those rows we actually care about, instead of having to parse and filter every single row in the original CSV.
Learn even more techniques for reducing memory usage—read the rest of the Small Big Data guide for Python.
Wasting compute money on processes that use too much memory?
Your Python batch process is using too much memory, and you have no idea which part of your code is responsible.
You need a tool that will tell you exactly where to focus your optimization efforts, a tool designed for data scientists and scientists. Learn how the Fil memory profiler can help you.
Level up your job skills and become a better data scientist
You're not a software engineer, but you still have to deal with everything from bugs to slow code to mysterious errors. Writing software that's maintainable, fast, and easy-to-understand would make you a better data scientists (not to mention more employable).
Subscribe to my newsletter, and every week you’ll get new articles showing you how to improve you software engineering skills, from testing to packaging to performance: