Loading SQL data into Pandas without running out of memory
You have some data in a relational database, and you want to process it with Pandas.
So you use Pandas’ handy read_sql() API to get a DataFrame, and promptly run out of memory.
The problem: you’re loading all the data into memory at once. If you have enough rows in the SQL query’s results, it simply won’t fit in RAM.
Pandas does have a batching option for read_sql(), which can reduce memory usage, but it’s still not perfect: it also loads all the data into memory at once!
So how do you process larger-than-memory queries with Pandas? Let’s find out.
Iteration #1: Just load the data
As a starting point, let’s just look at the naive—but often sufficient—method of loading data from a SQL database into a Pandas DataFrame.
You can use pandas.read_sql() to turn a SQL query into a DataFrame:
import pandas as pd
from sqlalchemy import create_engine

def process_sql_using_pandas():
    engine = create_engine(
        "postgresql://postgres:pass@localhost/example"
    )
    dataframe = pd.read_sql("SELECT * FROM users", engine)
    print(f"Got dataframe with {len(dataframe)} entries")
    # ... do something with dataframe ...

if __name__ == '__main__':
    process_sql_using_pandas()
If we run that, we see that for this example it loads 1,000,000 rows:
$ python pandas_sql_1.py
Got dataframe with 1000000 entries
Problem #1: all the data in memory, multiple times!
How much memory does this use? And where does memory usage come from? To find out, we can use the Fil memory profiler to measure peak memory usage.
$ fil-profile run pandas_sql_1.py
...
Here’s what the result looks like:
If we look through this report, we can see that all the rows in the database are loaded into memory. And in fact, they’re loaded not just once but multiple times, four times in fact:
- dbapi_cursor.fetchall() retrieves all the rows.
- SQLAlchemy does some sort of additional manipulation involving the rows.
- Pandas converts data into tuples.
- Pandas converts some data (the tuples?) into arrays.
I’m guessing a little bit about what each piece of code does, but that’s what the code suggests without spending a lot more time digging in.
Loading four copies of the data into memory is far too much, so let’s see if we can do better.
Iteration #2: Imperfect batching
The next step is to use one of the basic techniques of memory reduction: batching or chunking. In many cases you don’t actually need all of the rows in memory at once. If you can load the data in chunks, you are often able to process the data one chunk at a time, which means you only need as much memory as a single chunk.
And in fact, pandas.read_sql() has an API for chunking: pass in a chunksize parameter.
The result is an iterable of DataFrames:
import pandas as pd
from sqlalchemy import create_engine

def process_sql_using_pandas():
    engine = create_engine(
        "postgresql://postgres:pass@localhost/example"
    )
    for chunk_dataframe in pd.read_sql(
            "SELECT * FROM users", engine, chunksize=1000):
        print(
            f"Got dataframe w/{len(chunk_dataframe)} rows"
        )
        # ... do something with dataframe ...

if __name__ == '__main__':
    process_sql_using_pandas()
If we run this, we can see the code is loading 1000 rows at a time:
$ python pandas_sql_2.py
Got dataframe w/1000 rows
Got dataframe w/1000 rows
Got dataframe w/1000 rows
...
Problem #2: all the data in memory, still
So have we reduced memory usage? We can again run the program with Fil, with the following result:
On the one hand, this is a great improvement: we’ve reduced memory usage from ~400MB to ~100MB.
On the other hand, we’re apparently still loading all the data into memory in cursor.execute()!
What’s happening is that SQLAlchemy is using a client-side cursor: it loads all the data into memory, and then hands the Pandas API 1000 rows at a time, but from local memory. If our data is large enough, it still won’t fit in memory.
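To see what that distinction looks like at the driver level, here is a rough sketch using psycopg2 directly (assuming that’s the PostgreSQL driver in use; the cursor name is just for illustration). A default cursor is client-side, while a named cursor is server-side and fetches rows on demand, which is essentially what we’ll ask SQLAlchemy to do next:

import psycopg2

conn = psycopg2.connect(
    "dbname=example user=postgres password=pass host=localhost"
)

# Client-side cursor: execute() transfers the entire result set into
# local memory before fetchmany() hands out a single row.
client_cursor = conn.cursor()
client_cursor.execute("SELECT * FROM users")
some_rows = client_cursor.fetchmany(1000)

# Server-side ("named") cursor: rows stay in the database and are
# retrieved itersize at a time as you iterate.
server_cursor = conn.cursor(name="streaming_example")
server_cursor.itersize = 1000
server_cursor.execute("SELECT * FROM users")
for row in server_cursor:
    pass  # each batch of 1000 rows is fetched only when needed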
Iteration #3: Real batching
What you need to do to get real batching is tell SQLAlchemy to use server-side cursors, aka streaming. Instead of loading all rows into memory, it will only load rows from the database when they’re requested by the user, in this case Pandas. This works with multiple engines, like Oracle and MySQL; it’s not just limited to PostgreSQL.
To use this feature, we need to write our code slightly differently:
import pandas as pd
from sqlalchemy import create_engine

def process_sql_using_pandas():
    engine = create_engine(
        "postgresql://postgres:pass@localhost/example"
    )
    conn = engine.connect().execution_options(
        stream_results=True)
    for chunk_dataframe in pd.read_sql(
            "SELECT * FROM users", conn, chunksize=1000):
        print(f"Got dataframe w/{len(chunk_dataframe)} rows")
        # ... do something with dataframe ...

if __name__ == '__main__':
    process_sql_using_pandas()
Once we make this change, memory usage from the database rows and DataFrame is essentially nil; all the memory usage is due to library imports:
Problem #3: Shouldn’t Pandas do this by default?
Pandas should probably be setting this option automatically if chunksize is set, in order to reduce memory usage.
There’s an open issue about it; hopefully someone—perhaps you!—will submit a PR.
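In the meantime, you can hide the extra step behind a small helper of your own. Here’s a minimal sketch; read_sql_chunked is my own name, not a Pandas or SQLAlchemy API, and it just combines stream_results=True with the chunksize option shown above:

import pandas as pd
from sqlalchemy import create_engine

def read_sql_chunked(query, engine, chunksize=1000):
    # Hypothetical convenience wrapper: open a streaming (server-side
    # cursor) connection, then yield one DataFrame per chunk.
    with engine.connect().execution_options(
            stream_results=True) as conn:
        for chunk in pd.read_sql(query, conn, chunksize=chunksize):
            yield chunk

# Example usage:
# engine = create_engine("postgresql://postgres:pass@localhost/example")
# for chunk_dataframe in read_sql_chunked("SELECT * FROM users", engine):
#     ...  # process one chunk at a time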
Reducing memory with batching
With batching plus server-side cursors, you can process arbitrarily large SQL results as a series of DataFrames without running out of memory. Whether you get back 1000 rows or 10,000,000,000, you won’t run out of memory so long as you’re only storing one batch at a time in memory.
It’s true, you won’t be able to load all the data at once. But quite often batched processing is sufficient, if not for all processing, then at least for an initial pass summarizing the data enough that you can then load the whole summary into memory.
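For example, here’s one way that kind of summarizing pass might look, building per-country user counts one chunk at a time; the country column is assumed for illustration and isn’t part of the example schema above:

import pandas as pd
from sqlalchemy import create_engine

def count_users_by_country():
    engine = create_engine(
        "postgresql://postgres:pass@localhost/example"
    )
    with engine.connect().execution_options(
            stream_results=True) as conn:
        # Only this small running summary stays in memory; each chunk
        # is discarded as soon as it has been counted.
        totals = pd.Series(dtype="int64")
        for chunk_dataframe in pd.read_sql(
                "SELECT * FROM users", conn, chunksize=1000):
            counts = chunk_dataframe["country"].value_counts()
            totals = totals.add(counts, fill_value=0)
    return totals.astype("int64").sort_values(ascending=False)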