Polars for initial data analysis, Polars for production
Initial data analysis (IDA) has different goals than your final, production data analysis:
- With IDA you need to examine the initial data and intermediate results, check your assumptions, and try different approaches. Exploratory data analysis has similar requirements.
- Once you’re happy with your approach, and you’re ready to run the analysis in an automated manner, you care a lot more about speed and resource usage.
These different goals often benefit from different implementation strategies and tools—unless you have a sufficiently flexible tool like Polars, the super-fast dataframe library. In particular, Polars has two fundamental APIs, each of which is useful in different situations:
- “Eager” mode, which is similar to how Pandas works, is well-suited for initial and exploratory data analysis.
- For production use, “lazy” mode often executes much faster, with lower memory usage, at the cost of not giving you access to intermediate results.
In this article we’ll use both APIs, and see how Polars lets you transition from interactively examining the data to something you can run even more efficiently in production.
An example: Generating a report of annual bikeshare usage
The Boston-area Bluebikes bikeshare program is increasingly popular; how many rides happen annually in just one local city, Cambridge? We have data!
To understand how to correctly generate a report from this data we will need to do some work.
Initial data analysis
Step 1. Getting the data
I’m not going to go into too much detail, except to say that:
- The ZIP files have inconsistent names, because the system used to be called Hubway and now it’s called Bluebikes.
- Likewise, the CSVs are inconsistently named.
Eventually I wrote a script:
#!/bin/bash
set -euo pipefail
wget https://s3.amazonaws.com/hubway-data/20{16,17}{01,02,03,04,05,06,07,08,09,10,11,12}-hubway-tripdata.zip
wget https://s3.amazonaws.com/hubway-data/2018{01,02,03,04}-hubway-tripdata.zip
wget https://s3.amazonaws.com/hubway-data/2018{05,06,07,08,09,10,11,12}-bluebikes-tripdata.zip
wget https://s3.amazonaws.com/hubway-data/20{19,20,21,22}{01,02,03,04,05,06,07,08,09,10,11,12}-bluebikes-tripdata.zip
wget https://s3.amazonaws.com/hubway-data/2023{01,02}-bluebikes-tripdata.zip
find . -iname "*-hubway-*.zip" -exec unzip {} \;
find . -iname "*-bluebikes-*.zip" -exec unzip {} \;
rm -f *-hubway-*.zip
rm -f *-bluebikes-*.zip
wget https://s3.amazonaws.com/hubway-data/current_bluebikes_stations.csv
At the time of writing, data was only available for January and February 2023.
Interlude: Eager APIs vs. Lazy APIs
There are two styles of API in Polars for processing data:
- Eager: You tell the API to do something, and it does it immediately. This is how Pandas works.
- Lazy: You stack a series of operations (loading the data, filtering the data, summarizing the data). No work is done at this point. Once all the operations are queued up, you tell the library to execute them all in one go. This is how Dask works.
Lazy execution gives the implementation plenty of opportunity for automatic optimization. For example, if you’re not going to use the "endtime" column in your analysis, there’s no point in even loading it. In contrast, an eager API has no idea how you will be using the data next, so it just has to do what you told it, with no shortcuts.
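To make the difference concrete, here’s a minimal sketch of the same small query written both ways. The file name rides.csv and the variable names are hypothetical; the entry points (read_csv(), scan_csv(), collect()) are the real Polars APIs:
import polars as pl

# Eager: the whole CSV is loaded into memory right here, before we filter.
eager_df = pl.read_csv("rides.csv")
eager_result = eager_df.filter(pl.col("tripduration") > 600)

# Lazy: nothing is read yet; we are only describing the query.
lazy_query = pl.scan_csv("rides.csv").filter(pl.col("tripduration") > 600)

# Polars only does the work (with optimizations like skipping columns the
# query never touches) when we ask for the final result:
lazy_result = lazy_query.collect()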
For IDA, we’re going to be spending time looking at intermediate steps in the processing, so we want an eager API. On the other hand, a lazy API is what you want in production, since it will be able to run faster.
Unlike Pandas, the Polars DataFrame library provides both kinds of APIs. Since we’re starting with IDA, we’ll start with the eager API.
Step 2. Figuring out Cambridge’s stations
I’m going to recreate, in the IPython command-line, the initial analysis I did in Jupyter.
First, we import Polars:
In [1]: import polars as pl
And we load the stations:
In [2]: stations = pl.read_csv("data/current_bluebikes_stations.csv", skip_rows=1, columns=["Number", "Name", "District", "Public"])
In [3]: stations
Out[3]:
shape: (448, 4)
------------------------------------------------
│ Number ┆ Name ┆ District ┆ Public │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str │
------------------------------------------------
│ K32015 ┆ 1200 Beacon St┆ Brookline ┆ Yes │
│ W32006 ┆ 160 Arsenal ┆ Watertown ┆ Yes │
│ A32019 ┆ 175 N Harvard ┆ Boston ┆ Yes │
│ S32035 ┆ 191 Beacon St ┆ Somerville ┆ Yes │
│ … ┆ … ┆ … ┆ … │
│ A32043 ┆ Western Ave at┆ Boston ┆ Yes │
│ ┆ St ┆ ┆ │
│ B32059 ┆ Whittier St He┆ Boston ┆ Yes │
│ D32040 ┆ Williams St at┆ Boston ┆ Yes │
│ ┆ St ┆ ┆ │
│ S32005 ┆ Wilson Square ┆ Somerville ┆ Yes │
------------------------------------------------
And now we choose only the Cambridge stations:
In [4]: cambridge_stations = stations.filter(pl.col("District") == "Cambridge")
In [5]: cambridge_stations
Out[5]:
shape: (78, 4)
-------------------------------------------------------
│ Number ┆ Name ┆ District ┆ Public │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str │
-------------------------------------------------------
│ M32026 ┆ 359 Broadway - Broadwa┆ Cambridge ┆ Yes │
│ ┆ Fayet… ┆ ┆ │
│ M32054 ┆ 699 Mt Auburn St ┆ Cambridge ┆ Yes │
│ M32060 ┆ 700 Huron Ave ┆ Cambridge ┆ Yes │
│ M32064 ┆ 75 Binney St ┆ Cambridge ┆ Yes │
│ … ┆ … ┆ … ┆ … │
│ M32048 ┆ Third at Binney ┆ Cambridge ┆ Yes │
│ M32040 ┆ University Park ┆ Cambridge ┆ Yes │
│ M32057 ┆ Vassal Lane at Tobin/V┆ Cambridge ┆ Yes │
│ M32050 ┆ Verizon Innovation Hub┆ Cambridge ┆ Yes │
│ ┆ Ware S… ┆ ┆ │
-------------------------------------------------------
According to this, there are 78 stations in Cambridge. So let’s validate that by checking the relevant page on the city website. It says: “The Bluebikes system has nearly 500 bike share stations, with over 80 located in Cambridge (as of April 2023).”
That doesn’t match the number we found. It’s possible some stations were installed after this CSV was created, but that seems unlikely given construction season has only just started. Or maybe the city website is wrong.
Step 3. Loading the rides
Let’s load all the rides. We’ll need them anyway, and that will give us a way to cross-check stations, by seeing if there are stations in the list of rides that aren’t in the list of stations.
In [6]: rides = pl.read_csv("data/20*.csv")
...
ComputeError: Could not parse `\N` as dtype `i64` at column 'birth year' (column number 14).
The current offset in the file is 760019 bytes.
That’s annoying: one of the columns has corrupt data, or perhaps it uses that symbol to indicate unknown data. This is apparently the birth year of the rider on each ride, which we don’t care about right now.
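We could also work around the bad data at load time: read_csv() accepts a null_values argument that tells Polars to treat a given string as missing data instead of failing to parse it. A hedged sketch, which I didn’t end up needing (and which wouldn’t solve the inconsistent-columns problem we’re about to hit):
# Sketch: treat the literal string \N as a null instead of a parse error.
rides = pl.read_csv("data/20*.csv", null_values=r"\N")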
Let’s look at the columns in the CSVs and try to pick only the ones we do care about:
$ head -1 data/202212-bluebikes-tripdata.csv
"tripduration","starttime","stoptime","start station id","start station name","start station latitude","start station longitude","end station id","end station name","end station latitude","end station longitude","bikeid","usertype","postal code"
Notice what’s missing? "birth year" isn’t in the list! Apparently only some of the CSVs have that column, which means we have inconsistent columns across CSVs. That’s no fun.
Luckily, picking only the columns we want allows us to load all the CSVs into a single DataFrame:
In [7]: rides = pl.read_csv("data/20*.csv", columns=["starttime", "start station id", "end station id", "start station name", "end station name"])
This takes a few seconds, and now we have all the rides loaded.
Step 4. Cross-referencing station names
We have two sets of stations: a CSV with station names and locations, and the start and end station for each ride. Let’s make sure they match, and see if we can find the mysterious missing Cambridge stations.
This sounds like a job for a set!
In [8]: ride_stations = set()
In [9]: for column in ["start station name", "end station name"]:
...: ride_stations |= set(rides.get_column(column).to_list())
In [10]: len(ride_stations)
Out[10]: 690
Looking back we see that the stations CSV had 448 rows, and 690 != 448.
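As an aside, the same count can be done without leaving Polars; here’s a minimal sketch, where the "name" alias and the unique_stations variable are mine:
# Stack both name columns into one and count the distinct station names.
unique_stations = pl.concat(
    [
        rides.select(pl.col("start station name").alias("name")),
        rides.select(pl.col("end station name").alias("name")),
    ]
).unique()
print(unique_stations.height)  # should match the set-based count above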
Let’s inspect the data and see what’s going on, with commentary interleaved:
In [12]: ride_stations
Out[12]:
{'1200 Beacon St',
'160 Arsenal',
'175 N Harvard St',
'18 Dorrance Warehouse',
'191 Beacon St',
'2 Hummingbird Lane at Olmsted Green',
# (WHAT, oh there's an extra period)
'30 Dane St',
'30 Dane St.',
# (Oh, there's former stations although this one seems strange...)
'30 Dane St. (former)',
'359 Broadway - Broadway at Fayette Street',
# (And temporary winter locations)
'515 Somerville Ave (Temp. Winter Location)',
...
# (Sometimes with different spelling!)
'Charles Circle - Charles St at Cambridge St TEMPORARY WINTER LOCATION',
...
Well that’s fun: former stations, inconsistent spelling, winter relocations… We can also check which stations appear in the rides but don’t exist in the stations CSV, to see if there are any other categories of inconsistency we’re missing.
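A brief sketch of that cross-check, reusing the ride_stations set from above (the known_stations and missing_from_csv names are mine):
# Station names that rides refer to but the stations CSV doesn't know about.
known_stations = set(stations.get_column("Name").to_list())
missing_from_csv = ride_stations - known_stations
print(sorted(missing_from_csv))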
Step 5. A different approach to finding stations in Cambridge
But!
Let’s rethink first: we are trying to find rides going to or from Cambridge. One way to do that is to match on the “district” a station is in, information that is only in the stations CSV. But is there another way?
Let’s look again at the columns we have available for rides:
"tripduration","starttime","stoptime","start station id","start station name","start station latitude","start station longitude","end station id","end station name","end station latitude","end station longitude","bikeid","usertype","postal code"
Longitude and latitude for each station do give us enough information to find the city, if we run them through a geocoder. A postal code would be even easier, since I’m fairly certain Cambridge ZIP codes don’t cross city boundaries, except there’s only one postal code per ride, and we want one for both the start station and the end station.
Lovely.
Step 6. Fuck it, let’s just generate a report
Probably we want to do geocoding. But the point here is to talk about Polars, so maybe we’re digressing too much.
So let’s just pretend that the stations CSV is complete and accurate, and find all the rides either starting or finishing in Cambridge, using the cambridge_stations we calculated earlier:
In [13]: cambridge_stations = cambridge_stations.get_column("Name")
In [14]: cambridge_rides = rides.filter(pl.col("start station name").is_in(cambridge_stations) | pl.col("end station name").is_in(cambridge_stations))
In [15]: cambridge_rides
Out[15]:
shape: (7139138, 5)
-------------------------------------------------------
│ startti ┆ start ┆ end ┆ start ┆ end │
│ me ┆ station┆ station┆ station ┆ station │
│ --- ┆ id ┆ id ┆ name ┆ name │
│ str ┆ --- ┆ --- ┆ --- ┆ --- │
│ ┆ i64 ┆ i64 ┆ str ┆ str │
-------------------------------------------------------
│ 2016-01 ┆ 36 ┆ 67 ┆ Boston ┆ MIT at │
│ -01 00: ┆ ┆ ┆ Public ┆ Mass Ave / │
│ 15:36 ┆ ┆ ┆ Library - ┆ Amherst St │
│ ┆ ┆ ┆ 700 Boyl… ┆ │
│ 2016-01 ┆ 107 ┆ 176 ┆ Ames St at ┆ Lesley │
│ -01 00: ┆ ┆ ┆ Main St ┆ University │
│ 25:13 ┆ ┆ ┆ ┆ │
│ 2016-01 ┆ 141 ┆ 90 ┆ Kendall ┆ Lechmere │
│ -01 00: ┆ ┆ ┆ Street ┆ Station at │
│ 25:24 ┆ ┆ ┆ ┆ Cambridge │
│ ┆ ┆ ┆ ┆ St… │
│ 2016-01 ┆ 178 ┆ 80 ┆ MIT ┆ MIT Stata │
│ -01 00: ┆ ┆ ┆ Pacific St ┆ Center at │
│ 45:02 ┆ ┆ ┆ at ┆ Vassar St │
│ ┆ ┆ ┆ Purrington ┆ / … │
│ ┆ ┆ ┆ St ┆ │
│ … ┆ … ┆ … ┆ … ┆ … │
│ 2023-02 ┆ 97 ┆ 67 ┆ Harvard ┆ MIT at │
│ -28 23: ┆ ┆ ┆ University ┆ Mass Ave / │
│ 42:36.4 ┆ ┆ ┆ River ┆ Amherst St │
│ 470 ┆ ┆ ┆ Houses … ┆ │
│ 2023-02 ┆ 334 ┆ 145 ┆ Mass Ave ┆ Rindge │
│ -28 23: ┆ ┆ ┆ at Hadley/ ┆ Avenue - │
│ 43:10.7 ┆ ┆ ┆ Walden ┆ O'Neill │
│ 420 ┆ ┆ ┆ ┆ Library │
│ 2023-02 ┆ 437 ┆ 68 ┆ Berkshire ┆ Central │
│ -28 23: ┆ ┆ ┆ Street at ┆ Square at │
│ 54:01.4 ┆ ┆ ┆ Cambridge ┆ Mass Ave / │
│ 120 ┆ ┆ ┆ St… ┆ Ess… │
│ 2023-02 ┆ 553 ┆ 413 ┆ Cambridge ┆ Kennedy-Lo │
│ -28 23: ┆ ┆ ┆ Crossing ┆ ngfellow │
│ 58:20.9 ┆ ┆ ┆ at North ┆ School 158 │
│ 490 ┆ ┆ ┆ Firs… ┆ Sp… │
-------------------------------------------------------
More than 7 million rides, not bad! And this is almost certainly an undercount, given we know we’re probably missing some stations.
Now that we have the rides, we can group them by year:
In [16]: def by_year(df):
...: return df.groupby_dynamic("starttime", every="1y").agg(pl.count())
...:
In [17]: by_year(cambridge_rides)
---------------------------------------------------------------------------
ComputeError Traceback (most recent call last)
Cell In[17], line 1
----> 1 by_year(cambridge_rides)
...
ComputeError: expected any of the following dtypes: { Date, Datetime, Int32, Int64 }, got str
Oops. The starttime column is a string; we forgot to parse it into a datetime. Let’s do that.
In [18]: cambridge_rides = cambridge_rides.with_columns(pl.col("starttime").str.slice(0, 10).str.strptime(pl.Date, fmt="%Y-%m-%d", strict=False))
And now we can look at the counts by year; I’ll leave making a graph as an exercise for the reader, since the table is sufficient to show that, with a brief speed bump from the pandemic, ridership keeps increasing:
In [20]: by_year(cambridge_rides)
Out[20]:
shape: (8, 2)
------------------------
│ starttime ┆ count │
│ --- ┆ --- │
│ date ┆ u32 │
------------------------
│ 2016-01-01 ┆ 556273 │
│ 2017-01-01 ┆ 616359 │
│ 2018-01-01 ┆ 837028 │
│ 2019-01-01 ┆ 1152015 │
│ 2020-01-01 ┆ 804192 │
│ 2021-01-01 ┆ 1296830 │
│ 2022-01-01 ┆ 1720675 │
│ 2023-01-01 ┆ 155766 │
------------------------
One thing to notice is that 2023 is much lower, because it only has two months of data; including it would be misleading unless we somehow marked it visually as incomplete. So for now we will just omit 2023:
In [21]: by_year(cambridge_rides.filter(pl.col("starttime").dt.year() < 2023))
Out[21]:
shape: (7, 2)
------------------------
│ starttime ┆ count │
│ --- ┆ --- │
│ date ┆ u32 │
------------------------
│ 2016-01-01 ┆ 556273 │
│ 2017-01-01 ┆ 616359 │
│ 2018-01-01 ┆ 837028 │
│ 2019-01-01 ┆ 1152015 │
│ 2020-01-01 ┆ 804192 │
│ 2021-01-01 ┆ 1296830 │
│ 2022-01-01 ┆ 1720675 │
------------------------
Interlude: What’s involved in initial data analysis?
As we’ve seen, during the initial data analysis, even for a seemingly trivial report, we had to deal with:
- Inconsistent input files (names and columns).
- Bad data (whatever the "\N" was in the "birth year" column).
- Inconsistent information across input types (the stations CSV has different stations than the rides CSVs).
- Questions about how to get the most accurate data (maybe we should abandon the station list and just use the longitude/latitude of stations + geocoding?).
- Post-processing data (converting strings to dates).
- Thinking about different ways to represent data (should we include 2023?).
And more! I omitted some random typos, API usage mistakes, and other problems, just for clarity.
Because we didn’t know about these problems up front, an eager API where we immediately get back results was immensely useful in debugging problems, investigating alternatives, and working through issues.
Switching to a production script
Now that we have an outline of a working analysis, let’s convert it into a standalone script based on what we’ve learned.
An eager production script
As a first pass, we’ll continue using the eager API, basically the same code as above converted into a script:
import polars as pl
stations = pl.read_csv(
"data/current_bluebikes_stations.csv",
skip_rows=1,
columns=["Number", "Name", "District", "Public"],
)
cambridge_stations = stations.filter(
pl.col("District") == "Cambridge"
).get_column("Name")
rides = (
pl.read_csv(
"data/20*.csv",
columns=[
"starttime",
"start station id",
"end station id",
"start station name",
"end station name",
],
)
.with_columns(
pl.col("starttime")
.str.slice(0, 10)
.str.strptime(pl.Date, fmt="%Y-%m-%d", strict=False)
)
)
cambridge_rides = rides.filter(
pl.col("start station name").is_in(cambridge_stations)
| pl.col("end station name").is_in(cambridge_stations)
)
def by_year(df):
return df.groupby_dynamic("starttime", every="1y").agg(
pl.count()
)
print(
str(
by_year(
cambridge_rides.filter(
pl.col("starttime").dt.year() < 2023
)
)
)
)
If we run this, it takes:
- 3.2 wallclock seconds.
- 12.9 CPU seconds.
- 4.5 gigabytes of RAM.
Can we do better?
Note: Whether or not any particular tool or technique will speed things up depends on where the bottlenecks are in your software.
Need to identify the performance and memory bottlenecks in your own Python data processing code? Try the Sciagraph profiler, with support for profiling both in development and production on macOS and Linux, and with built-in Jupyter support.
Switching to a lazy API
In the eager version above, each step executed immediately, sometimes doing unnecessary work; this is how Pandas works as well. But Polars also includes a lazy API that only executes the work once we’ve set up the whole query, which gives it the chance to optimize the query as a whole.
For the stations CSV, we don’t bother switching to the lazy API, because it’s so small (a few hundred rows) that it won’t make a difference. But for the millions of rows in the rides CSVs, we hope it will speed things up.
Switching to the lazy API involves swapping out polars.read_csv() for polars.scan_csv(), and then calling collect() at the end of the query to instantiate the final result.
import polars as pl
stations = pl.read_csv(
"data/current_bluebikes_stations.csv",
skip_rows=1,
columns=["Number", "Name", "District", "Public"],
)
cambridge_stations = stations.filter(
pl.col("District") == "Cambridge"
).get_column("Name")
# Switched to lazy API:
rides = (
pl.scan_csv("data/20*.csv") # 💖 replaced read_csv() 💖
.with_columns(
pl.col("starttime")
.str.slice(0, 10)
.str.strptime(pl.Date, fmt="%Y-%m-%d", strict=False)
)
)
cambridge_rides = rides.filter(
pl.col("start station name").is_in(cambridge_stations)
| pl.col("end station name").is_in(cambridge_stations)
)
def by_year(df):
return df.groupby_dynamic("starttime", every="1y").agg(
pl.count()
)
print(
str(
by_year(cambridge_rides)
.filter(pl.col("starttime").dt.year() < 2023)
.collect() # 💖 Added .collect() 💖
)
)
The change is tiny: two lines of code. We also no longer bother specifying which columns to load, since Polars can figure that out on its own, based on our query.
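If you want to check that Polars really is pruning the unused columns, you can ask the lazy query to describe its optimized plan before collecting it. A hedged sketch; depending on your Polars version the method is explain() or the older describe_optimized_plan():
# Print the optimized query plan; the CSV scan step should list only the
# columns the query actually needs, confirming projection pushdown.
full_query = by_year(cambridge_rides).filter(pl.col("starttime").dt.year() < 2023)
print(full_query.explain())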
Here’s a comparison of both versions:
| Mode  | Wallclock time (sec) | CPU time (sec) | Peak memory (GB) |
|-------|----------------------|----------------|------------------|
| Eager | 3.2                  | 12.9           | 4.5              |
| Lazy  | 2.3                  | 10.2           | 1.6              |
We have reduced CPU usage, improved parallelism so the results come back even faster, and massively reduced memory usage. Not bad!
The downside is that we don’t have access to intermediate results—the query only does exactly what we asked for, and only once we call collect().
But for production use we don’t care: we’ve already figured out the final query to run.
What about Pandas?
Pandas only offers the equivalent of Polars’ eager mode, and mostly doesn’t have parallelism except maybe when loading CSVs. So while I haven’t implemented the above query in Pandas, I would expect it to be no better than the Polars eager mode, and quite likely much worse.
Comparing eager and lazy timelines
We can use the Sciagraph profiler to compare how these two modes work.
In addition to the more usual flamegraphs it generates for both speed and memory usage, Sciagraph also generates timelines, which make it easier to see parallelism and the order of operations. These are best viewed in a large window on a desktop computer; they will be hard to view on a phone.
Eager version timeline
Notice we have a series of distinct operations, each triggered by different Python code:
Lazy version timeline
In the lazy version we only have one single Python operation, the collect(), triggering lots of work happening internally in Polars.
We can also see that parallelism happens more often in the Polars thread pool.
Give Polars a try!
The ability to do both exploratory data analysis and final production code using the same library is very compelling. What’s more, Polars is fast, and it lets you take advantage of multiple cores. And once you’ve nailed down your queries, switch to lazy mode and everything will run even faster.