Using Polars in a Pandas world
Polars is a dataframe-based library that can be faster, more memory efficient, and often simpler to use than Pandas. It’s also much newer, and correspondingly less popular. In November 2023:
- Polars had ~2.6 million downloads from PyPI.
- Pandas had ~140 million downloads!
Because of Pandas’ popularity and decade and a half of availability, there are many third-party libraries with built-in support for Pandas, and others that specifically extend Pandas. Many plotting and visualization libraries will accept Pandas dataframes as an input, for example, and GeoPandas adds geographical data types to Pandas dataframes. If you’re using Polars, can you use these libraries? And if so, how?
In this article we’ll cover the various integration options you have between Polars and third-party libraries originally designed to work with Pandas:
- It Just Works, perhaps via the Dataframe Interchange Protocol or the Dataframe API Standard.
- Manual conversion to Pandas dataframes, and how to do it with essentially zero cost.
- Manual interoperability via files.
- Other alternatives.
Update: Added info about the Dataframe Interchange Protocol and link to Dataframe API client library, thanks to Marco Gorelli.
Some libraries work out-of-the-box with Polars
The first thing to do is check the documentation for the libraries you’re using. It’s possible they already support Polars with no extra work on your part, or only minimal work to install some dependencies.
There are multiple ways a Python library that originally only supported Pandas can end up supporting Polars dataframes transparently:
- It natively supports Polars.
- It converts Polars dataframes to Pandas dataframes by calling `to_pandas()` on Polars objects. This can have some performance and memory impact, or not, depending on how the library implements it; see the section below on manual conversion for details.
- It uses the Dataframe Interchange Protocol, a standardized API for converting between different dataframe formats.
- It uses the Dataframe API Standard. This standard gives third-party libraries a single API for interacting directly with every Python dataframe library, rather than converting between them. That means Pandas, Polars, Dask, cuDF, Modin, Koalas, PySpark, Vaex, Ibis, and whoever else chooses to implement the API on the dataframe side. In practice it’s still a work in progress, but some libraries are starting to add support, and once it stabilizes it should become more common. (If you’re maintaining a library and would like to try using the API, see this compatibility library.)
For example, if you read the documentation for Plotly Express you’ll learn that v5.15 uses `to_pandas()` (method 2), and v5.16 uses both that and the (currently experimental) Dataframe Interchange Protocol (method 3).
So, to check whether a library supports Polars, search its documentation for “polars”, “dataframe interchange”, “dataframe API”, and other related phrases.
Manual in-memory conversion with .to_pandas()
Even if the library you want to use won’t accept Polars objects, you can convert to Pandas objects using the polars.DataFrame.to_pandas() and polars.Series.to_pandas() methods.
You can convert from Pandas to Polars using polars.from_pandas().
The obvious two questions are:
- How slow is this?
- Will it double (or more) your memory usage when you create the converted dataframe or series?
The answer: it depends.
- In Pandas 1.x, Series and DataFrames are implemented with NumPy arrays as the underlying data model.
- In Pandas 2.x, NumPy is still supported and often the default for now, but it’s also possible to use Apache Arrow, an in-memory format specifically designed for dataframes. Over time, more and more of Pandas is using Arrow by default.
- Polars is natively based on Arrow as well.
By default, polars.DataFrame.to_pandas() uses NumPy arrays, so it has to convert all your data from Arrow to NumPy; this will at least double your memory usage, and take some computation too.
If you use Arrow, however, the conversion will take essentially no time and require no extra memory (“zero copy”).
Ensuring cheap, zero-copy conversion from Polars to Pandas
To ensure you don’t use extra memory or CPU:
- Make sure to use the latest Pandas, 2.0 or later.
- Make sure to install pyarrow (with pip or Conda, or your Python package manager of choice).
- Call df.to_pandas(use_pyarrow_extension_array=True) on the Polars DataFrame.
A quick benchmark
I loaded a compressed 20MB Parquet file, and measured peak memory usage and CPU of running the following script:
```python
import sys

import pyarrow
import pandas
import polars as pl

# Choose the conversion method from the command line: 1 = PyArrow, 0 = NumPy.
use_pyarrow = bool(int(sys.argv[1]))

# Load the compressed Parquet file into a Polars DataFrame.
df = pl.read_parquet("test.parquet")

# Convert to Pandas, zero-copy if use_pyarrow is true.
pandas_df = df.to_pandas(use_pyarrow_extension_array=use_pyarrow)
```
I ran it three times: once using PyArrow, once not, and once not converting to Pandas at all (i.e. the last line of code commented out), as a baseline.
Conversion method | CPU time | Peak memory usage
---|---|---
Didn’t convert | 1.39 secs | 390 MB
Zero-copy to Arrow | 1.37 secs | 449 MB
Converted to NumPy | 1.55 secs | 629 MB
Apparently “zero-copy” doesn’t mean zero allocations (perhaps there’s a one-time allocation in some library), but Arrow-based conversion is clearly very low overhead. Even the NumPy-based conversion wasn’t particularly expensive, just ~150ms extra for a ~250MB dataframe, so assuming you have enough memory it might not be a concern either.
Other options
The ability to do zero-copy conversions between Pandas and Polars makes it very easy to share data across libraries. But there are other options as well:
- File-based interoperability: Instead of converting in-memory, you can also write your data to a file from Polars, and then read the file in Pandas, and vice versa. I recommend using the Parquet format, as it’s well-supported by both. In practice, zero-copy conversion means this is rarely going to be necessary.
- Alternative, Polars-specific libraries: Polars-specific libraries are still rare, but some do exist, like the polars-business package for doing business day calculations.
- Giving up and using Pandas: If you’re doing heavy geographical work, there will likely someday be a Polars-based replacement for GeoPandas, but for now you’re probably going to spend a lot of time using Pandas.
Polars is not an island, and interoperability with Pandas is straightforward. And with Pandas 2, most of the time it should have no performance or memory impact.