Using Polars in a Pandas world
Polars is a dataframe-based library that can be faster, more memory efficient, and often simpler to use than Pandas. It’s also much newer, and correspondingly less popular. In November 2023:
- Polars had ~2.6 million downloads from PyPI.
- Pandas had ~140 million downloads!
Because of Pandas’ popularity and decade and a half of availability, there are many third-party libraries with built-in support for Pandas, and others that specifically extend Pandas. Many plotting and visualization libraries will accept Pandas dataframes as an input, for example, and GeoPandas adds geographical data types to Pandas dataframes. If you’re using Polars, can you use these libraries? And if so, how?
In this article we’ll cover the various integration options you have between Polars and third-party libraries originally designed to work with Pandas:
- It Just Works, perhaps via the Dataframe Interchange Protocol or the Dataframe API Standard.
- Manual conversion to Pandas dataframes, and how to do it with essentially zero cost.
- Manual interoperability via files.
- Other alternatives.
Update: Added info about the Dataframe Interchange Protocol and link to Dataframe API client library, thanks to Marco Gorelli.
Some libraries work out-of-the-box with Polars
The first thing to do is check the documentation for the libraries you’re using. It’s possible they already support Polars with no extra work on your part, or only minimal work to install some dependencies.
There are multiple ways a Python library that originally only supported Pandas can end up supporting Polars dataframes transparently:
- It natively supports Polars.
- It converts Polars dataframes to Pandas dataframes by calling `to_pandas()` on Polars objects. This can have some performance and memory impact, or not, depending on how the library implements it. See the section below on manual conversion for details.
- It uses the Dataframe Interchange Protocol, a standardized API for converting between different dataframe formats.
- It uses the Dataframe API Standard. This standard is a way for third-party libraries to support all the Python dataframe libraries with one standard API: direct interaction, rather than conversion. That means Pandas, Polars, Dask, cuDF, Modin, Koalas, PySpark, Vaex, Ibis, and whoever else chooses to implement the API on the dataframe side. In practice it’s still a work in progress, but some libraries are starting to add support, and once it stabilizes it should become more common. (If you’re maintaining a library and would like to try using the API, see this compatibility library.)
For example, if you read the documentation for Plotly Express you’ll learn that v5.15 uses `to_pandas()` (method 2), and v5.16 uses both that and the (for now experimental) Dataframe Interchange Protocol (method 3).
So, to check if a library supports Polars, search the documentation for “polars”, “dataframe interchange”, “dataframe API”, and other related phrases.
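To make method 3 concrete, here is a minimal sketch of what the Dataframe Interchange Protocol looks like from the consuming side, assuming Pandas 2.x (which ships `pd.api.interchange.from_dataframe`). A Polars `DataFrame` would work the same way in place of the Pandas one, since Polars also implements `__dataframe__()`:

```python
import pandas as pd

# Any dataframe object implementing the Dataframe Interchange
# Protocol exposes a __dataframe__() method:
source = pd.DataFrame({"a": [1, 2, 3]})
assert hasattr(source, "__dataframe__")

# A consuming library can convert the object to its own dataframe
# type without knowing which library produced it:
converted = pd.api.interchange.from_dataframe(source)
print(converted)
```

This is why a library that only knows about the protocol can still accept Polars dataframes without importing Polars at all.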
Manual in-memory conversion with to_pandas()
Even if the library you want to use won’t accept Polars objects, you can convert to Pandas objects using the `to_pandas()` method on Polars dataframes and series.
You can convert in the other direction, from Pandas to Polars, using `polars.from_pandas()`.
The obvious two questions are:
- How slow is this?
- Will it double (or more) your memory usage when you create the converted dataframe or series?
The answer: it depends.
- In Pandas 1.x, `DataFrame`s are implemented with NumPy arrays as the underlying data model.
- In Pandas 2.x, for now NumPy is still supported and often the default, but it’s also possible to use Apache Arrow, an in-memory format specifically designed for dataframes. Over time, more and more of Pandas is using Arrow by default.
- Polars is natively based on Arrow as well.
`polars.DataFrame.to_pandas()` by default uses NumPy arrays, so it will have to convert all your data from Arrow to NumPy; this will at least double your memory usage, and take some computation time too.
If you use Arrow, however, the conversion will take essentially no time and no extra memory is needed (“zero copy”).
Ensuring cheap, zero-copy conversion from Polars to Pandas
To ensure you don’t use extra memory or CPU:
- Make sure to use the latest Pandas, 2.0 or later.
- Make sure PyArrow is installed (via `pip` or Conda, or your Python package manager of choice).
- Call `df.to_pandas(use_pyarrow_extension_array=True)` on the Polars `DataFrame`.
A quick benchmark
I loaded a compressed 20MB Parquet file, and measured the peak memory usage and CPU time of running the following script:

```python
import sys

import polars as pl

use_pyarrow = bool(int(sys.argv[1]))
df = pl.read_parquet("test.parquet")
pandas_df = df.to_pandas(use_pyarrow_extension_array=use_pyarrow)
```
I ran it three times: once using PyArrow, once not, and once not converting to Pandas at all (i.e. the last line of code commented out), as a baseline.
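The measurement harness isn’t shown above; as one possible sketch, on Linux you can read a process’s peak resident memory with the standard library’s `resource` module (note the units differ by platform):

```python
import resource

def peak_memory_kib() -> int:
    # On Linux, ru_maxrss is the peak resident set size in KiB;
    # on macOS it is reported in bytes instead.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

print(f"Peak memory so far: {peak_memory_kib()} KiB")
```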
|                    | Peak memory usage |
|--------------------|-------------------|
| Zero-copy to Arrow |                   |
| Converted to NumPy |                   |
“Zero-copy” doesn’t mean zero allocations; there may be some one-time allocation in one of the libraries involved. Still, the Arrow-based conversion is clearly very low overhead. Even the NumPy-based conversion wasn’t particularly expensive, just 150ms for a ~250MB dataframe, so assuming you have enough memory it might not be a concern either.
The ability to do zero-copy conversions between Pandas and Polars makes it very easy to share data across libraries. But there are other options as well:
- File-based interoperability: Instead of converting in-memory, you can also write your data to a file from Polars, and then read the file in Pandas, and vice versa. I recommend using the Parquet format, as it’s well-supported by both. In practice, zero-copy conversion means this is rarely going to be necessary.
- Alternative, Polars-specific libraries: Polars-specific libraries are still rare, but some do exist, like the `polars-business` package for doing business day calculations.
- Giving up and using Pandas: If you’re doing heavy geographical work, there will likely someday be a Polars-based replacement for GeoPandas, but for now you’re probably going to spend a lot of time using Pandas.
Polars is not an island, and interoperability with Pandas is straightforward. And with Pandas 2, most of the time it should have no performance or memory impact.