Fil: A memory profiler for Python

Your Python code reads some data, processes it, and uses too much memory; maybe it even dies due to an out-of-memory error. In order to reduce memory usage, you first need to figure out:

  1. Where peak memory usage is, also known as the high-water mark.
  2. What code was responsible for allocating the memory that was present at that peak moment.

That’s exactly what Fil will help you find. Fil is an open source memory profiler designed for data processing applications written in Python, and it includes native support for Jupyter.

Fil comes in two editions:

  • The open source edition: designed for offline profiling. It has enough of a performance impact that you won’t want to use it on production workloads, but it can profile even small amounts of memory.
  • The commercial production edition: optimized for data-intensive programs that allocate large amounts of memory, it is fast enough to run on all your production data processing batch jobs.

Getting started with Fil

In this section you’ll learn how to install Fil, and how to use it for the first time.

Installing Fil

Fil requires macOS or Linux, and Python 3.6 or later. You can install it with Conda, a sufficiently new version of Pip, or higher-level tools like Poetry or Pipenv.

Conda

To install on Conda:

$ conda install -c conda-forge filprofiler

Pip (or similar tools)

To install the latest version of Fil you’ll need Pip 19 or newer. You can check the current version like this:

$ pip --version
pip 19.3.0

If you’re using something older than v19, you can upgrade by doing:

$ pip install --upgrade pip

If that doesn’t work, try running your code in a virtualenv (always a good idea in general):

$ python3 -m venv venv/
$ source venv/bin/activate
(venv) $ pip install --upgrade pip

Assuming you have a new enough version of pip, you can now install Fil:

$ pip install filprofiler

Using Fil for the first time

First, install Fil.

Then, create a Python file called example.py with the following code:

import numpy as np

def make_big_array():
    return np.zeros((1024, 1024, 50))

def make_two_arrays():
    arr1 = np.zeros((1024, 1024, 10))
    arr2 = np.ones((1024, 1024, 10))
    return arr1, arr2

def main():
    arr1, arr2 = make_two_arrays()
    another_arr = make_big_array()

main()

Now, you can run it with Fil:

$ fil-profile run example.py

This will run the program under Fil, and pop up the results.

In the next section, we’ll look at the results and see what they tell us.

Understanding Fil’s output

Let’s look at the result of the Fil run from the previous section:

Flamegraph of example.py’s peak memory usage

What does this mean?

What you’re seeing is a flamegraph, a visualization that shows a tree of callstacks and which ones were most expensive. In Fil’s case, it shows the callstacks responsible for memory allocations at the point in time when memory usage was highest.

The wider or redder the frame, the higher the percentage of memory that function was responsible for. Each line is an additional call in the callstack.

This particular flamegraph is interactive:

  • Click on a frame to see a zoomed-in view of that part of the callstack. You can then click “Reset zoom” in the upper left corner to get back to the main overview.
  • Hover over a frame with your mouse to get additional details.

To optimize your code, focus on the wider and redder frames. These are the frames that allocated most of the memory. In this particular example, you can see that the most memory was allocated by a line of code in the make_big_array() function.

Having found the source of the memory allocations at the moment of peak memory usage, you can then go and reduce memory usage. You can then validate your changes reduced memory usage by re-running your updated program with Fil and comparing the result.
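
For instance, one plausible (and purely hypothetical) optimization of example.py is to use a smaller dtype: NumPy defaults to 8-byte float64, so switching to float32 halves the array’s size.

import numpy as np

def make_big_array():
    # 1024 x 1024 x 50 x 4 bytes ≈ 200MB, instead of ≈ 400MB
    # with the default float64.
    return np.zeros((1024, 1024, 50), dtype=np.float32)

Re-running fil-profile run example.py on the modified program should then show a correspondingly smaller peak.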

Understanding Fil

In this section you’ll learn how Fil compares to other Python memory tools, and how Fil works under the hood.

Fil vs other Python memory tools

There are two distinct patterns of Python usage, each with its own source of memory problems.

In a long-running server, memory usage can grow indefinitely due to memory leaks. That is, some memory is not being freed.

  • If the issue is in Python code, tools like tracemalloc and Pympler can tell you which objects are leaking and what is preventing them from being freed.
  • If you’re leaking memory in C code, you can use tools like Valgrind.

Fil, however, is not specifically aimed at memory leaks, but at the other use case: data processing applications. These applications load in data, process it somehow, and then finish running.

The problem with these applications is that they can, on purpose or by mistake, allocate huge amounts of memory. It might get freed soon after, but if you allocate 16GB of RAM and only have 8GB in your computer, the lack of leaks doesn’t help you.

Fil will therefore tell you, in an easy to understand way:

  1. Where peak memory usage is, also known as the high-water mark.
  2. What code was responsible for allocating the memory that was present at that peak moment.
  3. This includes memory allocated by C/Fortran/C++/whatever extensions that don’t use Python’s memory allocation API (tracemalloc only tracks Python’s memory APIs).

This allows you to optimize that code in a variety of ways.

How Fil works

Fil uses the LD_PRELOAD/DYLD_INSERT_LIBRARIES mechanism to preload a shared library at process startup. This is why Fil can’t be used as a regular library and needs to be started in a special way: it requires setting up the correct environment before Python starts.

This shared library intercepts all the low-level C memory allocation and deallocation API calls, and keeps track of the corresponding allocation. For example, instead of a malloc() memory allocation going directly to your operating system, Fil will intercept it, keep note of the allocation, and then call the underlying implementation of malloc().

At the same time, the Python tracing infrastructure (the same infrastructure used by cProfile and coverage.py) is used to figure out which Python callstack/backtrace is responsible for each allocation.
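
As a simplified sketch of the tracing side (not Fil’s actual implementation), Python’s tracing hooks let you maintain a picture of the current callstack as code runs:

import sys

callstack = []

def tracer(frame, event, arg):
    # "call" fires when a function is entered, "return" when it
    # exits, so the list mirrors the current Python callstack.
    if event == "call":
        callstack.append(frame.f_code.co_name)
    elif event == "return" and callstack:
        callstack.pop()
    return tracer

sys.settrace(tracer)

A real profiler records a snapshot of such a callstack whenever an intercepted allocation happens.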

How to use Fil

In this section you will learn how to use Fil to profile complete Python programs, code in Jupyter, and subsets of your Python program.

You will also learn how to use Fil to debug out-of-memory crashes and memory leaks.

Profiling complete Python programs

You want to get a memory profile of your Python program end-to-end, from when it starts running to when it finishes.

Profiling Python scripts

Let’s say you usually run your program like this:

$ python yourscript.py --input-file=yourfile

Just do:

$ fil-profile run yourscript.py --input-file=yourfile

And it will generate a report and automatically try to open it for you in a browser. Reports will be stored in the fil-result/ directory in your current working directory.

You can also use this alternative syntax:

$ python -m filprofiler run yourscript.py --input-file=yourfile

Profiling Python modules (python -m)

If your program is usually run as a module:

$ python -m yourapp.yourmodule --args

You can run it with Fil like this:

$ fil-profile run -m yourapp.yourmodule --args

Or like this:

$ python -m filprofiler run -m yourapp.yourmodule --args

Profiling in Jupyter

To measure peak memory usage of some code in Jupyter you need to do three things, covered step by step below: use the “Python 3 with Fil” kernel, load the Fil extension, and add a magic to the cell you want to profile.

Using Fil in Jupyter

1. Use “Python 3 with Fil” kernel

Jupyter notebooks run with a particular “kernel”, which most of the time just determines which programming language the notebook is using, like Python or R. Fil support in Jupyter requires a special kernel, so instead of using the “Python 3” kernel you’ll use the “Python 3 with Fil” kernel.

There are two ways to choose this kernel:

  1. You can choose this kernel when you create a new notebook.
  2. You can switch an existing notebook in the Kernel menu. There should be a “Change Kernel” option in there in both Jupyter Notebook and JupyterLab.

2. Load the extension

In one of the cells in your notebook, add this to the cell:

%load_ext filprofiler

3. Profiling a particular cell

You can now do memory profiles of particular cells: just add %%filprofile as the first line of the cell and run the cell as usual.
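
For example, a cell profiled this way might look like this (the code being profiled is just an illustration):

%%filprofile
import numpy as np
arr = np.ones((1024, 1024, 50))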

An example

Here’s an example session:

Screenshot of JupyterLab session

Profiling a subset of your Python program

Sometimes you only want to profile your Python program part of the time. For this use case, Fil provides a Python API.

Important: This API turns profiling on and off for the whole process! If you want more fine-grained profiling, e.g. per thread, please file an issue.

Using the Python API

1. Add profiling in your code

Let’s say you have some code that does the following:

def main():
    config = load_config()
    result = run_processing(config)
    generate_report(result)

You only want to get memory profiling for the run_processing() call.

You can do so in the code like so:

from filprofiler.api import profile

def main():
    config = load_config()
    result = profile(lambda: run_processing(config), "/tmp/fil-result")
    generate_report(result)

You could also make it conditional, e.g. based on an environment variable:

import os
from filprofiler.api import profile

def main():
    config = load_config()
    if os.environ.get("FIL_PROFILE"):
        result = profile(lambda: run_processing(config), "/tmp/fil-result")
    else:
        result = run_processing(config)
    generate_report(result)

2. Run your script with Fil

You still need to run your program in a special way. If previously you did:

$ python yourscript.py --config=myconfig

Now you would do:

$ filprofiler python yourscript.py --config=myconfig

Notice that you’re doing filprofiler python, rather than filprofiler run as you would if you were profiling the full script. Only functions running for the duration of the filprofiler.api.profile() call will have memory profiling enabled, including of course the function you pass in. The rest of the code will run at (close to) normal speed, with its normal configuration.

Each call to profile() will generate a separate report. The memory profiling report will be written to the directory specified as the output destination when calling profile(); in our example above that was "/tmp/fil-result". Unlike full-program profiling:

  1. The directory you give will be used directly; there won’t be timestamped sub-directories. If there are multiple calls to profile(), it is your responsibility to ensure each call writes to a unique directory (see the sketch after this list).
  2. The report(s) will not be opened in a browser automatically, on the presumption you’re running this in an automated fashion.
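
For example, here’s one way (a sketch, not part of Fil’s API) to give each profile() call its own timestamped directory:

from datetime import datetime
from filprofiler.api import profile

def profile_to_unique_dir(function):
    # Each call writes to a directory like
    # /tmp/fil-result/2021-04-02T10:11:12.131313
    path = "/tmp/fil-result/" + datetime.now().isoformat()
    return profile(function, path)

result = profile_to_unique_dir(lambda: run_processing(config))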

Debugging out-of-memory crashes using Fil

Typically when your program runs out of memory, it will crash, get killed mysteriously by the operating system, or suffer other unfortunate side-effects.

To help you debug these problems, Fil will heuristically try to catch out-of-memory conditions, and dump a report if it thinks your program is out of memory. It will then exit with exit code 53:

$ fil-profile run oom.py 
...
=fil-profile= Wrote memory usage flamegraph to fil-result/2020-06-15T12:37:13.033/out-of-memory.svg
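
In an automated setting you can check for that exit code; here’s a minimal sketch, assuming a batch runner written in Python:

import subprocess

result = subprocess.run(["fil-profile", "run", "oom.py"])
if result.returncode == 53:
    # Fil detected an out-of-memory condition and wrote a report.
    print("Out of memory; see the report in fil-result/")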

Fil uses three heuristics to determine if the process is close to running out of memory:

  • A failed allocation, indicating insufficient memory is available.
  • The operating system or a memory-limited cgroup (e.g. a Docker container) only has 100MB of RAM available.
  • The process’s swap usage is larger than available memory, indicating heavy swapping by the process. In general you want to avoid swapping, and e.g. explicitly use mmap() if you expect to be using disk as a backfill for memory (see the sketch after this list).
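
For example, NumPy can create an explicitly file-backed mmap() (an illustrative sketch; the file name and shape are made up):

import numpy as np

# A ~1.6GB array backed by a file on disk; the operating system
# pages chunks in and out as needed, instead of swapping your
# whole process.
arr = np.memmap("scratch.dat", dtype=np.float64, mode="w+",
                shape=(1024, 1024, 200))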

For a more detailed example of out-of-memory detection with Fil, see this article on debugging out-of-memory crashes.

Disabling the out-of-memory detection

Sometimes the out-of-memory detection heuristic will kick in too soon, shutting down the program even though in practice it could finish running. You can disable the heuristic by doing:

$ fil-profile --disable-oom-detection run yourprogram.py

Debugging memory leaks with Fil

Is your program suffering from a memory leak? You can use Fil to debug it.

Fil works by reporting the moment in your process’s lifetime when memory usage is highest. If your program has a memory leak, eventually the highest memory usage point is always the present, as leaked memory accumulates.

If for example your Python web application is leaking memory, you can:

  1. Start it under Fil.
  2. Generate lots of traffic that causes memory leaks.
  3. When enough memory has leaked that it’s noticeable, cleanly kill the process (e.g. Ctrl-C).

Fil will then dump a report that will help pinpoint the leaking code.

For a more in-depth tutorial, read this article on debugging Python server memory leaks with Fil.

Disabling browser pop-up reports

By default, Fil will open the result of a profiling run in a browser.

As of version 2021.04.2, you can disable this by using the --no-browser option (see fil-profile --help for details). You will still want to view the SVG reports in a browser, since they rely heavily on JavaScript.

If you want to serve the report files from a static directory using a web server, you can do:

$ cd fil-result/
$ python -m http.server

Reference

Learn about what Fil tracks, how Fil handles threading, and Fil’s limitations.

What Fil tracks

Fil will track memory allocated by:

  • Normal Python code.
  • C code using malloc()/calloc()/realloc()/posix_memalign().
  • C++ code using new (including via aligned_alloc()).
  • Anonymous mmap()s.
  • Fortran 90 explicitly allocated memory (tested with gcc’s gfortran; let me know if other compilers don’t work).

Still not supported, but planned:

  • mremap() (resizing of mmap()).

Maybe someday:

  • File-backed mmap(). The semantics are somewhat different than normal allocations or anonymous mmap(), since the OS can swap it in or out from disk transparently, so supporting this will involve a different kind of resource usage and reporting.
  • Other forms of shared memory, need to investigate if any of them allow sufficient allocation.
  • Anonymous mmap()s created via /dev/zero (not common, since it’s not cross-platform, e.g. macOS doesn’t support this).
  • memfd_create(), a Linux-only mechanism for creating in-memory files.
  • memalign(), valloc(), pvalloc(), reallocarray(). These are all rarely used, as far as I can tell.

Threading in NumPy (BLAS), Zarr, numexpr

In general, Fil will track allocations in threads correctly.

First, if you start a thread via Python, running Python code, that thread will get its own callstack for tracking who is responsible for a memory allocation.

Second, if you start a C thread, the calling Python code is considered responsible for any memory allocations in that thread.

This works fine... except for thread pools. If you start a pool of threads that are not Python threads, the Python code that created those threads will be responsible for all allocations created during the thread pool’s lifetime. Fil therefore disables thread pools for a number of commonly-used libraries.

Behavior impacts on NumPy (BLAS), Zarr, BLOSC, OpenMP, numexpr

Fil can’t know which Python code was responsible for allocations in C threads.

Therefore, in order to ensure correct memory tracking, Fil disables thread pools in BLAS (used by NumPy), BLOSC (used e.g. by Zarr), OpenMP, and numexpr. They are all set to use 1 thread, so calls should run in the calling Python thread and everything should be tracked correctly.

This has some costs:

  1. This can reduce performance in some cases, since you’re doing computation with one CPU instead of many.
  2. Insofar as these libraries allocate memory proportional to number of threads, the measured memory usage might be wrong.

Fil does this for the whole program when using fil-profile run. When using the Jupyter kernel, anything run with the %%filprofile magic will have thread pools disabled, but other code should run normally.
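
For reference, you can approximate the same thread-pool limiting yourself with environment variables these libraries commonly respect; this is a sketch (the exact variables depend on your builds), and they must be set before the libraries are imported:

import os

os.environ["OMP_NUM_THREADS"] = "1"       # OpenMP (and OpenMP-based BLAS)
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # OpenBLAS
os.environ["NUMEXPR_NUM_THREADS"] = "1"   # numexpr
os.environ["BLOSC_NTHREADS"] = "1"        # BLOSC

import numpy as np  # only import after the variables are set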

Limitations

Limited reporting of tiny allocations

While every single allocation is tracked, for performance reasons only the largest allocations are reported, with a minimum of 99% of allocated memory reported. The remaining <1% is highly unlikely to be relevant when trying to reduce usage; it’s effectively noise.

No support for subprocesses

This is planned, but not yet implemented.

Missing memory allocation APIs

See the list in the page on what Fil tracks.

No support for third-party allocators

On Linux, Fil replaces the standard glibc allocator with jemalloc, though this is an implementation detail that may change in the future.

On all platforms, Fil will not work with custom allocators like jemalloc or tcmalloc.

Getting help

If you need help using Fil, you can file an issue in Fil’s GitHub repository.

Understanding Fil4prod

In this section you’ll learn how Fil4prod differs from open source Fil, and how Fil4prod is made reliable enough for production.

Fil4prod vs open source Fil

Fil is an open source memory profiler for Python; Fil4prod is a proprietary memory profiler for Python. The difference is in their focus and capabilities.

Fil: offline memory profiling

Fil’s goal is to help data scientists, scientists, data engineers, and programmers identify memory usage problems in their code. In order to do so, it tracks every single memory allocation it can, large and small, and tries to make the profiling results as accurate as possible.

While this means Fil works well catching even small memory problems or leaks, it has downsides too:

  • Tracking all memory allocations has a performance cost.
  • In order to ensure good reporting, Fil makes some changes to how certain libraries run.
  • Fil’s mechanism for catching out-of-memory problems is a useful heuristic, but as a heuristic it could adversely impact production jobs.

Overall, Fil aims to give the best possible information, at the cost of performance and behavior differences from uninstrumented code.

Fil4prod: memory profiling in production

Restricting memory profiling to developer machines has its limitations:

  • Some problems only occur in production.
  • If a process takes 64GB of RAM and 12 hours to run, reproducing the problem locally can be difficult and slow.

Ideally, every single production job would have memory profiling enabled by default, just in case. If memory usage turns out to be too high, you won’t have to go back and rerun the job: you’ll already have a profiling report prepared.

In order to achieve this, Fil4prod emphasizes speed over accuracy. In particular, Fil4prod uses sampling, only tracking a subset of memory allocations. While this means much lower performance overhead, it has some caveats:

  • Results will only be useful for processes that allocate large amounts of memory; 500MB of RAM or more, say.
  • Reported callstacks for the smallest allocations may be wrong.

In practice, for data-intensive batch jobs with high memory usage, both these caveats are irrelevant. If you’re trying to figure out why you’re using 16GB of RAM, you’ll care about the multi-gigabyte or 100s-of-megabyte sources of allocation, and the fact that a 1MB allocation is reporting the wrong callstack doesn’t really matter.

Writing software that’s reliable enough for production

How do you write software that’s reliable enough to run in production?

Fil4prod is a Python memory profiler intended for always-on profiling of production data processing jobs. Critically, Fil4prod runs inside the process that is being profiled. As a result, a failure in Fil4prod can in theory crash the user’s program, or even corrupt data. This, clearly, is unacceptable.

As I implemented Fil4prod I had to address this problem to my satisfaction. Here is what I have done, and what I plan to do next.

Guiding principles

  1. Do no harm. Failures in Fil4prod should not affect the running program.
  2. Fail fast. If breaking the running program is unavoidable, fail as early as possible, and with a meaningful error.

Choice of programming language: Rust

Writing a memory profiler has certain constraints:

  1. It needs to be fast, since it will be running in a critical code path.
  2. The language probably shouldn’t use garbage collection, both for performance reasons and since memory reentrancy issues are one of the more annoying causes of problems in memory profilers.

To expand on reentrancy: if you’re capturing malloc() calls, having the profiler then call malloc() itself is both a performance problem and a risk of recursively blowing up the stack. So it’s best to know exactly when allocation and freeing of memory happen, so they can be done safely.
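
Here is a minimal sketch of what guarding against reentrancy can look like (not Fil4prod’s actual code, and note the caveat about Rust’s thread locals later in this document):

use std::cell::Cell;

thread_local! {
    // True while this thread is inside the profiler’s own bookkeeping.
    static IN_PROFILER: Cell<bool> = Cell::new(false);
}

// Hypothetical stand-in for the real recording logic.
fn record(_size: usize) {}

fn track_allocation(size: usize) {
    IN_PROFILER.with(|flag| {
        if flag.get() {
            // We got here via our own allocation; don't recurse.
            return;
        }
        flag.set(true);
        record(size); // bookkeeping that may itself allocate
        flag.set(false);
    });
}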

Rust fulfills both these criteria, but comes with many other benefits compared to C or C++:

  • Memory safety and thread safety.
  • Enforces handling all possible values of enums; Rust’s compiler will complain if you don’t handle all cases.
  • Similarly, Result objects (the main way to get errors) must be handled; you can’t just drop them on the floor.
  • No NULL or nil; there’s Option<T>, but the branch coverage requirement means both cases will be handled, and it’s explicitly nullable, vs. e.g. C or C++ where any pointer can be NULL.

Caveat: unsafe in libraries

Rust has an escape hatch from its safety model: unsafe. Third-party libraries that Fil4prod depends on can use this to cause undefined behavior bugs, much like C/C++ code.

I am trying to mitigate this by choosing popular libraries that have had some real-world testing, but longer term I might also change my choice of libraries.

Caveat: unsafe in Fil4prod

Fil4prod spends much of its time talking to C APIs: for memory allocation, and to deal with CPython interpreter internals. Doing so inherently requires opting out of Rust’s safety.

This is mitigated by providing safe wrappers around unsafe APIs. For example, to ensure I’m not passing around a pointer that might be NULL, I could do:


use libc::c_void;

/// Wrapper around void* that maps to an allocation from libc.
pub struct Allocation {
    // If pointer is NULL this will be `None`, otherwise `Some(pointer)`.
    pointer: Option<*mut c_void>,
}

impl Allocation {
    // Wrap a new pointer.
    pub fn wrap(pointer: *mut c_void) -> Self {
        let pointer = if pointer.is_null() {
            None
        } else {
            Some(pointer)
        };
        Self { pointer }
    }

    pub fn malloc(size: usize) -> Self {
        Self::wrap(unsafe { libc::malloc(size) })
    }

    // ... other APIs
}

The use of Option means any time I try to get at the underlying pointer, Rust’s compiler will complain if the None case isn’t handled, so long as the unwrap() and expect() APIs aren’t used. (An even more succinct implementation would use std::ptr::NonNull::new().)
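
That more succinct variant would look something like this (a sketch):

use std::ffi::c_void;
use std::ptr::NonNull;

/// The same wrapper, expressed with the standard library's
/// non-null pointer type.
pub struct Allocation {
    pointer: Option<NonNull<c_void>>,
}

impl Allocation {
    pub fn wrap(pointer: *mut c_void) -> Self {
        // NonNull::new() returns None for NULL, Some(...) otherwise.
        Self { pointer: NonNull::new(pointer) }
    }
}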

One approach I haven’t taken is trying Miri, a tool that will catch some bugs in unsafe code. From the documentation, it seems like it won’t work with FFI. Since FFI is the only reason Fil4prod uses unsafe, it seems like Miri would be difficult or impossible to use, and not particularly helpful.

Caveat: Rust limitations

  • Rust’s thread locals aren’t quite sufficient (they’re slow and using them can allocate memory, which means reentrancy).
  • Implementing C-style variable arguments to functions is not yet supported; this is necessary for capturing mremap().

Hopefully both issues will be fixed in stable Rust; for now there’s a tiny bit of C code required.

Preventing panics in Rust

Certain APIs in Rust will panic if data is in an unexpected state; e.g. Option<T>::unwrap() will panic if the value is None. Unlike a segfault, panics are thread-specific and can be recovered from. But while panics in Fil4prod’s internal threads can be handled gracefully, panics in the application’s threads could take down the whole program if they hit FFI boundaries. The goal, then, is to avoid panics as much as possible.

Sometimes this is done by using non-panicking APIs. In the case of Option<T>, there are other APIs to extract T that will not panic.
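
For example, with illustrative values:

fn main() {
    let value: Option<u64> = None;

    // value.unwrap() would panic here; these alternatives never do:
    let a = value.unwrap_or(0);                // fall back to a default
    let b = value.map(|v| v * 2).unwrap_or(0); // transform, then default
    if let Some(v) = value {                   // or handle Some explicitly
        println!("got {}", v);
    }
    println!("{} {}", a, b);
}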

In other cases, panics can be avoided by appropriate error handling. In a normal program, shutting down might be fine if log initialization fails, but Fil4prod should just keep running and live without logs.

In order to enforce a lack of panics, the Clippy linter is used to catch Rust APIs that can cause panics. Normal integer arithmetic is also avoided, to prevent bugs caused by overflows; saturating APIs are used instead.


#![deny(
    clippy::expect_used,
    clippy::unwrap_used,
    clippy::ok_expect,
    clippy::integer_division,
    clippy::indexing_slicing,
    clippy::integer_arithmetic,
    clippy::panic,
    clippy::match_on_vec_items,
    clippy::manual_strip,
    clippy::await_holding_refcell_ref
)]

Clippy is run as part of CI.
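
Saturating arithmetic, for example, clamps at the type's bounds instead of overflowing:

fn main() {
    let total: usize = usize::MAX - 10;
    // `total + 100` would overflow: a panic in debug builds, silent
    // wraparound in release builds. saturating_add() clamps instead.
    assert_eq!(total.saturating_add(100), usize::MAX);
}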

Additionally, assert_panic_free is used to assert that there are no panics in critical code paths (see also: no_panic, dont_panic, panic-never). Unfortunately the way it works is not ideal: it doesn’t identify which particular code can panic, and it can’t always correctly deduce whether code is panic-free.

When panics might happen anyway

Given the use of third-party libraries, there are parts of the code where it’s much harder to prove that panics are impossible. In these situations, I use std::panic::catch_unwind to catch any panics that might occur.

In addition, a panic hook is used to disable profiling on panics; when this happens, the user’s Python program should hopefully continue as normal.
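
A sketch of both techniques together (not Fil4prod's actual code):

use std::panic;

fn install_hook() {
    // Runs on any panic; a real implementation would disable
    // profiling here rather than just logging.
    panic::set_hook(Box::new(|info| {
        eprintln!("profiler panic, disabling profiling: {}", info);
    }));
}

fn track_allocation() {
    // Catch any panic rather than letting it unwind across an FFI
    // boundary into the user's program.
    let result = panic::catch_unwind(|| {
        // ... bookkeeping that might panic ...
    });
    if result.is_err() {
        // Profiling gets disabled; the user's program keeps running.
    }
}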

Prototyping

The open source Fil profiler acted as a prototype for Fil4prod. By writing Fil first:

  • I was able to spot potential issues in advance (sometimes by encountering them in the wild).
  • Some of the code is shared between the two, and as a result has had real-world testing before Fil4prod was even released.
  • Fil4prod is in some ways a redesign, based on lessons learned from Fil.

Automated testing

Fil4prod has plenty of automated tests, both low-level unit tests and end-to-end tests. Some points worth covering:

Coverage marks

One useful technique when testing is the coverage mark: the ability to mark a certain branch in the code, and then have a test assert “that branch was executed in this test.” Much of what Fil4prod does is pretending to be exactly the same as normal malloc() while doing something slightly different internally for tracking purposes. Coverage marks allow me to ensure black-box tests are hitting the right code path.

Property-based testing

When possible, property-based testing is used to generate a wide variety of test cases automatically. I’m using the proptest library for Rust.
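
A toy example of the style (not a real Fil4prod test):

use proptest::prelude::*;

proptest! {
    // proptest generates many random sizes and checks that the
    // property holds for all of them.
    #[test]
    fn doubling_never_shrinks(size in 0usize..1_000_000) {
        prop_assert!(size.saturating_add(size) >= size);
    }
}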

End-to-end tests

Fil4prod is designed to run inside a Python process, so for reliable testing it is critical to have tests that run a full Python program with Fil4prod injected. The flawed “test pyramid” notion of lots of unit tests and only a tiny number of end-to-end tests doesn’t apply in this particular situation: it’s necessary to have plenty of both.

Contracts and debug assertions

Fil4prod uses pre- and post-contracts, plus debug assertions, to ensure invariants are being followed. Of course, these are disabled in the release build for performance reasons. So to ensure correctness, the end-to-end tests are actually run twice: once with the release build, and once with debug assertions enabled.
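
For example, a debug assertion with a made-up invariant:

fn record_allocation(size: usize, total_allocated: &mut usize) {
    // Checked in debug/test builds, compiled away in release builds:
    debug_assert!(size > 0, "zero-sized allocations are filtered earlier");
    *total_allocated = total_allocated.saturating_add(size);
}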

Panic injection testing

Some of Fil4prod’s tests make certain “failpoints” panic, using a technique similar to the fail crate. This allows testing that unexpected failures in Fil4prod won’t impact the running program.

Environmental assertions on startup

Fil4prod has certain environmental invariants: for example, it must run with the version of Python it was built for. Mismatches can happen: the open source version of Fil had a build system bug where code compiled against Python 3.6 was packaged for Python 3.9, leading to segfaults.

In addition, for performance reasons Fil4prod sometimes requires transgressive programming, violating abstraction boundaries and relying on internal details of glibc and CPython. These details are only likely to change every few years, with a major release, so it’s highly unlikely users will encounter them, but this is still a risk factor.

To prevent mysterious crashes, all of these invariants are tested on startup. If the checks fail, Fil4prod will cause the program to exit early with a useful error message. This is much better than segfaulting later in some arbitrary part of user code, which is both hard to debug and could in theory lead to corrupted user data.
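
A sketch of the idea (the specific check and the helper are made up):

use std::process;

// Hypothetical stand-in for querying the interpreter's version.
fn running_python_version() -> (u32, u32) {
    (3, 9)
}

fn check_environment() {
    let expected: (u32, u32) = (3, 9);
    let actual = running_python_version();
    if actual != expected {
        // Exit early with a clear message, instead of maybe
        // segfaulting later in arbitrary user code.
        eprintln!(
            "fil4prod: built for Python {}.{}, but running under Python {}.{}",
            expected.0, expected.1, actual.0, actual.1
        );
        process::exit(1);
    }
}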

Dependency due diligence

In selecting libraries to depend on, I try to pick reasonable dependencies; for example, all other things being equal, a library with a large user base is likely better than a library almost no one uses. But there are also some automated tests, in particular using Rust’s advisory database to ensure no dependencies have known security advisories, soundness issues, or are unmaintained.

Next steps, a partial list

User testing

There’s only so far internal processes can get you: testing software in production is in the end the only way to find certain problems. If you’d like to get memory profiling automatically for all your production batch jobs, send me an email to participate in the alpha program.

Rudra

Rudra is a static analyzer for Rust that can catch certain unsoundness issues in unsafe code. I should run it on Fil4prod.

Other potential approaches to panic reduction

findpanics is a tool that finds panics using binary analysis of compiled code. rustig is similar, but seems even less maintained.

Try to reduce unsafe in third-party libraries

All other things being equal, a library using unsafe is more likely to have unsoundness bugs than a library that doesn’t use safe. It may be possible to switch some of Fil4prod’s dependencies to safer alternatives.