Writing software that's reliable enough for production

How do you write software that’s reliable enough to run in production?

Sciagraph is a profiler intended to support always-on profiling of production data processing jobs. Critically, Sciagraph runs inside the process that is being profiled. As a result, failure in Sciagraph can in theory crash the user’s program, or even corrupt data. This is clearly unacceptable.

As I implemented Sciagraph I had to address this problem to my satisfaction. Here is what I have done, and what I plan to do next.

Guiding principles

Do no harm. Failures in Sciagraph should not affect the running program.
Fail fast. If breaking the running program is unavoidable, fail as early as possible, and with a meaningful error.

Choice of programming language: Rust

Writing a memory profiler has certain constraints:

It needs to be fast, since it will be running in a critical code path.
The language probably shouldn’t use garbage collection, both for performance reasons and since memory reentrancy issues are one of the more annoying causes of problems in memory profilers.

To expand on reentrancy: if you’re capturing malloc() calls, having the profiler then call malloc() itself is both a performance problem and a potential cause of runaway recursion that blows up the stack. So it’s best to know exactly when memory allocation and freeing happen, so that they can be done safely.
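As a rough illustration of the reentrancy problem (not Sciagraph’s actual code), a per-thread flag can make the tracking code skip itself when it is re-entered. The names `IN_PROFILER`, `ResetOnDrop` and `with_reentrancy_guard` are hypothetical, and the sketch uses Rust’s `thread_local!` purely for simplicity; as noted under “Caveat: Rust limitations”, Rust’s thread locals are not quite sufficient for a real malloc() hook.

```rust
use std::cell::Cell;

thread_local! {
    // Hypothetical per-thread flag: true while this thread is running
    // tracking code. A `const`-initialized `Cell<bool>` has no destructor.
    static IN_PROFILER: Cell<bool> = const { Cell::new(false) };
}

/// Clears the flag when dropped, so the guard is released even if the
/// tracking closure panics and unwinds.
struct ResetOnDrop<'a>(&'a Cell<bool>);

impl Drop for ResetOnDrop<'_> {
    fn drop(&mut self) {
        self.0.set(false);
    }
}

/// Runs `track` unless this thread is already inside tracking code.
/// Returns `true` if `track` ran, `false` if the call was re-entrant.
pub fn with_reentrancy_guard<F: FnOnce()>(track: F) -> bool {
    IN_PROFILER.with(|flag| {
        if flag.get() {
            // Already tracking on this thread: skip to avoid recursion.
            return false;
        }
        flag.set(true);
        let _reset = ResetOnDrop(flag);
        track();
        true
    })
}
```

A malloc() hook would wrap its bookkeeping in `with_reentrancy_guard`, so any allocation made by the bookkeeping itself passes straight through to the real allocator without being tracked.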

Rust fulfills both these criteria, but comes with many other benefits compared to C or C++:

Memory safety and thread safety.
Enforces handling all possible values of enums; Rust’s compiler will complain if you don’t handle all cases.
Similarly, Result objects (the main way to get errors) must be handled; you can’t just drop them on the floor.
No NULL or nil; there’s Option<T>, but the branch coverage requirement means both cases will be handled, and it’s explicitly nullable, vs. e.g. C or C++ where any pointer can be NULL.
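A small sketch of what the enum and Result points look like in practice. The `AllocEvent` type and the function names are invented for illustration and are not taken from Sciagraph. Removing any arm of the `match` below is a compile error, and calling `parse_size` without using its result produces a compiler warning (which can be promoted to an error).

```rust
/// Hypothetical event type, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AllocEvent {
    Malloc(usize),
    Free(usize),
    Realloc { old: usize, new: usize },
}

/// Applies an event to a running total of tracked bytes.
/// The compiler rejects this `match` if any variant is left unhandled.
pub fn apply(total: usize, event: AllocEvent) -> usize {
    match event {
        AllocEvent::Malloc(size) => total.saturating_add(size),
        AllocEvent::Free(size) => total.saturating_sub(size),
        AllocEvent::Realloc { old, new } => total.saturating_sub(old).saturating_add(new),
    }
}

/// Parsing can fail, so the failure is part of the return type.
pub fn parse_size(text: &str) -> Result<usize, std::num::ParseIntError> {
    text.trim().parse::<usize>()
}

/// The caller has to decide what an error means; here it maps to zero.
pub fn size_or_zero(text: &str) -> usize {
    match parse_size(text) {
        Ok(size) => size,
        Err(_) => 0,
    }
}
```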

Caveat: `unsafe` in libraries

Rust has an escape hatch from its safety model: unsafe. Third-party libraries that Sciagraph depends on can use this to cause undefined behavior bugs, much like C/C++ code.

I am trying to mitigate this by choosing popular libraries that have had some real-world testing, but in the longer term I might also change my choice of libraries.

Caveat: `unsafe` in Sciagraph

Sciagraph spends much of its time talking to C APIs: for memory allocation, and to deal with CPython interpreter internals. Doing so inherently requires opting out of Rust’s safety.

This is mitigated by providing safe wrappers around unsafe APIs. For example, to ensure I’m not passing around a pointer that might be NULL, I could do:

use libc::c_void;

/// Wrapper around void* that maps to an allocation from libc.
pub struct Allocation {
    // If pointer is NULL this will be `None`, otherwise `Some(pointer)`.
    pointer: Option<*mut c_void>,
}

impl Allocation {
    // Wrap a new pointer.
    pub fn wrap(pointer: *mut c_void) -> Self {
        let pointer = if pointer.is_null() {
            None
        } else {
            Some(pointer)
        };
        Self { pointer }
    }
    
    pub fn malloc(size: usize) -> Self {
        Self::wrap(unsafe { libc::malloc(size) })
    }

    // ... other APIs
}

The use of Option means any time I try to get at the underlying pointer, Rust’s compiler will complain if the None case isn’t handled, so long as the unwrap() and expect() APIs aren’t used. (An even more succinct implementation would use std::ptr::NonNull::new().)
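For comparison, here is a sketch of the `NonNull` variant mentioned above. It uses only the standard library (so `std::ffi::c_void` rather than the `libc` crate) and omits the `malloc()` constructor; the `as_ptr` accessor is an illustrative addition.

```rust
use std::ffi::c_void;
use std::ptr::NonNull;

/// Wrapper around void*; `None` represents a NULL pointer.
pub struct Allocation {
    pointer: Option<NonNull<c_void>>,
}

impl Allocation {
    /// Wrap a raw pointer. `NonNull::new` returns `None` for NULL,
    /// replacing the explicit `is_null()` branch.
    pub fn wrap(pointer: *mut c_void) -> Self {
        Self {
            pointer: NonNull::new(pointer),
        }
    }

    /// Get the raw pointer back; callers must handle the `None` case.
    pub fn as_ptr(&self) -> Option<*mut c_void> {
        self.pointer.map(|p| p.as_ptr())
    }
}
```

`Option<NonNull<T>>` is guaranteed to have the same size as a raw pointer, so the wrapper adds no memory overhead.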

Caveat: Rust limitations

Rust’s thread locals aren’t quite sufficient (they’re slow and using them can allocate memory, which means reentrancy).
Implementing functions that take C-style variable arguments is not yet supported in stable Rust; this is necessary for capturing mremap().

Hopefully both issues will be fixed in stable Rust; for now there’s a tiny bit of C code required.

Preventing panics in Rust

Certain APIs in Rust will panic if the data is in an unexpected state, e.g. Option<T>::unwrap() will panic if the value is None. Unlike a segfault, panics are thread-specific and can be recovered from. But while panics in Sciagraph’s internal thread can be handled gracefully, panics in the application threads could take down the whole program if they hit FFI boundaries. The goal then is to avoid panics as much as possible.

Sometimes this is done by using non-panicking APIs. In the case of Option<T>, there are other APIs to extract T that will not panic.
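For example, these standard-library alternatives to `unwrap()` all force the `None` case to be dealt with explicitly. The function names are made up for illustration.

```rust
/// Fall back to a default value instead of panicking.
pub fn size_or_default(size: Option<usize>) -> usize {
    size.unwrap_or(0)
}

/// Handle both cases explicitly with pattern matching.
pub fn describe(size: Option<usize>) -> String {
    match size {
        Some(bytes) => format!("{bytes} bytes"),
        None => "unknown size".to_string(),
    }
}

/// Turn a missing value into an error the caller must handle.
pub fn require_size(size: Option<usize>) -> Result<usize, &'static str> {
    size.ok_or("size is missing")
}
```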

In other cases, panics can be avoided by appropriate error handling. In a normal program, shutting down might be fine if log initialization fails, but Sciagraph should just keep running and live without logs.

In order to enforce a lack of panics, the Clippy linter is used to catch Rust APIs that can cause panics. Normal integer arithmetic is also avoided, to prevent bugs caused by overflows; saturating APIs are used instead.

#![deny(
    clippy::expect_used,
    clippy::unwrap_used,
    clippy::ok_expect,
    clippy::integer_division,
    clippy::indexing_slicing,
    clippy::integer_arithmetic,
    clippy::panic,
    clippy::match_on_vec_items,
    clippy::manual_strip,
    clippy::await_holding_refcell_ref
)]

Clippy is run as part of CI.
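To make the effect of these lints concrete, here is a sketch of the kind of code they push you towards. The function names are illustrative; the methods themselves (`saturating_add`, `checked_div`, `get`) are part of Rust’s standard library.

```rust
/// Instead of `total + size`, which can overflow (and panics in debug builds).
pub fn add_allocation(total: usize, size: usize) -> usize {
    total.saturating_add(size)
}

/// Instead of `total / count`, which panics when `count` is zero.
pub fn average_size(total: usize, count: usize) -> Option<usize> {
    total.checked_div(count)
}

/// Instead of `frames[index]`, which panics when out of bounds.
pub fn frame_at(frames: &[u64], index: usize) -> Option<u64> {
    frames.get(index).copied()
}
```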

Additionally, assert_panic_free is used to assert that there are no panics in critical code paths (see also: no_panic, dont_panic, panic-never). Unfortunately the way it works is not ideal: it doesn’t identify which particular code could panic, and it can’t always deduce correctly whether code is panic-free.

When panics might happen anyway

Given the use of third-party libraries, there are parts of the code where it’s much harder to prove that panics are impossible. In these situations, I use std::panic::catch_unwind to catch any panics that might occur.
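A minimal sketch of that pattern follows. The helper name `run_contained` is hypothetical; `catch_unwind` and `AssertUnwindSafe` are the real standard-library APIs. Note that `catch_unwind` only works when panics unwind, so it has no effect in a build that uses `panic = "abort"`.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Runs `f`, converting a panic into `None` instead of letting it
/// unwind into the caller (for example, across an FFI boundary).
pub fn run_contained<T, F: FnOnce() -> T>(f: F) -> Option<T> {
    // `AssertUnwindSafe` asserts that any state `f` touches is still
    // usable after a panic; that is an assumption the caller must check.
    panic::catch_unwind(AssertUnwindSafe(f)).ok()
}
```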

In addition, a panic hook is used to disable profiling on panics; when this happens, the user’s Python program should hopefully continue as normal.
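One way such a hook could look, as a sketch under assumptions rather than Sciagraph’s real implementation: a global flag (the hypothetical `PROFILING_ENABLED`) is cleared by the hook, and the tracking code would check it before doing any work.

```rust
use std::panic;
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical global switch consulted by the tracking code.
static PROFILING_ENABLED: AtomicBool = AtomicBool::new(true);

pub fn profiling_enabled() -> bool {
    PROFILING_ENABLED.load(Ordering::SeqCst)
}

/// Installs a hook that turns profiling off as soon as any panic starts,
/// then delegates to the previously installed hook for reporting.
pub fn install_panic_hook() {
    let previous = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        PROFILING_ENABLED.store(false, Ordering::SeqCst);
        previous(info);
    }));
}
```

The panic hook is process-wide, so a real profiler embedded in someone else’s program would need to be careful about hooks the host program installs itself.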

Prototyping

The open source Fil profiler acted as a prototype for Sciagraph. By writing Fil first:

I was able to spot potential issues in advance (sometimes by encountering them in the wild).
Some of the code is shared between the two, and as a result has had real-world testing before Sciagraph was even released.
Sciagraph is in some ways a redesign, based on lessons learned from Fil.

Automated testing

Sciagraph has plenty of automated tests, both low-level unit tests and end-to-end tests. Some points worth covering:

Coverage marks

One useful technique when testing is coverage marks: the ability to mark a certain branch in the code, and then have a test assert “that branch was called in this test.” Much of what Sciagraph does is pretending to be exactly the same as normal malloc() while doing something slightly different internally for tracking purposes. Coverage marks allow me to ensure black-box tests are hitting the right code path.

For more details see here.
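The idea can be sketched by hand with a counter per mark. This is a simplified illustration, not the API of any particular coverage-mark library, and `Mark`, `ZERO_SIZE_PATH` and `tracked_size` are hypothetical names; a real implementation would typically compile the marks out of release builds.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// A named point in the code whose hits can be counted by tests.
pub struct Mark {
    hits: AtomicUsize,
}

impl Mark {
    pub const fn new() -> Self {
        Self { hits: AtomicUsize::new(0) }
    }

    /// Called from the branch being marked.
    pub fn hit(&self) {
        self.hits.fetch_add(1, Ordering::Relaxed);
    }

    /// Called from tests to check whether the branch ran.
    pub fn count(&self) -> usize {
        self.hits.load(Ordering::Relaxed)
    }
}

/// Hypothetical mark for a special-case branch.
pub static ZERO_SIZE_PATH: Mark = Mark::new();

/// Looks the same from the outside for every input, but zero-sized
/// requests take a different internal path.
pub fn tracked_size(requested: usize) -> usize {
    if requested == 0 {
        ZERO_SIZE_PATH.hit();
        1
    } else {
        requested
    }
}
```

A black-box test can then record `ZERO_SIZE_PATH.count()`, call the public API, and assert that the count went up, which confirms the test exercised the intended branch.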

Property-based testing

When possible, property-based testing is used to generate a wide variety of test cases automatically. I’m using the proptest library for Rust.

End-to-end tests

Sciagraph is designed to run inside a Python process, so for reliable testing it is critical to have tests that run a full Python program with Sciagraph injected. The flawed “test pyramid” notion of lots of unit tests and only a tiny number of end-to-end tests doesn’t apply in this particular situation: it’s necessary to have plenty of both.

Contracts and debug assertions

Sciagraph uses pre- and post-contracts, plus debug assertions, to ensure invariants are being followed. Of course, these are disabled in the release build for performance reasons. So to ensure correctness, the end-to-end tests are actually run twice: once with the release build, and once with debug assertions enabled.
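As a small illustration using only the standard `debug_assert!` macro (the function is hypothetical and the contracts are written out by hand): the checks run when debug assertions are enabled and compile to nothing otherwise, which is why running the end-to-end tests in both configurations matters.

```rust
/// Removes `freed` bytes from the running total of tracked memory.
pub fn record_free(tracked_bytes: usize, freed: usize) -> usize {
    // Precondition: we never free more than we have tracked.
    debug_assert!(
        freed <= tracked_bytes,
        "freed {freed} bytes but only {tracked_bytes} were tracked"
    );
    let remaining = tracked_bytes.saturating_sub(freed);
    // Postcondition: the total never grows as a result of a free.
    debug_assert!(remaining <= tracked_bytes);
    remaining
}
```

With debug assertions enabled, a violated precondition panics immediately and loudly; in a release build the saturating subtraction keeps the program running with a best-effort result.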

Panic injection testing

Some of Sciagraph’s tests make certain “failpoints” panic, using a technique similar to fail. This allows testing that unexpected failures in Sciagraph won’t impact the running program.
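A hand-rolled sketch of the idea, which does not use the fail crate’s actual API: a test-controlled switch (the hypothetical `FAIL_WHEN_RECORDING`) makes the internal bookkeeping panic, and the test checks that the caller still gets the normal result.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical failpoint: tests set this to force a panic.
pub static FAIL_WHEN_RECORDING: AtomicBool = AtomicBool::new(false);

/// Internal bookkeeping that might fail unexpectedly.
fn record(size: usize, total: &mut usize) {
    if FAIL_WHEN_RECORDING.load(Ordering::SeqCst) {
        panic!("injected failure");
    }
    *total = total.saturating_add(size);
}

/// What the application sees: the requested size always comes back,
/// whether or not the bookkeeping succeeded.
pub fn malloc_hook(size: usize, total: &mut usize) -> usize {
    // Contain any panic from the bookkeeping; the result is ignored
    // because tracking is best-effort.
    let _ = panic::catch_unwind(AssertUnwindSafe(|| record(size, total)));
    size
}
```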

Environmental assertions on startup

Sciagraph has certain environmental invariants: for example, it must run under the version of Python it was compiled for. Mismatches can happen: Fil, the open source memory profiler related to Sciagraph, had a build system bug where code compiled against Python 3.6 was packaged for Python 3.9, leading to segfaults.

In addition, for performance reasons Sciagraph sometimes requires transgressive programming, violating abstraction boundaries and relying on internal details of glibc and CPython. These details are only likely to change every few years, with a major release, so it’s highly unlikely users will encounter them, but this is still a risk factor.

To prevent mysterious crashes, all of these invariants are tested on startup. If the checks fail, Sciagraph will cause the program to exit early with a useful error message. This is much better than segfaulting later in some arbitrary part of user code, which is both hard to debug and could in theory lead to corrupted user data.
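A sketch of what one such startup check might look like, assuming the runtime Python version has already been obtained somehow. The `PythonVersion` type, the `BUILT_FOR` constant and both function names are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PythonVersion {
    pub major: u8,
    pub minor: u8,
}

/// Hypothetical: the version this build was compiled against.
pub const BUILT_FOR: PythonVersion = PythonVersion { major: 3, minor: 9 };

/// Checks the invariant and explains any mismatch.
pub fn check_python_version(runtime: PythonVersion) -> Result<(), String> {
    if runtime == BUILT_FOR {
        Ok(())
    } else {
        Err(format!(
            "profiler was built for Python {}.{} but is running under Python {}.{}",
            BUILT_FOR.major, BUILT_FOR.minor, runtime.major, runtime.minor
        ))
    }
}

/// Fail fast: exit at startup with a clear message rather than
/// segfaulting later in user code.
pub fn check_or_exit(runtime: PythonVersion) {
    if let Err(message) = check_python_version(runtime) {
        eprintln!("{message}");
        std::process::exit(1);
    }
}
```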

Dependency due diligence

In selecting libraries to depend on, I try to pick reasonable dependencies; for example, all other things being equal, a library with a large user base is likely better than a library almost no one uses. But there are also some automated tests, in particular using Rust’s advisory database to ensure no dependencies have known security advisories or soundness issues, or are unmaintained.

Next steps, a partial list

Use a standard library with debug assertions enabled via `cargo careful`

See https://www.ralfj.de/blog/2022/09/26/cargo-careful.html

Miri

Miri is a tool that will catch some bugs in unsafe code. From the documentation, it seems like it won’t work with FFI. Since FFI is the only reason Sciagraph uses unsafe, it seems like Miri would be difficult or impossible to use and not particularly helpful, but I still need to investigate this.

Rudra

Rudra is a static analyzer for Rust that can catch certain unsoundness issues in unsafe code. I should run it on Sciagraph.

Other potential approaches to panic reduction

findpanics is a tool that finds panics using binary analysis of compiled code. rustig is similar, but seems even less maintained.

Try to reduce `unsafe` in third-party libraries

All other things being equal, a library using unsafe is more likely to have unsoundness bugs than a library that doesn’t use unsafe. It may be possible to switch some of Sciagraph’s dependencies to safer alternatives.

Mutation testing

The cargo-mutants package seems promising.

Real-world usage

There’s only so far internal processes can get you: testing software in production is in the end the only way to find certain problems.