Property testing is a very helpful testing technique. At its core, it involves doing the following:

  • Generate a "random" value
  • Test whether some property holds for that value
  • If the property fails, "shrink" the value to try and simplify it, and see if the test still fails

So in practice, you might have some property that your system should validate, something like the following:

for all (list: List), reversed(reversed(list)) == list

Your property testing framework can test this by generating a bunch of "random" values. If it finds a failure, it gives it to you, but first will try to simplify it (for example, by trying to remove elements from a list). The end result is a (hopefully) minimal counter-example to your test case.

This sort of testing is by no means a formal proof, but it can help find a lot of shallow bugs. And it's way more convincing than having 5 hand-crafted unit tests for the same property.
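To make the loop above concrete, here is a hand-rolled sketch of it in Rust, with a toy xorshift generator and a naive element-dropping shrinker (real frameworks like proptest or QuickCheck handle all of this for you; everything here is just for illustration):

```rust
/// toy xorshift PRNG, so this sketch has no dependencies
struct Rng(u64);

impl Rng {
    fn next(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// the property under test: reversing twice is the identity
fn property(list: &[u64]) -> bool {
    let mut twice = list.to_vec();
    twice.reverse();
    twice.reverse();
    twice == list
}

fn main() {
    let mut rng = Rng(42);
    for _ in 0..100 {
        // generate a "random" list
        let len = (rng.next() % 10) as usize;
        let list: Vec<u64> = (0..len).map(|_| rng.next() % 100).collect();
        if !property(&list) {
            // shrink: repeatedly drop single elements, keeping any
            // smaller list that still fails the property
            let mut failing = list;
            loop {
                let mut shrunk = false;
                for i in 0..failing.len() {
                    let mut candidate = failing.clone();
                    candidate.remove(i);
                    if !property(&candidate) {
                        failing = candidate;
                        shrunk = true;
                        break;
                    }
                }
                if !shrunk {
                    break;
                }
            }
            panic!("property failed; minimal counter-example: {failing:?}");
        }
    }
    println!("property held for all generated inputs");
}
```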

This testing strategy works less well for imperative flows, though. With imperative objects, what you usually want to test is that, after doing a bunch of operations, your mutated object is still in a good state. Fortunately, there's a pretty easy pattern to turn an imperative, stateful flow into one that fits the property testing model nicely.


I am working on a performance improvement to a struct, Stack, inside of nushell. This structure can hold various variables. Currently, at various points in the REPL, we clone Stack objects, resulting in all of the variable state getting copied. This can be a costly operation when there are a lot of variables (or even just one "big" variable). So I have been looking at improving the cost of this cloning.

One idea that came up in discussions is to share the underlying data for a Stack when cloning. Simply speaking, a cloned Stack could still be mutated, but on lookups, if a key wasn't set on the clone, then we would look at the "parent" stack instead.

In practice, a Stack would have a parent_stack pointer, pointing to an underlying Stack object. Instead of doing a raw clone of our original Stack, in certain circumstances we instead create a new Stack with the parent_stack pointing to this cloned stack.
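A minimal sketch of that lookup-with-fallback idea, using a toy stack with plain usize keys and String values (the real nushell Stack is more involved, and uses Arc rather than Rc; this only illustrates the parent-pointer mechanics):

```rust
use std::collections::HashMap;
use std::rc::Rc;

// toy stand-in for nushell's Stack: VarId is just usize here,
// and values are plain Strings
#[derive(Default, Debug)]
struct Stack {
    vars: HashMap<usize, String>,
    parent: Option<Rc<Stack>>,
}

impl Stack {
    fn with_parent(parent: Rc<Stack>) -> Self {
        Stack { vars: HashMap::new(), parent: Some(parent) }
    }

    fn add_var(&mut self, key: usize, value: String) {
        self.vars.insert(key, value);
    }

    // lookup: check our own map first, then fall back to the parent
    fn get_var(&self, key: usize) -> Option<&String> {
        self.vars
            .get(&key)
            .or_else(|| self.parent.as_ref().and_then(|p| p.get_var(key)))
    }
}

fn main() {
    let mut base = Stack::default();
    base.add_var(1, "hello".into());

    // instead of base.clone(), share the data via a parent pointer
    let mut child = Stack::with_parent(Rc::new(base));
    child.add_var(2, "world".into());

    // key 1 was never set on the child, so the lookup falls back to the parent
    assert_eq!(child.get_var(1).map(String::as_str), Some("hello"));
    assert_eq!(child.get_var(2).map(String::as_str), Some("world"));
    println!("lookups fall back to parent: ok");
}
```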

So in the old world we might do let new_stack = stack.clone(). Now we would instead do let new_stack = Stack::with_parent(stack). And, after some performance tests, we might realize that this is in fact safe and great.

But how do we convince ourselves that the old world and the new world don't have behavioral differences?

The first thing is to make this idea more formal. A behavioral difference between stack.clone() and Stack::with_parent(stack) would manifest itself in new_stack.get_var(key)'s result being different depending on which route we took.

But it's not just about the state of the structure after the clone. We want there to be no behavior differences even after we continue to do various things with the stack!

For example:

let mut with_clone = stack.clone();
let mut with_parent = Stack::with_parent(Arc::new(stack));
// let's do some operation on the cloned stack
with_clone.add_var(key_1, value_1.clone());
with_clone.remove_var(key_2);
// let's do the same operations on the parent-having stack
with_parent.add_var(key_1, value_1);
with_parent.remove_var(key_2);

// at this point we have done the same thing with both stacks,
// so both stacks should still "look the same"
assert_eq!(
  with_clone.get_var(key_1),
  with_parent.get_var(key_1),
);

// here, key_2 should no longer be present, so `.get_var` returns None
// the behavior of absent keys also matters!
assert_eq!(
  with_clone.get_var(key_2),
  with_parent.get_var(key_2),
);

The ideas here:

  • we want with_clone and with_parent to return the same thing for lookups with "any key", after any number of operations
  • the operations we care about here are:
      • adding a variable
      • removing a variable
      • the cloning operation itself

A sketch of what a property test handling this looks like:

/// the kind of operations we can do
enum Operation {
    /// clone (either .clone() or by creating a new stack with a "parent stack")
    Clone,
    /// add a variable
    AddEnvVar {
        key: VarId,
        value: Value,
    },
    /// remove a variable
    RemoveEnvVar {
        key: VarId,
    },
}

fn test_parenting_equivalent_to_clone() {
    let mut using_clone = Stack::new();
    let mut using_parent = Stack::new();

    let all_keys = get_random_keyset();
    let operations = get_random_list_of_operations(&all_keys);

    for operation in operations {
        match operation {
          Operation::AddEnvVar {key, value} => {
              using_clone.add_var(key, value.clone());
              using_parent.add_var(key, value);
          }
          Operation::RemoveEnvVar {key} => {
              using_clone.remove_var(key);
              using_parent.remove_var(key);
          }
          Operation::Clone => {
             // here we are considering the idea of cloning,
             // but we are going to try our "performant" clone just on one 
             // object, and look for drifts in behavior with that.

             // slow version: just clone 
             using_clone = using_clone.clone();
             // faster version: new stack will have a parent
             using_parent = Stack::with_parent(Arc::new(using_parent));
          }

        }
        // after each operation, make sure we haven't desynced 
        verify_invariants(&using_clone, &using_parent, &all_keys);
    }
}

fn verify_invariants(stack_a: &Stack, stack_b: &Stack, keys: &[VarId]) {
    // just check that every possible key we might have used returns the same result
    // (this likely includes keys that have not been used for any operation yet!)
    for key in keys {
        assert_eq!(stack_a.get_var(key), stack_b.get_var(key));
    }
}

Running test_parenting_equivalent_to_clone would generate a random set of operations, and run those, checking that our property of the stacks being "the same" holds in between each step.

The biggest assumptions that this sketch of the test makes are:

  • we assume we only care about the operations listed in Operation
  • we assume that we only care about the result of .get_var when it comes to two stacks being "equivalent".

This is either fine or totally wrong, depending on the context. Ultimately, property tests are for convincing, not proving. Like with other tests, we ultimately are not running every possible kind of operation.


Now that we have a plan of attack, let's walk through actually implementing this. I used proptest to implement property testing here.

I'm going to walk through some of the actual code I wrote to do this, since when I started working on this I would have liked to see a walkthrough like it. I'm not going to go too deep into the details of proptest; it has good documentation in general.

First I created an enumeration that helped to capture the kind of operations I cared about:

// VarId is just usize, Value is a wrapper around values in nushell

#[derive(Debug, Clone)]
enum Operation {
    Clone,
    AddVar {
        key: VarId,
        value: Value,
    },
    RemoveVar {
        key: VarId,
    },
}

Then I need to create a value generator for the property tests. This is the Strategy trait in proptest.

A Strategy tells us how to generate a random value of a certain type, and also how to make a "simpler" version of a value (for example, turning a 3-element list into a 2-element list). This simplification (called shrinking) lets proptest generate "minimal" reproductions. Most of the time, at least.

First, a Strategy to generate a single Operation. We are going to pass in keys and values to this strategy, and that will be used as a set of "random" keys and values to take from.

// this imports a bunch of goodies
use proptest::prelude::*;

fn get_operation(
    // passing in keys as an Rc, to avoid a bunch of copying
    keys: Rc<Vec<VarId>>,
    values: Rc<Vec<Value>>,
) -> impl Strategy<Value = Operation> {
    // avoiding capture shenanigans due to the closures
    let (keys_len, val_len) = (keys.len(), values.len());
    // annoying but kinda inevitable... the closures below want to take ownership
    let keys_1 = keys.clone();
    let keys_2 = keys.clone();

    // prop_oneof! takes a list of strategies, and gives a strategy that
    // will generate a value "at random" among the choices
    prop_oneof![
        // one kind of value: just clone
        // Just is a strategy that will only give one specific value
        Just(Operation::Clone),
        // another kind of value...
        // take a key index and a value index 
        // then generate an AddVar value based on that

        // we have a tuple containing 2 strategies
        // - 0..keys_len gives a strategy giving us a usize between 0 and keys_len
        // - 0..val_len gives a strategy giving us a usize between 0 and val_len
        //
        // we can then call `prop_map` on that strategy to map the generated pair 
        // from those two strategies to a new value. `prop_map` thus gives us a new
        // strategy
        //
        // if input_prop implements Strategy<Value = T>, and f: T -> U, then
        // input_prop.prop_map(f) implements Strategy<Value = U>
        (0..keys_len, 0..val_len).prop_map(
            move |(key_idx, val_idx)| {
                Operation::AddVar {
                    // a common trick: in order to pick a random value from a
                    // vector, pick a random index, then just index into the vector
                    key: keys_1[key_idx].clone(),
                    value: values[val_idx].clone(),
                }
            }
        ),
        // this strategy picks a random key, and gives an operation that _removes_ the variable
        (0..keys_len).prop_map(move |key_idx| {
            Operation::RemoveVar {
                key: keys_2[key_idx].clone(),
            }
        })
    ]
}

With the strategy to generate operations, we can now create a strategy that generates a list of operations.

This is also where we are going to generate all the kinds of keys we will use in our operations. We generate a bunch of random keys and values, and then generate operations that will use some of those.

On a practical level, we return the keys used. This will let us verify our invariants later.

fn get_operations() -> impl Strategy<Value = (Vec<Operation>, Rc<Vec<VarId>>)> {
    (
        // key space. Our keys are just numbers,
        // so take 1-50 numbers between 0 and 1000.
        prop::collection::vec((0 as usize)..(1000 as usize), 1..50),
        // value space. Here we are generating 1-50
        // strings of 5 characters (this is generating random values)
        // matching the regular expression
        prop::collection::vec(".{5}", 1..50),
    )
        .prop_flat_map(|(keys, values)| {
            let keys = Rc::new(keys);
            let keys_clone = keys.clone();
            let values = Rc::new(
                // transforming Vec<String> into Vec<Value>
                values
                    .into_iter()
                    .map(|s| Value::String {
                        val: s,
                        // example of thing I am choosing to not care about
                        // (position in code)
                        internal_span: Span::new(0, 0),
                    })
                    .collect(),
            );
            (
                // this gives a strategy that generates a vector of 1 to 100 items of Operation
                // using the strategy returned by get_operation
                proptest::collection::vec(get_operation(keys, values), 1..100),
                // Just is a strategy that always just returns the inner value
                // a strategy to just give back the keys
                Just(keys_clone),
            )
        })
}

Above, we made a couple decisions:

  • the keyspace is limited to numbers between 0 and 1000
  • the values we assign are just strings, and we haven't bothered generating other values
  • we only generate up to 100 operations

This means I could theoretically write code that breaks for keys above 1000, or that would only break on the 101st operation. Again, property testing is about convincing, not proving!

Next up, we have our invariants. The invariant verification is simple: just do the check you wanted to do!

fn verify_invariants(a: &Stack, b: &Stack, keys: &Vec<VarId>) {
    for key in keys {
        let clone_lookup = a.get_var(*key, Span::new(0, 0));
        let child_lookup = b.get_var(*key, Span::new(0, 0));
        // sometimes assert_eq! is fine.
        // For this, though, sometimes it's better to just explicitly
        // write out an error message that spits out the state of your system
        if clone_lookup != child_lookup {
            panic!(
                "
                Lookup comparison failed, on key {}.

                Clone lookup result was {:?}
                Child lookup result was {:?}

                Child structure is {:?}

                ",
                key, clone_lookup, child_lookup, b
            );
        }
    }
}

Finally, we can get to the actual test.

// the proptest! macro sets up a bunch of the "random value generation" structure
// around calling strategies. The proptest docs go into what this macro does in more 
// detail

proptest! {
    #[test]
    fn test_parenting_equivalent_to_clone(
        // here we are asking for a value generated via the get_operations strategy
        // we will get a list of operations, and the keyset that is relevant
        (operations, keys) in get_operations()
    ) {
        let mut using_clone = Stack::new();
        let mut using_child = Stack::new();
        for operation in operations {
            match operation {
                Operation::Clone => {
                    // using_clone will just clone
                    using_clone = using_clone.clone();
                    // using_child will instead consume the existing result, and then create a child
                    let parent_stack = std::mem::take(&mut using_child);
                    using_child = Stack::with_parent(Arc::new(parent_stack));
                },
                Operation::AddVar {key, value} => {
                    using_clone.add_var(key, value.clone());
                    using_child.add_var(key, value);
                },
                Operation::RemoveVar {key} => {
                    using_clone.remove_var(key);
                    using_child.remove_var(key);
                },
            }
            // after doing an operation, let's actually check our invariants hold
            verify_invariants(&using_clone, &using_child, &keys);
        }
    }
}

And that's it! Now if I run this test, proptest will generate a bunch of random operations, run them, and see if my invariants hold. In my case, it did fail!

The original output was quite large, though, showing about 40 operations to cause the bug. I didn't believe that this was really a minimal failing case. Instead, I tweaked the strategies to reduce the search space:

// simplify the key space down to up to 5 keys...
prop::collection::vec((0 as usize)..(1000 as usize), 1..5),

...
// originally the values were 5 random strings. This meant the 
// strings had the "unicode garbage" look. Let's just use 
// a-f
// 
// we probably don't need that many values to find an issue either
prop::collection::vec("[a-f]{5}", 1..5),


// 10 operations is probably enough to cause the issue
proptest::collection::vec(get_operation(keys, values), 1..10)

By reducing the state space, proptest can more quickly get to an error, and spend more iterations trying to shrink down the state space.

I later found out that setting PROPTEST_MAX_SHRINK_ITERS to a "big number" (in my case, 10000), can also get you a good amount of shrinking, even if your initial search space is huge. But the default shrink iteration count might not be very satisfactory on larger structures.
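Setting it is just an environment variable on the test invocation (the test name here is the one from this post; substitute your own):

```shell
# allow up to 10000 shrink iterations for this run
PROPTEST_MAX_SHRINK_ITERS=10000 cargo test test_parenting_equivalent_to_clone
```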

After having done this, my failures were shrunk down to a simple 4-element sequence.

Clone lookup result was Err(VariableNotFoundAtRuntime { span: Span { start: 0, end: 0 } })
Child lookup result was Ok(String { val: "abfbd", internal_span: Span { start: 0, end: 0 } })

Child structure is <snipped large output>

minimal failing input: (operations, keys) = (
    [
        AddVar {
            key: 0,
            value: String {
                val: "aaaaa",
                internal_span: Span {
                    start: 0,
                    end: 0,
                },
            },
        },
        Clone,
        AddVar {
            key: 0,
            value: String {
                val: "aaaaa",
                internal_span: Span {
                    start: 0,
                    end: 0,
                },
            },
        },
        RemoveVar {
            key: 0,
        },
    ],
    [
        0,
    ],
)

Something nice about this is that, thanks to the operations being printed out like this, I could easily turn this failing case into a simple unit test, and use that to debug the failure. While the test itself is state-based, ultimately our failure cases are simple, serializable, "copy-paste into Slack"-able, and pretty friendly.

The input lists what operations we took to get to the failing state. In this case, there was a bug where deletions were being tracked on "parented" stacks but not properly handled, meaning deleted keys could reappear later on in the sequence.
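On a toy model, the shape of the fix is to record deletions as "tombstones" on the child stack, and consult them before falling back to the parent. This is a simplified illustration with usize keys and String values; the actual nushell fix differs in its details:

```rust
use std::collections::{HashMap, HashSet};
use std::rc::Rc;

// A child stack can't actually delete a key from its (shared) parent,
// so it records the deletion locally and checks that record on lookup.
#[derive(Default, Debug)]
struct Stack {
    vars: HashMap<usize, String>,
    removed: HashSet<usize>, // tombstones for deleted keys
    parent: Option<Rc<Stack>>,
}

impl Stack {
    fn with_parent(parent: Rc<Stack>) -> Self {
        Stack { parent: Some(parent), ..Default::default() }
    }

    fn add_var(&mut self, key: usize, value: String) {
        self.removed.remove(&key); // re-adding revives the key
        self.vars.insert(key, value);
    }

    fn remove_var(&mut self, key: usize) {
        self.vars.remove(&key);
        self.removed.insert(key); // shadow any parent entry
    }

    fn get_var(&self, key: usize) -> Option<&String> {
        if let Some(v) = self.vars.get(&key) {
            return Some(v);
        }
        if self.removed.contains(&key) {
            return None; // deleted here: do NOT fall back to the parent
        }
        self.parent.as_ref().and_then(|p| p.get_var(key))
    }
}

fn main() {
    // replay the minimal failing sequence: add, clone, add, remove
    let mut stack = Stack::default();
    stack.add_var(0, "aaaaa".into());
    let mut child = Stack::with_parent(Rc::new(stack));
    child.add_var(0, "aaaaa".into());
    child.remove_var(0);

    // without the tombstone check, this lookup would find the parent's
    // entry and the deleted key would "reappear"
    assert_eq!(child.get_var(0), None);
    println!("deleted key stays deleted: ok");
}
```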

The fun thing about this invariant testing is that when it does find the bug, it's fairly good at giving you the minimal reproduction. The second nice thing is that if you have such a test suite checked into your code, the moment you think of some other property to test, it's usually a couple lines of code.

I have a strategy to generate operations. I could pretty easily add a new operation I decide to care about, and have existing tests become more powerful. I could write new tests just relying on this strategy. One thing I didn't do here would be to have the Operation enum itself implement the behavior through an impl, to make our "inner loop" just be:

for operation in operations {
    stack = operation.do_thing(stack)
    // etc
    check_invariants(stack)
}
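A sketch of what that could look like, using a plain HashMap to stand in for the stack state (the method name `apply` and the toy types here are mine, not nushell's):

```rust
use std::collections::HashMap;

// moving the behavior onto the enum itself, so tests just loop
// over operations and apply each one
#[derive(Debug, Clone)]
enum Operation {
    AddVar { key: usize, value: String },
    RemoveVar { key: usize },
}

impl Operation {
    // apply this operation to a stack-like structure
    fn apply(&self, vars: &mut HashMap<usize, String>) {
        match self {
            Operation::AddVar { key, value } => {
                vars.insert(*key, value.clone());
            }
            Operation::RemoveVar { key } => {
                vars.remove(key);
            }
        }
    }
}

fn main() {
    let operations = vec![
        Operation::AddVar { key: 0, value: "aaaaa".into() },
        Operation::RemoveVar { key: 0 },
    ];
    let mut vars = HashMap::new();
    for operation in operations {
        operation.apply(&mut vars);
        // invariant checks would go here, after each step
    }
    assert!(vars.is_empty());
    println!("ops applied via enum impl: ok");
}
```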

This doesn't prove that my code is right, but in my opinion it does more convincing than writing 50 or so tests. Short of pulling out the pen and paper and formally proving the model, in many cases this will cover a lot of business logic very well!

The biggest class of code that this can't really handle is code that involves very specific values.

Stuff like:

if some_string == "A_VERY_SPECIFIC_STRING" { 
    // do thing
}

will fall through the cracks, unless you are careful to make sure that your strategies include any magic values like that. You have to be a bit thoughtful of what coupling your input might need. But for a lot of code, throwing random values at the wall really will uncover nasty bugs.