I have been trying to use nushell as my daily driver, and it's pretty cool! It has the terseness that draws people to POSIX shells, but offers a nicer syntax, just the right amount of type safety for a shell language, and ultimately feels worth it.

But I found a weird performance issue while exploring some of the features. When looking at open files, I wanted to turn lsof's string output into a structured table (normally doable with from ssv -m 1):

let $files = lsof | from ssv -m 1

This command would take a while (lsof takes a while!), but weirdly, after running it my command prompt got much slower. Every time I would send a command, even an empty line, it would take several seconds to give me a new prompt.

It was a nasty performance bug, one that didn't show up when I just did lsof | from ssv -m 1 by itself, so I intuited quite quickly that this might be a problem with "lots of data in the execution environment". lsof's output was ~220k lines, so maybe nu was unhappy with that table size.

I decided to try and fix this issue, and hopefully figure out how to get better at debugging in the process. I cloned the repository, ran cargo build, then cargo run, and successfully reproduced the issue on main.

On top of that, I ended up crashing my desktop while testing this across a couple of shells at once, requiring a hard reboot. From this I assumed it was a memory issue, but did not verify.

First, to protect myself from further memory usage explosions and lockups, I tried running with ulimit, which was a thing I knew existed. It's supposed to let you (among other things) limit the memory usage of processes spawned by a shell after you set the limit. Unfortunately, processes spawning subprocesses get around all of this, and ulimit did basically nothing useful for me. Filed under "things that are probably never worth even trying to use".

systemd, for once, had a straightforward thing! Running the following let me limit memory usage to 100 megs:

systemd-run --user --scope -p MemoryMax=100M cargo run

I first made sure this actually did something by setting MemoryMax to 1M and watching nu crash. The fact that systemd can provide this functionality does lead me to question ulimit's behavior, compatibility questions aside. But at least I learned this and can write myself some aliases or the like to get what I want!

At this point I finally decided to characterize the problem a bit more. The first attempt was perf record: I called perf record -F 99 -a -g -p $(pgrep nu) in a separate (zsh) shell, and got some output. After this I used Brendan Gregg's flamegraph scripts to generate a flamegraph that should supposedly show me what nu is doing all that time it's seemingly hanging.

Doing this leads me to a nice-looking flamegraph... with a lot of things tagged [unknown]! It's quite frustrating to go from the tracing profiling offered by Python's profile libs to sampling profiling where you can feel a lot of stuff missing. I am 100% certain I'm doing something wrong here, but perf has often let me down like this, giving me half victories.

Rebuilding with cargo run --profile profiling made builds take much longer, but perf record was not really offering much more information. I was definitely doing something wrong, but I moved forward with the little chunks of info I could intuit from the flamegraph.

Two semi-major things I could see in the flamegraph were direnv showing up, and calls to Value::clone. Nothing like the "this single function is taking 80% of walltime"-style symptom I was expecting to find, though.

I use direnv, so I first assumed nu had built-in support. I don't really find anything about direnv in the nushell repo beyond docs, and I don't find much beyond docs on direnv's site either. Where is this coming from? Answer: my own config (generated by home-manager, thank you for the integration work!).

I decide I need more tracing information from nushell. I had heard about USDT probes that can be used with bpftrace, and I have Brendan Gregg's big BPF performance tools book, so I was flipping through that to find something that might help.

sudo bpftrace -l 'usdt:/mnt/tool/proj/nushell/target/profiling/nu:*' should show if there are any available USDT probes. There are none.

I finally decide I'm going to try integrating some probes via the Rust usdt crate, which means I actually have to open the codebase. I look around for the REPL entrypoint.

I see that there's already a perf call in the project! I'm a bit disappointed that it's just calling info with some stuff, but it's better than nothing. I run nu --log-level info and get a stream of useful-looking timings!

When reproducing the bug, I see that creating my prompt went from taking ~1ms to almost 1 second! At first this makes me think of my direnv dead end, but I just stare at the code a bit more.

I remember that in the flamegraph there were calls to Value::clone that took up some (though not all) of the time. I grep for clone in the top-level REPL Rust code. I get super lucky and see a call to stack.clone(). stack refers to a Stack object, which holds all the variables we have access to in our REPL.

I add some timing logging there, and confirm that the operation is indeed taking a long time after I've created a large variable. Hypothesis confirmed: nu is indeed affected by the "big" variable, and is slowing down because of it.
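
The timing itself was nothing fancy. A minimal sketch of the kind of instrumentation I added around the call site (the names here are illustrative, not the exact nushell code):

let start = std::time::Instant::now();
let stack_copy = stack.clone();
eprintln!("stack.clone() took {:?}", start.elapsed());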

I decide to figure out how often stack.clone() is happening by changing Stack's clone method to something with logging. I see a handful of calls, and immediately realize I don't know where they're coming from. I break out gdb, add a breakpoint, and find where these calls are. Great!
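
For the counting, something like the following does the job. This is a hypothetical hand-written Clone impl over the fields shown further down in this post; the real nushell code may derive Clone and differ in the details:

use std::sync::atomic::{AtomicUsize, Ordering};

static STACK_CLONES: AtomicUsize = AtomicUsize::new(0);

impl Clone for Stack {
    fn clone(&self) -> Self {
        // Count and log every clone so the REPL's cloning pattern shows up in the output.
        let n = STACK_CLONES.fetch_add(1, Ordering::Relaxed) + 1;
        eprintln!("Stack::clone call #{n}");
        Stack {
            vars: self.vars.clone(),
            env_vars: self.env_vars.clone(),
            env_hidden: self.env_hidden.clone(),
            active_overlays: self.active_overlays.clone(),
            recursion_count: self.recursion_count,
        }
    }
}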

Extra logging helps me see a couple of things:

  • in a fresh session, cloning is fast
  • a big variable makes cloning slow
  • there seem to be about 5 calls to clone per command run in a REPL session

In a way this is a bit disappointing. I went in hoping to use a proverbial jackhammer to destroy a tiny bug, and in the end print-based debugging "saved the day". This is fine here because I was working on an easily accessible, relatively clean codebase. But the stronger tools should, in theory, help me with trickier problems, and here they only barely helped. I want to get better with them.

I've identified that a lot of cloning is happening. What exactly was happening in the REPL?

On each prompt, nu was cloning the environment a couple of times to hand out to the autocomplete code and the highlighter, plus some other interspersed clones. Why clone? These systems needed mutable references to the environment, and Rust doesn't want us to have more than one mutable reference floating around.

Could we make this better? The pseudocode of what was going on is roughly as follows:

loop {
    let read_input = reedline
        .with_autocomplete(Autocomplete::new(stack.clone()))
        .with_highlighter(Highlighter::new(stack.clone()))
        .with_prompt(PromptGenerator::new(stack.clone()))
        .get_input();

    do_thing(read_input);
}

Here we are sharing the environment across a lot of different objects, so that we can have segregated code logic that is easy to manage. This means handing out a bunch of copies of the environment, because many of these objects ended up needing a mutable version of it.

But does an autocompleter or a highlighter need more than a read-only reference to the environment? Probably not! Looking at the code for the autocompleter and highlighter confirms to me that switching them to a read-only reference should be straightforward.

The catch, of course, is that the prompt generator (what gives us the command line prompt) needs a writable environment. Why? Because the prompt can be configured by the user, so they can provide functions that create variables and read off of those. There is possibly a way to make things work read-only, but it would involve creating some mechanism to evaluate code with a read-only environment. Too big of a task for me.

Even if we do all the work to get everything read-only, so long as one thing needs write access, we likely will need to have some wrapper or reference to get anything done.

This leads to a thing familiar to many people writing code in Rust: whether to make an object intrinsically shareable or not.

Currently we have a structure as follows:

pub struct Stack {
    /// Variables
    pub vars: Vec<(VarId, Value)>,
    /// Environment variables arranged as a stack to be able to recover values from parent scopes
    pub env_vars: Vec<EnvVars>,
    /// Tells which environment variables from engine state are hidden, per overlay.
    pub env_hidden: HashMap<String, HashSet<String>>,
    /// List of active overlays
    pub active_overlays: Vec<String>,
    pub recursion_count: u64,
}

We have a bunch of data, and we can have one mutable reference to it at a time. But in another world, we could do something like the following:

pub struct Stack {
    data: Arc<Mutex<StackData>>,
}

struct StackData {
    pub vars: Vec<(VarId, Value)>,
    // ... the rest of the fields from the original Stack
}

In short, stacks become just pointers to reference-counted data. In this model, we can transparently create many references to the same underlying stack data, give Stack helper methods that lock the mutex and then access the data, and get ergonomics closer to a high-level language than to C.
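
To make that concrete, here is a sketch of the kind of helper method I mean. with_data is a name I made up for illustration; it is not something in nushell:

impl Stack {
    // Hide the locking from callers: lock, run the closure, unlock on return.
    pub fn with_data<R>(&self, f: impl FnOnce(&mut StackData) -> R) -> R {
        let mut data = self.data.lock().expect("stack mutex poisoned");
        f(&mut data)
    }
}

// Usage: any number of owners, each locking only for the duration of the closure.
let var_count = stack.with_data(|data| data.vars.len());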

But then you pay a two-pronged cost:

  • all of your stacks are now reference-counted objects, and you pay that cost even when it's not needed
  • all of your helper methods end up looking like: lock the data, do the thing, then return the locked data

This is kind of fine if the underlying data can itself be cloned, but the moment you want more nuanced locking strategies, you're lost.

In this case, amongst all the various usages of Stack, I was looking at something specific to the REPL (remember, there's the shell, but also just the language), and a handful of clones for REPL-specific code. It would be easier for me to adjust the code for these components than to pay a cost across the board just for this.

So... I went to work wrapping the stack used in the REPL code.

On a practical level this involves changing &Stacks to Arc<Mutex<Stack>>s in a couple of places. I went for a type alias to make this a bit nicer:

type ShareableStack = Arc<Mutex<Stack>>;

I also added a helper method:

    pub fn into_shareable(self) -> ShareableStack {
        Arc::new(Mutex::from(self))
    }

This has the added bonus of making it quick to port existing code, to test whether all of this effort is even worth it.
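
Wrapping the existing stack at the REPL boundary is then a one-liner (a hypothetical call site, not the actual nushell code):

// Wherever the REPL previously owned a plain Stack:
let stack: ShareableStack = existing_stack.into_shareable();
// Downstream code then locks the mutex instead of borrowing the stack directly.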

But there's a practical thing here: we are actually introducing locking in a place where there previously was none.

Code like

let result = do_thing_with_stack(&mut self.stack)

needed to handle the fact that self.stack was now an Arc<Mutex<Stack>>. Most of the time this could be done straightforwardly:

let result = {
    let mut stack = self.stack.lock().expect("Shared stack deadlock");
    do_thing_with_stack(&mut stack)
};

The important thing is that self.stack.lock() locks the value for writing, and when the local stack guard falls out of scope, the mutex is unlocked again.

Unfortunately this pattern involves adding extra indentation. Aesthetics do matter for this stuff IMO, and in a handful of places I ended up doing something like the following instead:

let mut locked_stack = self.stack.lock().expect("Shared stack deadlock");

do_thing_1(&mut locked_stack);
do_thing_2(&mut locked_stack);
locked_stack.call_mutating_method();
// we don't need the lock anymore
drop(locked_stack);

// actually done with stack mutations now
more_stuff();
even_more_stuff();

I did this in a handful of places where it just felt better than introducing indentation, but I'm not sure that's the right decision.


200-or-so lines of changes later, I have CI green, my own performance issue fixed, and a PR in review. There were some very quick and thoughtful reactions, and I'm hopeful that some variation of my change will actually get merged in.

This was an afternoon of digging and trying to fix a problem. I had gone in hoping to improve my profiling chops, but hit a bunch of dead ends there. gdb helped me diagnose where the relevant calls to clone() were, and ultimately I feel very lucky to have been working on code where there were really only a handful of usages.

Things like building nu in debug mode and without name stripping didn't seem to help tools provide much more information, but just printing out log messages worked well enough. It is personally unsatisfying because I feel like I got lucky finding the issue.

My notes included some vague learnings, but really I just struggle to get perf to give me the magical feeling I'd get with Python's profile + snakeviz. I understand they're different toolkits, though. I was hoping that while writing up this experience I'd come across some deeper realizations, but I fundamentally didn't.