Implementing a Container Runtime Part 1: Spawning Processes on Linux

2024-11-13

by Remi Bernotavicius <remi@abort.cc>

Spawning child processes using your programming language’s provided APIs can be very straightforward in a modern language. For example Rust’s std::process:Command provides an easy interface:

use anyhow::Result;

fn std_command_spawn_wait() -> Result<()> {
    let mut child = std::process::Command::new("/bin/echo")
        .stdout(std::process::Stdio::null())
        .spawn()?;

    let status = child.wait()?;
    assert!(status.success());

    Ok(())
}

These APIs make it really easy to get things right. They are the first thing you should reach for. Sometimes, though, you may find yourself needing to do things that just aren’t supported by these simple APIs. We found ourselves in this position working on Maelstrom, since we run each test in its own set of Linux namespaces. Maybe you too want to do something with namespaces, or maybe you want to use a pidfd. If that’s the case, then you might need to dig deeper and discover what the underlying APIs are capable of.

Of course, once you dig deeper, you might quickly find yourself confused. On Linux there are several APIs that all spawn a child process. There is fork, vfork, posix_spawn and clone. So which one do you pick? How are they different?

This article tries to answer these questions. Also, at the end I provide a simple flow chart that I hope makes it easy for you to decide which approach to take.

The code examples I provide will be in Rust, which is what Maelstrom is written in, but the information should apply to all programming languages. The examples will use the maelstrom_linux crate which is our own wrapper around libc and the kernel specific to the Maelstrom project. You may want to write your own wrappers or use some crate like nix.

The fork and exec model

The classic way of spawning a child process on Linux and Unix is to use two syscalls together, fork and exec. fork creates a copy of the current process. exec loads and executes a new program in the current process.

Splitting it up into two different syscalls allows the developer to do a bunch of setup in the child process just by writing regular code. Having this be separated into two calls with each doing one thing, and the ability to write this setup as plain code, is sometimes referred to as the “elegance” of fork.

The following code snippet runs echo with no arguments which acts as a kind of no-op for benchmarking purposes.

use anyhow::Result;
use maelstrom_linux::{self as linux, ExitCode, WaitStatus};

fn fork_execve_wait(dev_null: &impl linux::AsFd) -> Result<()> {
    if let Some(child) = linux::fork()? {
        // This is executed by the parent
        let wait_result = linux::wait()?;
        assert_eq!(wait_result.pid, child);
        assert_eq!(wait_result.status, WaitStatus::Exited(ExitCode::from_u8(0)));
        Ok(())
    } else {
        // This is executed by the child
        // Do the housekeeping before we call execve
        linux::dup2(dev_null, &linux::Fd::STDOUT).unwrap();
        linux::execve(c"/bin/echo", &[None], &[None]).unwrap();
        unreachable!()
    }
}

I will refer to this setup period in the child before calling exec as the “housekeeping”. In this housekeeping code you can imagine doing a number of different things, most of them being syscalls. This can include configuring file descriptors, sessions, users, groups, or namespaces for the child. Also, since exec is a separate function, you can choose to not call it and instead just continue executing the current program.

fork in Multi-Threaded Programs

fork seems pretty simple until you start to consider multi-threaded programs. When you call fork in a multi-threaded program only the calling thread is copied from the parent process. This creates a strange programming environment in the child. In particular locks won’t function properly since a thread holding a lock can just disappear after the fork! The idea of avoiding certain things after forking I will refer to as “fork safety.”

Fork safety can sometimes be tricky to maintain. Simple things like allocating memory may not work due to the aforementioned locking problem, and memory allocations are often everywhere. A real gotcha here are signal handlers, they are copied from the parent so it means that can be called in the child but they have a high chance of accessing some global memory requiring a lock.

One solution to this problem is to employ a programming pattern known as “zygote”. In this pattern we fork a child process (called the zygote) before we become multi-threaded and then tell our zygote via IPC to spawn any further children we want.

+---------+         +---------+          +---------+  +-------+
| parent  |         | zygote  |          | kernel  |  | child |
+---------+         +---------+          +---------+  +-------+
     |                   |                    |           |
     | IPC request       |                    |           |
     |------------------>|                    |           |
     |                   |                    |           |
     |                   | fork               |           |
     |                   |------------------->|           |
     |                   |                    |           |
     |                   |                    | spawn     |
     |                   |                    |---------->|
     |                   |                    |           |
     |                   |                    |      exec |
     |                   |                    |<----------|
     |                   |                    |           |
     |                   |                    |      exit |
     |                   |                    |<----------|
     |                   |                    |           |
     |                   |      wait response |           |
     |                   |<-------------------|           |
     |                   |                    |           |
     |      IPC response |                    |           |
     |<------------------|                    |           |
     |                   |                    |           |

The zygote remains single-threaded so we can write the housekeeping code as before. Having to communicate with the zygote is a little annoying, but makes things easy to get right at least from a fork safety perspective.

fork is Too Slow!

It turns out that fork can be pretty slow in some contexts. A large part of the time is spent copying the virtual memory mappings of the parent, and the time it takes scales linearly with the amount of memory mapped. One way around this is the aforementioned “zygote” pattern. Since the “zygote” process is spawned early, it shouldn’t have much memory mapped. This doesn’t completely avoid the problem though, because copying the virtual memory of a small process is still slow.

Graph of Fork Calls

The graph shows how the time of the fork call increases linearly with the amount of memory mapped. It would be great to avoid this slow process of copying the virtual memory mappings altogether which is slow even if your program is not using much memory (like the zygote is.)

vfork to the Rescue

The vfork syscall is like fork except it doesn’t copy the parent’s virtual memory mappings and instead shares the same memory space. The call is very similar to fork and seems like it could be a drop-in replacement, but its own man page cautions against this (see below.)

The child process thread shares the exact same memory as the parent calling thread (writes in the child appear in the parent) including the stack. Having two different threads (or in this case processes) use the same stack at the same time simultaneously like this, doesn’t work. So the calling thread in the parent process is suspended until the child calls exec or _exit.

+---------+               +---------+         +-------+
| Parent  |               | Kernel  |         | Child |
+---------+               +---------+         +-------+
     |                         |                  |
     | vfork                   |                  |
     |------------------------>|                  |
     |                         |                  |
     |                         | Create child     |
     |                         |----------------->|
     |                         |                  | ---------------------\
     |                         |                  |-| Returns from vfork |
     |                         |                  | |--------------------|
     |                         |                  | --------------------\
     |                         |                  |-| Does housekeeping |
     |                         |                  | |-------------------|
     |                         |                  |
     |                         |             exec |
     |                         |<-----------------|
     |                         |                  |
     |      Returns from vfork |                  |
     |<------------------------|                  |
     |                         |                  |

(We didn’t include a code snippet because we can’t call this function from Rust unless we use the unstable ffi_return_twice attribute)

vfork has two new sources of potential issues. The first is the fact that we are sharing the same memory space. Care must be taken to not unintentionally modify memory in the parent in some way that will cause issues. This drawback is tied directly to the performance improvement we want (avoiding copying the virtual memory mappings.)

The second source of potential issues is the fact that the child ends up executing on the same stack as the parent. This won’t work in general without some form of support from the compiler. This is because we “return twice” from the vfork call (see the diagram above.) The gcc returns_twice attribute does this, but it may not provide as much support as you might expect.

Here is a quote from tldp about this https://tldp.org/HOWTO/Secure-Programs-HOWTO/avoid-vfork.html

“…it’s actually fairly tricky for a process to not interfere with its parent, especially in high-level languages. The “not interfering” requirement applies to the actual machine code generated, and many compilers generate hidden temporaries and other code structures that cause unintended interference. The result: programs using vfork(2) can easily fail when the code changes or even when compiler versions change.”

So what are you allowed to do in the housekeeping exactly? Let’s check what the man page says about this.

“..the behavior is undefined if the process created by vfork() .. calls any other function before successfully calling _exit(2) or one of the exec(3) family of functions.“

This isn’t very helpful either. Clearly calling _exit or exec (but not every form of exec it turns out) is okay, but what other housekeeping is okay in practice? It’s not entirely clear, and researching across the internet leads to many others showing a fair amount of anxiety about this problem

On Linux, the behavior of vfork can be recreated using clone, (which we will cover later) in a way where we don’t have to share a stack. This is always preferable.

posix_spawn

Another way to try to get the speed up we want would be to use posix_spawn. The latest version of glibc always does the equivalent of calling vfork in its posix_spawn implementation.

posix_spawn is actually what std::process::Command tries to use internally for most cases.

use anyhow::Result;
use maelstrom_linux::{self as linux, ExitCode, WaitStatus};

fn posix_spawn_wait(dev_null: &impl linux::AsFd) -> Result<()> {
    let mut actions = linux::PosixSpawnFileActions::new();
    actions.add_dup2(dev_null, &linux::Fd::STDOUT)?;
    let attrs = linux::PosixSpawnAttrs::new();
    let child = linux::posix_spawn(c"/bin/echo", &actions, &attrs, &[None], &[None])?;
    let status = linux::waitpid(child)?;
    assert_eq!(status, WaitStatus::Exited(ExitCode::from_u8(0)));
    Ok(())
}

Calling posix_spawn comes with a whole lot fewer caveats and things to be careful of when compared to fork and vfork. It comes at the cost of losing the “elegance” of fork. You configure the housekeeping by using a struct. It is a kind of “housekeeping script” we create which executes after the fork.

The downside of using posix_spawn is that our housekeeping is limited to doing whatever things the housekeeping script has support for. (See the posix_spawn man page for a complete list of things.)

clone the API Underpinning it All

The aforementioned fork, and posix_spawn actually call clone under the hood in glibc. Also vfork inside the kernel ends up calling into the kernel’s clone code. It has the functionality of the previous APIs and a bunch of other features.

It can be more difficult to use though. Although, unlike vfork it allows the parent to allocate a separate stack for the child process, avoiding some of the issues with vfork.

use anyhow::Result;
use maelstrom_linux::{self as linux, ExitCode, WaitStatus};

struct ChildArgs {
    dev_null: linux::Fd,
}

/// This function executes in the child
extern "C" fn child_func(arg: *mut std::ffi::c_void) -> i32 {
    let arg: &ChildArgs = unsafe { &*(arg as *mut ChildArgs) };

    linux::dup2(&arg.dev_null, &linux::Fd::STDOUT).unwrap();
    linux::execve(c"/bin/echo", &[None], &[None]).unwrap();
    unreachable!()
}

fn clone_clone_vm_execve_wait(dev_null: &impl linux::AsFd) -> Result<()> {
    const CHILD_STACK_SIZE: usize = 1024; // 1 KiB of stack should be enough
    let mut stack = vec![0u8; CHILD_STACK_SIZE];

    // We need to pass the file-descriptor for /dev/null through to the child.
    let child_args = ChildArgs {
        dev_null: dev_null.fd(),
    };

    // Clone virtual memory, and give us SIGCHLD when it exits.
    let args = linux::CloneArgs::default()
        .flags(linux::CloneFlags::VM | linux::CloneFlags::VFORK)
        .exit_signal(linux::Signal::CHLD);

    // The function accepts a pointer to the end of the stack.
    let stack_ptr: *mut u8 = stack.as_mut_ptr();
    let child = unsafe {
        linux::clone(
            child_func,
            stack_ptr.wrapping_add(CHILD_STACK_SIZE) as *mut _,
            &child_args as *const _ as *mut _,
            &args,
        )
    }?;

    let wait_result = linux::wait()?;
    assert_eq!(wait_result.pid, child);
    assert_eq!(wait_result.status, WaitStatus::Exited(ExitCode::from_u8(0)));
    Ok(())
}

The CLONE_VM flag (via linux::CloneFlags::VM) avoids copying the virtual memory from the parent.

The CLONE_VFORK flag suspends the parent until the child calls exec or exits. Passing this flag allows us to not worry about waiting for the right moment to free the child’s stack memory in the parent. If we don’t pass this flag though, we are able to do other things in the parent in parallel, but then we need some way to know when we can free the stack memory. One way is to share a pipe or socket with the child which is CLOEXEC, once the child calls exec (or exits) this pipe or socket will close.

Graph of Fork Calls

This graph looks much better than the one for fork. It is a pretty constant speed and faster than the fastest fork!. About the same performance is found with posix_spawn.

Finally, let’s compare all the options (minus vfork because we can’t call it) for a program without much memory mapped.

ran std::process::Command::{spawn + wait} 10000 times in 5.400743724s (avg. 540.074µs / iteration)
ran fork + execve + wait 10000 times in 4.068149021s (avg. 406.814µs / iteration)
ran posix_spawn + wait 10000 times in 2.918496041s (avg. 291.849µs / iteration)
ran clone(CLONE_VM) + execve + wait 10000 times in 2.907074411s (avg. 290.707µs / iteration)
ran clone(CLONE_VM | CLONE_VFORK) + execve + wait 10000 times in 2.883697173s (avg. 288.369µs / iteration)

Rust’s std::process::Command comes in at the slowest even though it should be comparable to posix_spawn, this could be due to differences in the housekeeping or other things the std code is doing.

Conclusion

Let’s tie it all together with a flow graph about what to use.

                                  start
                                    V
+--------------+    yes    +---------------------+
| posix_spawn  | <-------- |simple housekeeping? |
+--------------+           +---------------------+
                                    |no
                                    |
                                    V
                           +---------------------+
                           |performance critical?|--\
                           +---------------------+  |
                                    |no             |
                                    |               | yes
                                    V               |
           +------+    yes   +----------------+     |
           | fork | <--------|single threaded?|     |
           +------+          +----------------+     |
                                    |no             |
                                    |               |
                                    V               |
        +------+   yes   +-----------------------+  |
        |zygote|<--------|foolproof to implement?|  |
        +------+         +-----------------------+  |
                                    |no             |
                                    |               |
                                    V               |
                +------+   yes   +------+           |
                |vfork |<--------|POSIX?|<----------/
                +------+         +------+
                                    |no
                                    |
                                    V
                                 +------+
                                 |clone |
                                 +------+

Addendum

You can check out working code for the snippets and benchmarks here

Be sure to check back for the next part of this article series where we dive into the code in Maelstrom that calls clone.