Assignment: SANDBOX

Changelog:

  • 9 April 2025: link up ipc.c/h; update .tgz to correct error in build.sh writing dadasubst instead of dadasubst.o and to add #!/bin/bash and mark it executable
  • 9 April 2025: document functions in ipc.h here
  • 14 April 2025: update main.c in dadasubst.tgz to move i += 4 to avoid infinite loop when output file already exists
  • 15 April 2025: replace $<long string>$ which rendered in HTML as $$ with $(long string)$
  • 15 April 2025: be more explicit that it’s okay if the program does not complete when the sandboxed code attempts to do something dangerous; note that IPC recv functions abort the program if the other end terminates without sending anything

Until around 3pm on 14 April, the main.c we supplied could enter an infinite loop when it failed to open one of the output files (for example, because the file already exists). If you already downloaded the skeleton code, you can fix this by moving the i += 4; from around line 50 to line 25.

Your Task

  1. We have supplied a text substitution library that lets one write something like

    $counter$. This is the first item.
    $counter$. This is the second item.
    $counter$. This was produced on $nodename$.
    

    and get text like

    1. This is the first item.
    2. This is the second item.
    3. This was produced on portal01.
    

    Our skeleton code [last updated 14 April 2025] can be built by running the script build.sh, which will build a utility dadasubst so that

    ./dadasubst in input1.txt out output1.txt in input2.txt out output2.txt ...
    

    will perform substitution on input1.txt, putting the result in output1.txt; on input2.txt, putting the result in output2.txt, and so on.

    Alternatively, dadasubst can take its input from the command line, as in

    ./dadasubst literal '$counter$. On $nodename$.'
    

    which will perform a substitution on the text given on the command line.

    Unfortunately, this substitution library is hideously insecure and is only provided as a pre-compiled object file, dadasubst.o. For example, supplying something like $(long string)$ will overflow a buffer on the stack, and $danger$ will try to create a file called “shouldnotbecreated.txt”.

    For reference, the source code for dadasubst.o is available here, but you must use an unmodified version of the precompiled file.

    Your job is to modify the dadasubst program so that it runs the text substitution library under system call filtering, preventing it from being (as) dangerous. When you are done, the $counter$ and $nodename$ substitutions must work properly, and your program must run an unmodified version of the substitution library code inside a seccomp-based sandbox. You should prevent the program from accessing any files it is not supposed to (even if an attacker who controls the input uses the stack overflow to achieve arbitrary code execution), opening network connections, and so on. When the program attempts to do something dangerous, it is okay if you do not produce useful output, exit early, etc.

    It is not required that you also prevent the program from using excessive resources (such as by allocating excess memory), but, obviously, a good sandboxing solution would do this.

  2. Submit your modified files to the submission site. Be sure to include an appropriately modified build.sh, and do not include any pre-compiled files.

Suggested method

  1. To use a seccomp-based sandbox, it would be easiest to run the library code in a separate process. We’ve supplied a small library to help with this, described below under the subheading “IPC library”. This library handles starting a “worker” process from the original “coordinator” process and communicating with that process.

  2. You can make the library operations occur in the worker process using the IPC library.

  3. I would recommend the following steps/design (a rough sketch of the coordinator side appears at the end of this list, and one of the worker side under “IPC library” below):
    • updating main() to start a worker process (using the IPC library’s start_worker_from_coordinator function). Make a separate executable for the worker process; you can supply the name of that executable as the command argument to start_worker_from_coordinator.
    • in the worker process, running a loop that calls recv_message_from_coordinator(), then calls dadasubst, and returns the result with send_message_to_coordinator.
    • replacing the original dadasubst calls with calls to a wrapper function that uses send_message_to_worker and recv_message_from_worker to perform the dadasubst calls.
    • having the worker process, after it starts up, set up a seccomp sandbox that limits it to the system calls it needs to make.
  4. If we only needed to read and write data between our separate process and the main process, we could use seccomp’s “strict” filter mode. Unfortunately, the substitution library uses uname() for $nodename$, which is not allowed by seccomp’s strict filter mode.

    See below regarding setting up a seccomp filter.

  5. You can figure out what system calls are necessary by using strace to examine which system calls a program makes. For example:

    strace -o trace.out /bin/ls
    

    will show you what system calls /bin/ls makes by writing them to trace.out.

    To use this to handle cases where multiple processes are run, you can use strace’s -f option to trace all the processes resulting from a command. For example

    strace -f -o trace.out /bin/sh -c 'python -c "print(1+1)"'
    

    will run the shell /bin/sh, and then /bin/sh will spawn a new process to run python. In trace.out, each line will be prefixed by its process ID.
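
  6. To make the design in step 3 concrete, here is a rough sketch of what the coordinator side might look like. It is only a sketch: the worker executable name (./dadasubst_worker) and the function names are made up for illustration, and the messages simply carry the text to substitute and the substituted result.

    /* Coordinator-side sketch only: dadasubst_worker and
       substitute_via_worker are hypothetical names, not part of
       the supplied skeleton. */
    #include <string.h>
    #include "ipc.h"

    /* Call once from main() before performing any substitutions. */
    void start_sandboxed_worker(void) {
        start_worker_from_coordinator("./dadasubst_worker");
    }

    /* Send text to the worker and return the substituted result.
       recv_message_from_worker aborts with a message if the worker
       terminated without replying (e.g. killed by its seccomp filter). */
    struct message substitute_via_worker(const char *text) {
        struct message request;
        request.size = strlen(text) + 1;   /* include the terminating NUL */
        request.data = (char *)text;
        send_message_to_worker(request);
        return recv_message_from_worker();
    }

    The coordinator would then write the returned message’s data to the appropriate output file (or print it) wherever it previously invoked the substitution library directly.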

IPC library

  1. Since implementing proper process spawning and interprocess communication is not a major subject of this class, we’ve supplied some code here that implements this functionality.

  2. In this library, there are two processes: a coordinator, and a worker, which must be started by the coordinator.

    The worker and coordinator can send and receive messages to and from each other.

  3. The library uses this simple message struct:

    struct message {
        size_t size;
        char *data;
    };
    

    to represent messages.

  4. The library implements the following functions:

    • void start_worker_from_coordinator(const char *command)

      Start a new “worker” process running the command-line command specified.

      The worker is run with pipes set up so it can communicate with the coordinator; information about these pipes is passed through environment variables, which are read by the send_message_to_coordinator and recv_message_from_coordinator functions.

      You can only start one worker at a time.

    • void send_message_to_worker(struct message message)

      Send a message to the worker process, which must have been started previously using start_worker_from_coordinator.

      The worker can receive the message with the recv_message_from_coordinator function.

    • struct message recv_message_from_worker()

      Receive a message from the worker, waiting for the message to be sent if necessary.

      If the worker terminates before sending anything, this function prints a message and aborts the program.

    • void send_message_to_coordinator(struct message message)

      Send a message to the coordinator. Can only be called from a worker started with start_worker_from_coordinator.

    • struct message recv_message_from_coordinator()

      Receive a message from the coordinator. Can only be called from a worker started with start_worker_from_coordinator.

      If the coordinator terminates before sending anything, this function prints a message and aborts the program.

    • void wait_for_worker_to_exit()

      Called from the coordinator, waits for the worker process to exit (normally or abnormally).
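
  5. As an illustration only, a worker process built on these functions might be structured roughly as below. Here run_substitution and install_sandbox are hypothetical placeholders: the former stands in for however you invoke the substitution library, and the latter for your seccomp setup (see the next section).

    /* Worker-side sketch: run_substitution and install_sandbox are
       placeholders, not functions supplied by the skeleton code. */
    #include <string.h>
    #include "ipc.h"

    extern char *run_substitution(const char *input); /* stand-in for the library call */
    extern void install_sandbox(void);                /* stand-in for the seccomp setup */

    int main(void) {
        install_sandbox();   /* sandbox before touching any untrusted input */
        for (;;) {
            /* Aborts (with a message) if the coordinator exits first,
               which is how this worker eventually terminates. */
            struct message request = recv_message_from_coordinator();

            char *result = run_substitution(request.data);

            struct message reply;
            reply.size = strlen(result) + 1;
            reply.data = result;
            send_message_to_coordinator(reply);
        }
    }

    Because recv_message_from_coordinator aborts once the coordinator exits, the worker does not need its own exit logic, which keeps the set of system calls it requires small.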

Setting up filters using libseccomp

  1. On portal, libseccomp is installed, which provides a relatively convenient interface for setting up a seccomp filter.

    For example, assuming CHECK is a macro that prints an error if its argument isn’t true (to handle failures such as running out of memory):

    scmp_filter_ctx filter = seccomp_init(SCMP_ACT_KILL_PROCESS);
    CHECK(seccomp_rule_add(filter, SCMP_ACT_ALLOW, SCMP_SYS(read), 0) == 0);
    CHECK(seccomp_rule_add(filter, SCMP_ACT_ALLOW, SCMP_SYS(write), 0) == 0);
    CHECK(seccomp_load(filter) == 0);
    

    sets up a filter such that the current process can only make the read or write system calls. Since installing a new filter itself requires system calls (such as prctl) that this filter does not allow, the filter cannot be changed after seccomp_load(filter) runs.

  2. man seccomp_rule_add gives information about the types of rules you can add to libseccomp’s filters and what types of actions (like ALLOW and KILL_PROCESS in the example above) are permitted.
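
  3. Putting the pieces above together, a sandbox-setup function for the worker might look roughly like the sketch below. The allow-list here is only a guess (it assumes the worker needs pipe I/O, uname for $nodename$, memory allocation via brk, and a clean exit); the set of calls your worker actually needs should be determined with strace as described earlier. Compile and link with -lseccomp.

    /* Sketch only: the allow-list below is a guess, not the required set. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <seccomp.h>   /* libseccomp; link with -lseccomp */

    #define CHECK(cond) \
        do { if (!(cond)) { fprintf(stderr, "CHECK failed: %s\n", #cond); exit(1); } } while (0)

    void install_sandbox(void) {
        /* Kill the process on any system call not explicitly allowed. */
        scmp_filter_ctx filter = seccomp_init(SCMP_ACT_KILL_PROCESS);
        CHECK(filter != NULL);

        /* Guessed allow-list; adjust to match what strace shows your
           worker actually using. */
        CHECK(seccomp_rule_add(filter, SCMP_ACT_ALLOW, SCMP_SYS(read), 0) == 0);
        CHECK(seccomp_rule_add(filter, SCMP_ACT_ALLOW, SCMP_SYS(write), 0) == 0);
        CHECK(seccomp_rule_add(filter, SCMP_ACT_ALLOW, SCMP_SYS(uname), 0) == 0);
        CHECK(seccomp_rule_add(filter, SCMP_ACT_ALLOW, SCMP_SYS(brk), 0) == 0);
        CHECK(seccomp_rule_add(filter, SCMP_ACT_ALLOW, SCMP_SYS(exit), 0) == 0);
        CHECK(seccomp_rule_add(filter, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0) == 0);

        CHECK(seccomp_load(filter) == 0);
        seccomp_release(filter);   /* the loaded filter stays in effect */
    }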

Setting up filters manually

  1. libseccomp is implemented atop a more flexible, but harder to use, system call interface. I do not think this interface is an efficient or useful way to do this assignment, but it provides background on what libseccomp is doing and on how you might implement more complex filters than libseccomp supports.

    In this interface, one specifies a filter program in BPF (Berkeley Packet Filter) format to identify which system calls to allow or disallow.

  2. BPF is a special assembly-like language that is designed to be run within the operating system efficiently. It is called “Berkeley Packet Filter” because it was originally made for the BSD operating system to support efficient network packet filtering. BPF programs are passed to the kernel as an array of instructions in what is essentially a simple machine code, similar to what you may have used in CS 2130. You can see a description of the instruction set and its machine-code-like encoding here.

    There are two versions of BPF: the original one, where programs have access to a single accumulator, and an extended one (sometimes called eBPF), where they have 11 registers. We’ll use the simpler single-accumulator version below rather than deal with the complications of encoding additional register usage.

  3. For seccomp, a loaded BPF program is run every time the process makes a system call. While the BPF program runs, it has access to a struct seccomp_data as its memory:

       struct seccomp_data {
           int   nr;                   /* System call number */
           __u32 arch;                 /* AUDIT_ARCH_* value
                                          (see <linux/audit.h>) */
           __u64 instruction_pointer;  /* CPU instruction pointer */
           __u64 args[6];              /* Up to 6 system call arguments */
       };
    

    The BPF program runs until it reaches a “return” instruction that identifies how to handle the system call, with the SECCOMP_RET_* constants from <linux/seccomp.h> identifying which action to take.

  4. For example, the following BPF program implements the libseccomp filter shown above. In the snippet below, several C macros defined in the <linux/filter.h> header are used to conveniently construct the “machine code” for each BPF instruction.

       struct sock_filter filter[] = {
           // index 0:
           // load (LD) one word (W) into the accumulator from the "arch" field
               // since this is a seccomp filter, `struct seccomp_data` represents
               // the memory available to this program
            // ABS represents that we're giving a "memory address" as a fixed value
           BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, arch))),
    
           // index 1:
           // jump (JMP) if the accumulator is equal (JEQ) to the constant (K) AUDIT_ARCH_X86_64
               // if true: go ahead 1 statement more than normal (index 3)
               // if false: go ahead 0 statements more than normal (index 2)
           BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    
           // index 2:
           // return (RET) the constant (K) SECCOMP_RET_KILL_PROCESS
           BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    
           // index 3:
           // load (LD) one word (W) into accumulator from "nr" (syscall number)
           BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, nr))),
    
           // index 4:
           // jump if accumulator equal to SYS_read
               // if true +2 statements (index 7), if false +0 statements (index 5)
       BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_read, 2, 0),
    
           // index 5:
           // jump if accumulator equal to SYS_write
               // if true +1 statements (index 7), if false +0 statements (index 6)
       BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_write, 1, 0),
    
           // index 6:
           // return SECCOMP_RET_KILL_PROCESS
           BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    
           // index 7:
           // return SECCOMP_RET_ALLOW
           BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
       };
       struct sock_fprog prog = {
           .len = sizeof(filter)/sizeof(filter[0]) /* number of elements in filter */,
           .filter = filter
       };
    
  5. Given a filter, we can load it using the prctl call (using the <sys/prctl.h> header):

    prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
    
  6. In order for PR_SET_SECCOMP to work (as a non-root user), we first need to set the “no new privileges” flag for the current process, which prevents it from gaining privileges through set-user-ID programs (programs that run with elevated permissions, like sudo):

    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
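
  7. Putting these two calls in the right order and checking for errors, the manual setup might look like this sketch (it assumes prog is the struct sock_fprog built above and that <sys/prctl.h>, <linux/seccomp.h>, <stdio.h>, and <stdlib.h> are included):

    /* no_new_privs must be set before installing the filter when
       running as a non-root user; both calls can fail, so check. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) {
        perror("prctl(PR_SET_NO_NEW_PRIVS)");
        exit(1);
    }
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0) {
        perror("prctl(PR_SET_SECCOMP)");
        exit(1);
    }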
    

Preventing excess resource usage

  1. If you allow memory allocation system calls in the sandbox, then the program could allocate a very large amount of memory and use excessive system resources.

    Preventing this is not required for this assignment, but Linux provides some mechanisms to achieve it.

  2. The simplest tool Linux provides is setrlimit (in <sys/resource.h>; see also man setrlimit in a terminal). Using this, you can set a limit on the current process’s address space size (the amount of valid addresses it has) using something like:

    struct rlimit lim;
    lim.rlim_cur = 1024 * 1024 * 32;
    lim.rlim_max = 1024 * 1024 * 32;
    setrlimit(RLIMIT_AS, &lim);
    

    which sets a limit of 32 MiB.

    Note that this limit is separate for each process, so it is not a suitable way to control multiple processes together.

    setrlimit also supports setting limits on total compute time used, but this is, again, a per-process limit; a small sketch combining both limits appears at the end of this section.

  3. A more recent tool is cgroups (“control groups”). This functionality allows you to set a number of memory limits on a set of processes together. These limits apply to the total memory actually used, rather than the address space size, and account for sharing between processes and other things on the system.

    Unfortunately, cgroups require system administrator privileges to set up, but they are used by things like Docker containers.

    On portal, systemd-run can run a program in a temporary cgroup with a memory limit for you by doing something like:

    systemd-run --pty --same-dir --user --wait -p MemoryHigh=32M -p MemoryMax=40M /bin/ls
    

    The above command runs /bin/ls with a 32MiB memory usage target and a 40MiB hard limit (so it would be killed if it exceeds 40MiB).

    This probably would not work with our IPC library, because that library relies on passing file descriptors and environment variables to the program being run. That works when we run a program directly, but systemd-run contacts a server and has it run the program, and when it does so, it does not pass along our file descriptors or environment variables.
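
  4. Returning to setrlimit, here is a small sketch (with arbitrarily chosen limits) of how the worker could cap both its address space and its CPU time. Applying these limits before installing the seccomp filter means setrlimit itself does not have to be allowed by the filter.

    /* Sketch: the limits here are chosen arbitrarily for illustration. */
    #include <sys/resource.h>

    void limit_resources(void) {
        struct rlimit as_lim  = { .rlim_cur = 32 * 1024 * 1024, .rlim_max = 32 * 1024 * 1024 };
        struct rlimit cpu_lim = { .rlim_cur = 10, .rlim_max = 10 };   /* seconds of CPU time */
        setrlimit(RLIMIT_AS,  &as_lim);    /* cap the address space size */
        setrlimit(RLIMIT_CPU, &cpu_lim);   /* cap total CPU time used */
    }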