#+TITLE: Privilege drop, privilege separation, and restricted-service operating mode in OpenBSD
#+DATE: 2023-02-19

* Prologue
My main focus in OpenBSD is privilege separated network daemons running in restricted-service operating mode. I gave talks at [[https://www.bsdcan.org][BSDCan]] and [[https://fosdem.org][FOSDEM]] in the [[file:index.org::*External Writings & Presentations][past]] about how I used these techniques to write [[https://man.openbsd.org/slaacd.8][slaacd(8)]] and [[https://man.openbsd.org/unwind.8][unwind(8)]]. While I do not think of myself as a one-trick pony, I have written some more: [[https://man.openbsd.org/slowcgi.8][slowcgi(8)]], [[https://man.openbsd.org/rad.8][rad(8)]], [[https://man.openbsd.org/dhcpleased.8][dhcpleased(8)]], and [[https://github.com/fobser/gelatod][gelatod(8)]]. I also wrote the first version of what later turned into [[https://man.openbsd.org/resolvd.8][resolvd(8)]]. At one point I claimed that it would take me about a week to transmogrify one daemon into a new one.

* Why
Privilege drop, privilege separation, and restricted-service operating mode are exploit mitigations. When[fn:: not if!] an attacker finds a bug we try to stop them from causing damage. The mitigations we are talking about here are aimed at attackers that achieved arbitrary code execution. Due to other [[https://www.openbsd.org/innovations.html][mitigations]] that is quite difficult to pull off. The techniques discussed here are the last line of defence. We try to leave the attacker as few resources to play with as possible, and to crash the program as quickly as possible if an attacker touches something they are not supposed to.

* Privilege drop
Privilege drop is probably the weakest mitigation discussed in this article. It is a very old technique, but still important for set-user-ID root binaries.
Theo de Raadt [[http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sbin/ping/ping.c.diff?r1=1.6&r2=1.7][refactored]] [[https://man.openbsd.org/ping.8][ping(8)]] over 26 years ago to open a raw socket early and then drop root privileges. This prevents a local user from elevating their privileges when finding a bug in ping(8): #+begin_src diff @@ -191,6 +191,14 @@ char rspace[3 + 4 * NROUTES + 1]; /* record route space */ #endif + if (!(proto = getprotobyname("icmp"))) + errx(1, "unknown protocol icmp"); + if ((s = socket(AF_INET, SOCK_RAW, proto->p_proto)) < 0) + err(1, "socket"); + + /* revoke privs */ + setuid(getuid()); + preload = 0; datap = &outpack[8 + sizeof(struct timeval)]; while ((ch = getopt(argc, argv, "DI:LRS:c:dfh:i:l:np:qrs:T:t:vw:")) != EOF) @@ -235,6 +243,8 @@ loop = 0; break; case 'l': + if (getuid() != 0) + errx(1, "must be root to specify preload"); preload = strtol(optarg, NULL, 0); if (preload < 0) errx(1, "bad preload value: %s", optarg); @@ -323,12 +333,6 @@ *datap++ = i; ident = getpid() & 0xFFFF; - - if (!(proto = getprotobyname("icmp"))) - errx(1, "unknown protocol icmp"); - if ((s = socket(AF_INET, SOCK_RAW, proto->p_proto)) < 0) - err(1, "socket"); - hold = 1; if (options & F_SADDR) { if (IN_MULTICAST(ntohl(to->sin_addr.s_addr))) #+end_src 20 years later we realized that we would not drop privileges when ping(8) is invoked as root. We would just "drop" from root to root. 
We can protect ourselves from a malicious ping target by [[http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sbin/ping/ping.c.diff?r1=1.214&r2=1.215][dropping]] to a dedicated user: #+begin_src diff @@ -272,8 +275,12 @@ /* revoke privs */ uid = getuid(); - if (setresuid(uid, uid, uid) == -1) - err(1, "setresuid"); + if ((pw = getpwnam(PING_USER)) == NULL) + errx(1, "no %s user", PING_USER); + if (setgroups(1, &pw->pw_gid) || + setresgid(pw->pw_gid, pw->pw_gid, pw->pw_gid) || + setresuid(pw->pw_uid, pw->pw_uid, pw->pw_uid)) + err(1, "unable to revoke privs"); preload = 0; datap = &outpack[ECHOLEN + ECHOTMLEN]; #+end_src ping(8) needs a raw socket to be able to send ICMP echo request packets. This is an operation that only root is allowed to do[fn::This prevents normal users from sending arbitrary IP packets, for example with spoofed IP addresses or from privileged (<1024) source ports.]. Once that socket is open though, ping(8) no longer needs to do any other privileged operation. It can hold on to the socket for later use and drop root privileges. Another use for privilege drop is in daemons to restrict file-system access by [[https://man.openbsd.org/chroot.2][chroot(2)]]'ing to =/var/empty=. The daemon needs root privileges to call chroot(2), but afterwards it can run without elevated permissions. To the process it looks like there is only the =/= directory where it does not have any permissions. 
The standard pattern can be seen in [[https://github.com/openbsd/src/blob/master/usr.sbin/rad/frontend.c#L200][frontend.c]] of [[https://man.openbsd.org/rad.8][rad(8)]]:
#+begin_src C
	if ((pw = getpwnam(RAD_USER)) == NULL)
		fatal("getpwnam");

	if (chroot(pw->pw_dir) == -1)
		fatal("chroot");
	if (chdir("/") == -1)
		fatal("chdir(\"/\")");

	if (setgroups(1, &pw->pw_gid) ||
	    setresgid(pw->pw_gid, pw->pw_gid, pw->pw_gid) ||
	    setresuid(pw->pw_uid, pw->pw_uid, pw->pw_uid))
		fatal("can't drop privileges");
#+end_src
We first get a user with [[https://man.openbsd.org/getpwnam.3][getpwnam(3)]] to drop to. The user has =/var/empty= configured as its home directory so we can use that in chroot(2). Next we [[https://man.openbsd.org/chdir.2][chdir(2)]] to the new file-system root to have a valid current working directory. This keeps the daemon from marking the file-system it happened to be started from as busy, which would prevent unmounting that file-system while the daemon is running. We then drop privileges by putting the user into a single group using [[https://man.openbsd.org/setgroups.2][setgroups(2)]]. The calls to [[https://man.openbsd.org/setresuid.2][setresgid(2) and setresuid(2)]] set the real, effective and saved group and user IDs. This safely drops from =root:wheel= to (in this case) =_rad:_rad= with no way to escalate back to =root=. This technique is probably used in most, if not all, of OpenBSD's privilege separated daemons.

* Restricted-service operating mode
With privilege drop in ping(8) we prevent a local unprivileged user from gaining superuser or root privileges. If there were a [[https://www.freebsd.org/security/advisories/FreeBSD-SA-22:15.ping.asc][bug in message parsing]][fn::I do not want to heckle FreeBSD, it is just that it is a good illustration for what we are currently discussing. FreeBSD's ping(8) is using capsicum, so it is well locked away, too.
And it is not like I am not making any [[https://ftp.openbsd.org/pub/OpenBSD/patches/7.0/common/017_slaacd.patch.sig][mistakes]]...], a malicious ping target, or even host in the middle, could still read and exfiltrate ssh private keys. ping(8) runs as my user-id. It can read all files my user can read, it can open network connections to any host on the internet, it can execute arbitrary programs, heck it can talk to my GPU. That is a lot of power that it does not need. It only needs to write to =stdout= and =stderr=[fn::Which is usually the terminal.], and send and receive ICMP packets. We could lock ping(8) away using chroot(2), that at least takes away file-system access. But what can we do about programs that need file-system access but should not execute other programs or talk to the Internet? Like [[https://man.openbsd.org/file.1][file(1)]] for example. Decades ago Niels Provos developed [[http://man.openbsd.org/OpenBSD-5.9/man4/systrace.4][systrace(4)]] but it turned out that it was difficult to use. The only user in OpenBSD base was [[http://cvsweb.openbsd.org/src/usr.bin/ssh/Attic/sandbox-systrace.c?rev=1.18&content-type=text/x-cvsweb-markup][sshd(8)]]. In 2015 Theo de Raadt tricked Nicholas Marriott into privilege separating and sand-boxing file(1) using systrace. [[http://cvsweb.openbsd.org/src/usr.bin/file/Attic/sandbox.c][It lasted for half a year]], it was that painful. One problem with systrace(4) was that it worked on the level of syscalls and their arguments. This is not something user-land developers are intimately familiar with. We are interacting with libc and do not know what kind of syscalls libc does on our behalf. Another issue is that a program might need some [[https://man.openbsd.org/ioctl.2][ioctl(2)]]s or [[https://man.openbsd.org/sysctl.2][sysctl(2)]]s but it should not be able to do all of them. So we need to encode restrictions on arguments of syscalls. This gets unwieldy, fast. 
There was also [[http://man.openbsd.org/OpenBSD-5.9/systrace.1][systrace(1)]] to define a policy outside of the program. It turns out that most programs need to do some sort of initialization where they need wide access to the system. This is before they touch untrusted data. Once the initialization is done we can restrict access. How much we can restrict access can depend on command line flags. systrace(1) could not help with this: the program would retain all the privileges it needed for initialization. They would be fewer than all privileges, but still way too many.

As far as I know, the experience with file(1) was the last straw. Theo set out to improve on this situation by developing tame(2), which was later renamed to [[https://man.openbsd.org/pledge.2][pledge(2)]][fn::It turned out that it was difficult to use tame(2) in a sentence when presenting the concept, hence the rename to pledge(2).]. pledge(2) was developed by studying all programs in OpenBSD base and putting their needed services into categories using broad strokes like /memory management/, /read-write on open file descriptors/, /opening of files/, or /networking/. If a program violates what it pledged to do, for example by trying to open a file when it did not pledge =rpath=, it will be terminated with an uncatchable =SIGABRT=.

It is worth repeating that: If a program violates what it pledged to do it will be *terminated* by the kernel. An attacker does not get to play again and try something else[fn::Needless to say that I despise init systems that restart services when they crash.].

This was an iterative process with patches floating around. A few co-conspirators, including myself, joined the effort a bit later to add pledge(2) to more programs. Once we hit 50 or so pledged programs, Theo considered it mature enough for commit and work continued in tree. Soon after, the list of programs not pledged at all was shorter than the list of pledged programs.
This is a huge success and speaks to the usability of pledge(2). In decades we only had one program using systrace(4) in OpenBSD base, then pledge(2) shows up and in less than a year nearly all of OpenBSD base uses it. To add pledge(2) to a program we need to know what it does and potentially re-factor it to pull (hoist) one-time initialization up before pledge(2) is called for the first time. Some programs are sloppy in the sense that they open a certain resource the moment they need it, this means that they retain more access than they need. As we have seen with ping(8), if we pull opening of the raw socket before option parsing we can drop root privileges before touching untrusted data[fn::With a set-user-ID root program command line options are untrusted data!]. Since pledge(2) is internal to the program we can call it once we are done with option parsing and pledge different things depending on given options. [[https://github.com/openbsd/src/blob/master/sbin/ping/ping.c#L770][For example, ping(8) retains the ability to do DNS lookups depending on the =-n= flag]]: #+begin_src C if (options & F_HOSTNAME) { if (pledge("stdio inet dns", NULL) == -1) err(1, "pledge"); } else { if (pledge("stdio inet", NULL) == -1) err(1, "pledge"); } #+end_src pledge(2) is not fine-grained. It turns out that programs fall into broad categories of what they want to do after initialization. There are not hundreds of different promises for every obscure program, it is not needed. To add a new promise, a rule of thumb is: At least two programs have been identified that need a new promise. To add a syscall to an existing promise, that is, to give more power to an existing promise, needs careful evaluation of what all the other programs already using the promise, gain. It is not enough to show that it is fine for the new program, existing programs are much more important. Another question is how much additional kernel attack surface this exposes. 
pledge(2) does not only protect the user of the system or systems on the Internet from harm when a bug is found, it also protects the kernel from user-land. Checking if a syscall is allowed happens early, before a lot of kernel code runs. pledge(2) can be used to gain understanding on what a program does. We see the following pledges in file(1): #+begin_src shell $ cat -n file.c | fgrep 'pledge("' 171 if (pledge("stdio rpath getpw recvfd sendfd id proc", NULL) == -1) 210 if (pledge("stdio rpath sendfd", NULL) == -1) 374 if (pledge("stdio getpw recvfd id", NULL) == -1) 389 if (pledge("stdio recvfd", NULL) == -1) #+end_src The reader is encouraged to stop here and read [[https://github.com/openbsd/src/blob/master/usr.bin/file/file.c][the code]] around those pledge(2) calls to figure out what file does and why it pledges those things. ------------------------------------------------------------------------ On line 171, file(1) pledges all the things it needs. It can read-write already open file descriptors[fn::It can only read if the file descriptor was opened for reading and only write if the file descriptor was opened for writing. Since file(1) cannot open files for writing or create new files, it cannot write to disk.], open files for reading, pass file descriptors around, figure out under which user it runs and lookup another user. It can also fork itself. On line 210 it forked itself and we are in the parent process. It can shed all pledges that the forked child needs for its initialization as well as the ability to fork another instance. The parent process can now only open files for reading, pass those file descriptors to the child process and read-write already open file descriptors. On line 374 we are in the child process and we shed all pledges that the parent needed. We need to be able to read-write open file descriptors and receive file descriptors from the parent process. 
During initialization we need to know if we are running as root and be able to look up a user. file(1) can privilege drop to a dedicated user when invoked as root! Once we are done with that, the child process, which does all the magic[fn:: Pun intended.], sheds all those additional privileges and on line 389 it can only read-write existing file descriptors and receive new file descriptors from the parent process.

We see file(1) is privilege separated, running two processes. One process opens files but does not look at the contents. The other process, which is completely locked away, parses untrusted data and informs the parent process about the result. Neither process can talk to the Internet or write to disk. They cannot create new files or open existing files for writing. While file(1) can read ssh private keys, it cannot exfiltrate them.

We can find out what file(1) can and cannot do from just those four pledge lines. That is very powerful when starting a code review. The pledge lines can also be used to get a quick overview of how file(1) operates: the code adjacent to the pledge(2) calls is where the interesting stuff happens.

* Privilege separation
A single process design that pledges ~"stdio inet rpath"~ still has a lot of attack surface. This is not good if it is a network daemon running as root and enabled per default on all installations. Like [[https://man.openbsd.org/dhcpleased.8][dhcpleased(8)]] for example. For starters it can read and exfiltrate ssh private keys.

As we have seen with file(1), we can split up a program into multiple communicating processes that each pledge fewer operations than the sum of all pledges. We can move the risky operations of parsing untrusted data to a process that has access to neither the Internet nor the file-system. That process will also not have any elevated privileges to change the system configuration like configuring network addresses or changing the routing table.
An attacker who finds a loophole into this least privileged process will have a hard time creating havoc. They can only talk to more privileged processes using a very narrow communication channel with easy to parse[fn::When using the imsg framework, it is not even parsing, nor data marshalling. It is raw C structs of a fixed, and known at compile time, size.] messages. * OpenBSD network daemons We will look in detail at how dhcpleased(8) uses privilege separation and pledge(2) to implement a DHCP client in a safe way. Other network daemons follow a similar pattern and the reader should be able to study the source code of rad(8), slaacd(8), and unwind(8) to find out how those daemons work since they share a common ancestry. ** Overview dhcpleased(8) uses three communicating processes to implement privilege separation: /parent/, /engine/, and /frontend/. The /parent/ process retains its root privileges to make changes to the system like configuring IP addresses. This process must be protected from the outside world and we make sure that it does not interact with untrusted data. The /frontend/ process interacts with the outside world, for example by sending and receiving DHCP messages. But it does not parse DHCP messages it receives because that is untrusted data. It is the /engine/'s job to parse DHCP messages and implement the state machine for the DHCP protocol. The /engine/ process is completely locked away. dhcpleased(8) has a [[https://man.openbsd.org/dhcpleased.conf.5][configuration file]], following the typical OpenBSD syntax. For example, to send a custom vendor-class option in a ~DHCPDISCOVER~ or ~DHCPREQUEST~ packet, the configuration looks like this: #+begin_src interface vio0 { send vendor class id "foobar" } #+end_src Most configuration files on OpenBSD use this kind of syntax, from [[https://man.openbsd.org/cwmrc.5][cwmrc(5)]] to [[https://man.openbsd.org/pf.conf.5][pf.conf(5)]]. 
It also comes with a control program, [[https://man.openbsd.org/dhcpleasectl.8][dhcpleasectl(8)]], to interact with the running daemon. Many OpenBSD daemons provide a similar tool to interact with them.

These are the features that come for free when using an existing OpenBSD network daemon as a template to write a new one: It will be privilege separated, with pledge(2) as a guide on how functionality should be split up between the processes. It will have a configuration file that uses a syntax that people familiar with OpenBSD will find easy to use. There is a control program to interact with the running daemon. And finally there is a logging framework that handles logging to syslog or =stderr=. All this scaffolding and tooling is already there, we just need to swap out the code that is specific to the old daemon and replace it with something new.

** Initialization
dhcpleased(8) comes to life in [[https://github.com/openbsd/src/blob/3c46ceeaef274bbef234dac63245c4b6567168d7/sbin/dhcpleased/dhcpleased.c#L144][main() in dhcpleased.c]]. After a bit of housekeeping and argument parsing it ends up [[https://github.com/openbsd/src/blob/3c46ceeaef274bbef234dac63245c4b6567168d7/sbin/dhcpleased/dhcpleased.c#L240][creating communication channels]] to talk to the /frontend/ and /engine/ processes and starts those two child processes:
#+begin_src C
	if (socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC | SOCK_NONBLOCK,
	    PF_UNSPEC, pipe_main2frontend) == -1)
		fatal("main2frontend socketpair");
	if (socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC | SOCK_NONBLOCK,
	    PF_UNSPEC, pipe_main2engine) == -1)
		fatal("main2engine socketpair");

	/* Start children.
*/ engine_pid = start_child(PROC_ENGINE, saved_argv0, pipe_main2engine[1], debug, verbose); frontend_pid = start_child(PROC_FRONTEND, saved_argv0, pipe_main2frontend[1], debug, verbose); #+end_src ~start​_child()~ [[https://github.com/openbsd/src/blob/3c46ceeaef274bbef234dac63245c4b6567168d7/sbin/dhcpleased/dhcpleased.c#L401][ensures]] that after forking the file descriptor the child process can use to talk to the parent process has number three: #+begin_src C if (fd != 3) { if (dup2(fd, 3) == -1) #+end_src It then [[https://github.com/openbsd/src/blob/3c46ceeaef274bbef234dac63245c4b6567168d7/sbin/dhcpleased/dhcpleased.c#L407][sets up]] an =argv= array for [[https://man.openbsd.org/execv.3][execvp(3)]] to re-exec itself. The flags =-E= and =-F= control if the child process runs as /frontend/ or /engine/ process: #+begin_src C argv[argc++] = argv0; switch (p) { case PROC_MAIN: fatalx("Can not start main process"); case PROC_ENGINE: argv[argc++] = "-E"; break; case PROC_FRONTEND: argv[argc++] = "-F"; break; } if (debug) argv[argc++] = "-d"; if (verbose) argv[argc++] = "-v"; if (verbose > 1) argv[argc++] = "-v"; argv[argc++] = NULL; execvp(argv0, argv); #+end_src We used to only fork child processes, which is good enough for privilege separation. [[https://github.com/openbsd/src/commit/13ff36d2c36132325d9cc409c0621ef948f1e2e3][It then occurred to us that the child process will have the same memory layout and use the same stack protector cookies.]] Using fork & exec ensures that the child processes get a different memory layout. If there is an information leak in one process it cannot be used by an attacker to find gadgets in a different, potentially more privileged process. 
Going back to the main function, [[https://github.com/openbsd/src/blob/3c46ceeaef274bbef234dac63245c4b6567168d7/sbin/dhcpleased/dhcpleased.c#L200][after option parsing]] we know if we are still in the parent process or in the /engine/ or /frontend/ process:
#+begin_src C
	if (engine_flag)
		engine(debug, verbose);
	else if (frontend_flag)
		frontend(debug, verbose);
#+end_src
The ~engine()~ and ~frontend()~ functions live in [[https://github.com/openbsd/src/blob/3c46ceeaef274bbef234dac63245c4b6567168d7/sbin/dhcpleased/engine.c#L177][engine.c]] and [[https://github.com/openbsd/src/blob/3c46ceeaef274bbef234dac63245c4b6567168d7/sbin/dhcpleased/frontend.c#L131][frontend.c]] respectively. Neither returns to the =main()= function.

In the initialization of the child processes we drop privileges and pledge what the process needs to run. The /engine/ process pledges =stdio recvfd= and the /frontend/ process =stdio unix recvfd route=. We then set up the [[https://github.com/openbsd/src/blob/3c46ceeaef274bbef234dac63245c4b6567168d7/sbin/dhcpleased/frontend.c#L180][communication channel]] to the /parent/ (also known as the /main/) process:
#+begin_src C
	imsg_init(&iev_main->ibuf, 3);
	iev_main->handler = frontend_dispatch_main;
#+end_src
As mentioned before we force the file descriptor to number 3 in ~start​_child()~ so the child process knows how it can reach the /parent/ process. We are using [[https://man.openbsd.org/event_init.3][libevent]] to call functions when an event happens on a file descriptor. Here ~frontend​_dispatch​_main~ is called when we receive a message from the /parent/ process. There is also a function ~frontend​_dispatch​_engine~ for messages from the /engine/ process. The naming scheme is ~RECEIVINGPROCESS​_dispatch​_SENDINGPROCESS~ and there are corresponding functions in =engine.c= and =dhcpleased.c= to have a full mesh of communication channels between all three processes.
Now that we have started the child processes and hooked up the communication channels between /parent/ and /frontend/ as well as /parent/ and /engine/, it is time to send our first message from the /parent/ to both children. The child processes pledged =recvfd=, which allows them to receive open file descriptors over an existing open file descriptor. The /parent/ process calls [[https://github.com/openbsd/src/blob/3c46ceeaef274bbef234dac63245c4b6567168d7/sbin/dhcpleased/dhcpleased.c#L292][main​_imsg​_send​_ipc​_sockets()]] to create another socket pair and pass the end points to /engine/ and /frontend/ to create a full mesh. The file descriptor is received by [[https://github.com/openbsd/src/blob/3c46ceeaef274bbef234dac63245c4b6567168d7/sbin/dhcpleased/engine.c#L387][engine​_dispatch​_main()]] and [[https://github.com/openbsd/src/blob/3c46ceeaef274bbef234dac63245c4b6567168d7/sbin/dhcpleased/frontend.c#L232][frontend​_dispatch​_main()]] using a message type of =IMSG​_SOCKET​_IPC=. After receiving this file descriptor, /engine/ can [[https://github.com/openbsd/src/blob/3c46ceeaef274bbef234dac63245c4b6567168d7/sbin/dhcpleased/engine.c#L446][drop]] the =recvfd= pledge and only pledges =stdio=. It no longer expects any more file descriptors. The start-up of /frontend/ is a bit more complicated. It needs to receive a route socket from the /parent/ process to learn of interfaces gaining or losing the =autoconf= flag during runtime. 
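What =recvfd= permits can be illustrated without the imsg framework. The following is a minimal sketch of descriptor passing over a UNIX-domain socket using =SCM_RIGHTS= control messages. ~send​_fd()~ and ~recv​_fd()~ are made-up names for this illustration; dhcpleased(8) itself goes through the imsg functions rather than calling sendmsg(2) and recvmsg(2) directly.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>

/* Hypothetical helper: pass the open descriptor fd over the socket sock. */
int
send_fd(int sock, int fd)
{
	struct msghdr msg;
	struct cmsghdr *cmsg;
	struct iovec iov;
	union {
		struct cmsghdr hdr;	/* forces correct alignment */
		char buf[CMSG_SPACE(sizeof(int))];
	} cmsgbuf;
	char byte = 0;

	memset(&msg, 0, sizeof(msg));
	memset(&cmsgbuf, 0, sizeof(cmsgbuf));
	/* At least one byte of ordinary data has to accompany the fd. */
	iov.iov_base = &byte;
	iov.iov_len = sizeof(byte);
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = cmsgbuf.buf;
	msg.msg_controllen = sizeof(cmsgbuf.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(fd));

	return sendmsg(sock, &msg, 0) == -1 ? -1 : 0;
}

/* Hypothetical helper: returns the received descriptor, or -1 on error. */
int
recv_fd(int sock)
{
	struct msghdr msg;
	struct cmsghdr *cmsg;
	struct iovec iov;
	union {
		struct cmsghdr hdr;
		char buf[CMSG_SPACE(sizeof(int))];
	} cmsgbuf;
	char byte;
	int fd;

	memset(&msg, 0, sizeof(msg));
	iov.iov_base = &byte;
	iov.iov_len = sizeof(byte);
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = cmsgbuf.buf;
	msg.msg_controllen = sizeof(cmsgbuf.buf);

	if (recvmsg(sock, &msg, 0) <= 0)
		return -1;

	/* Accept exactly one descriptor; anything else is an error. */
	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
	return fd;
}
```

The receiver ends up with its own descriptor number that refers to the same open file, which is why a process that has given up every way of opening things itself can still be handed a socket by a more privileged peer.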
Once it has received the =route= socket [[https://github.com/openbsd/src/blob/3c46ceeaef274bbef234dac63245c4b6567168d7/sbin/dhcpleased/frontend.c#L329][from the parent process]] it can get a list of all interfaces that already have the =autoconf= flag and drop the =route= pledge in [[https://github.com/openbsd/src/blob/3c46ceeaef274bbef234dac63245c4b6567168d7/sbin/dhcpleased/frontend.c#L650][frontend​_startup()]] afterwards:
#+begin_src C
void
frontend_startup(void)
{
	if (!event_initialized(&ev_route))
		fatalx("%s: did not receive a route socket from the main "
		    "process", __func__);

	init_ifaces();
	if (pledge("stdio unix recvfd", NULL) == -1)
		fatal("pledge");
	event_add(&ev_route, NULL);
}
#+end_src
It still needs to hold on to =unix= for communication with dhcpleasectl(8) and =recvfd= for receiving [[https://man.openbsd.org/bpf.4][bpf(4)]] sockets when new interfaces are set to =autoconf= with ifconfig(8).

The astute reader will notice that we have not talked about pledging the /parent/ process. Unfortunately that is not possible because there is no pledge that would allow opening and programming a new bpf(4) socket and we need to create a new one when an interface is set to =autoconf= while dhcpleased(8) is already running.

However, not all is lost. The parent process, which has to keep running as root, is not touching any untrusted data. An attacker needs to go through the /frontend/ or /engine/ process to gain a foothold in the /parent/ process. And they need to do this via the =main​_dispatch​_frontend= and =main​_dispatch​_engine= functions. We will look at those in a bit to see why that is a very difficult proposition.

There is also a second layer of protection.
We can restrict the amount of havoc an attacker can cause if they ever get all the way to the /parent/ process using [[https://man.openbsd.org/unveil.2][unveil(2)]]: #+begin_src C if (unveil(conffile, "r") == -1) fatal("unveil %s", conffile); if (unveil("/dev/bpf", "rw") == -1) fatal("unveil /dev/bpf"); if (unveil(_PATH_LEASE, "rwc") == -1) { no_lease_files = 1; log_warn("disabling lease files, unveil " _PATH_LEASE); } if (unveil(NULL, NULL) == -1) fatal("unveil"); #+end_src With unveil(2) a process can restrict its view of the file-system. The first call to unveil(2) removes access to the entire file-system from the process, except for the first argument of unveil(2). The second argument specifies the permission the program requests for the path: "=r=" means read, "=w=" write and "=c=" create. Subsequent calls unveil more parts of the file-system. Since we cannot use pledge(2) in the /parent/ process, we need to somehow lock-in the list of unveiled paths so that an attacker cannot simply add more. ~unveil(NULL, NULL);~ does that, any subsequent calls to unveil(2) will fail. It turns out that the parent process needs very little access to the file system. It needs access to the dhcpleased(8) config file, the bpf(4) device and the directory where lease files are stored. That list of files and directory does not include access to ssh private keys. ** dispatching As we said the /parent/ process is not touching any untrusted data, that is left to the /frontend/ and /engine/ process. The /frontend/ and /engine/ processes send imsg messages to the /parent/ process. Let's have a look at how those messages arrive and what the /parent/ process does with them. Messages from the /frontend/ process arrive at [[https://github.com/openbsd/src/blob/3c46ceeaef274bbef234dac63245c4b6567168d7/sbin/dhcpleased/dhcpleased.c#L431][main​_dispatch​_frontend()]] in dhcpleased.c: #+begin_src C void main_dispatch_frontend(int fd, short event, void *bula) { // [...] 
uint32_t if_index; int verbose; // [...] for (;;) { if ((n = imsg_get(ibuf, &imsg)) == -1) fatal("imsg_get"); if (n == 0) /* No more messages. */ break; switch (imsg.hdr.type) { case IMSG_OPEN_BPFSOCK: if (IMSG_DATA_SIZE(imsg) != sizeof(if_index)) fatalx("%s: IMSG_OPEN_BPFSOCK wrong length: " "%lu", __func__, IMSG_DATA_SIZE(imsg)); memcpy(&if_index, imsg.data, sizeof(if_index)); open_bpfsock(if_index); break; case IMSG_CTL_RELOAD: if (main_reload() == -1) log_warnx("configuration reload failed"); else log_warnx("configuration reloaded"); break; case IMSG_CTL_LOG_VERBOSE: if (IMSG_DATA_SIZE(imsg) != sizeof(verbose)) fatalx("%s: IMSG_CTL_LOG_VERBOSE wrong length: " "%lu", __func__, IMSG_DATA_SIZE(imsg)); memcpy(&verbose, imsg.data, sizeof(verbose)); log_setverbose(verbose); break; case IMSG_UPDATE_IF: if (IMSG_DATA_SIZE(imsg) != sizeof(imsg_ifinfo)) fatalx("%s: IMSG_UPDATE_IF wrong length: %lu", __func__, IMSG_DATA_SIZE(imsg)); memcpy(&imsg_ifinfo, imsg.data, sizeof(imsg_ifinfo)); read_lease_file(&imsg_ifinfo); main_imsg_compose_engine(IMSG_UPDATE_IF, -1, &imsg_ifinfo, sizeof(imsg_ifinfo)); break; default: log_debug("%s: error handling imsg %d", __func__, imsg.hdr.type); break; } imsg_free(&imsg); } // [...] } #+end_src We see that it handles four message types =IMSG​_OPEN​_BPFSOCK=, =IMSG​_CTL​_RELOAD=, =IMSG​_CTL​_LOG​_VERBOSE= and =IMSG​_UPDATE​_IF=. One of them, =IMSG​_CTL​_RELOAD=, does not have any payload data, the /parent/ process just performs a predefined action. The /frontend/ process cannot influence how the action is performed, only when it is performed. For the other three, the /frontend/ process sends a piece of data that influences how the action is performed. The piece of data has a fixed, known at compile time, length. For example =IMSG​_OPEN​_BPFSOCK= is sent by the /frontend/ process when it learns of a new network interface gaining the /autoconf/ flag. 
It needs a new bpf(4) socket to send and receive DHCP messages on that interface[fn::bpf(4) sockets are bound to a specific interface and then locked, meaning the /frontend/ process will not be able to change anything about the socket. That also means that the /frontend/ process cannot send and receive arbitrary messages but only those that match the bpf filter.]. It sends a single =uint32_t= that uniquely identifies the network interface in the system. If it sends something else, we assume the /frontend/ process has been compromised and is no longer trusted. The /parent/ process terminates itself:
#+begin_src C
	if (IMSG_DATA_SIZE(imsg) != sizeof(if_index))
		fatalx("%s: IMSG_OPEN_BPFSOCK wrong length: "
		    "%lu", __func__, IMSG_DATA_SIZE(imsg));
#+end_src
We are using the same pattern of checking the data size for all the other message types sent from the /frontend/ process. The same is true for how [[https://github.com/openbsd/src/blob/3c46ceeaef274bbef234dac63245c4b6567168d7/sbin/dhcpleased/dhcpleased.c#L514][main​_dispatch​_engine()]] handles the messages sent by the /engine/ process.

The /frontend/ process sends the =IMSG​_OPEN​_BPFSOCK= message using [[https://github.com/openbsd/src/blob/3c46ceeaef274bbef234dac63245c4b6567168d7/sbin/dhcpleased/frontend.c#L632][frontend​_imsg​_compose​_main()]] in =frontend.c=. The sending function names follow a pattern of ~SENDINGPROCESS​_imsg​_compose​_RECEIVINGPROCESS~ and they are each a thin wrapper around ~imsg​_compose​_event()~ defined in =dhcpleased.c=, which in turn is a thin wrapper around [[https://man.openbsd.org/imsg_init.3][imsg​_compose(3)]].

A good way to understand what dhcpleased(8) is doing is to search for where all the [[https://github.com/openbsd/src/blob/3c46ceeaef274bbef234dac63245c4b6567168d7/sbin/dhcpleased/dhcpleased.h#L194][imsg​_types]] are sent and received.

** Configuration file, control process and logging framework
We will not go into detail on how these features are implemented.
They are not that relevant to the topic of privilege separation and the author is not an expert in LR(1) grammars or [[https://man.openbsd.org/yacc.1][yacc(1)]]. We will give some pointers to help interested readers get started on their own.

The parsed configuration file is stored and passed around in a ~struct dhcpleased​_conf~ structure. The entry point into the parser is ~parse​_config()~ in [[https://github.com/openbsd/src/blob/3c46ceeaef274bbef234dac63245c4b6567168d7/sbin/dhcpleased/parse.y#L722][parse.y]]. =parse.y= contains a hand-written lexer as well as the yacc(1) grammar. Changing or adding to the grammar of the configuration file is done in three places:

1. New tokens are added to the grammar at the beginning of the file using the =%token= keyword. Tokens are in all caps.
2. Rules that use the defined tokens are added to the grammar.
3. The lexer is taught about new tokens by adding them to the ~keywords~ array in the [[https://github.com/openbsd/src/blob/3c46ceeaef274bbef234dac63245c4b6567168d7/sbin/dhcpleased/parse.y#L357][lookup()]] function.

Passing the =-nv= flags to dhcpleased(8) runs a configuration test and prints out the canonical form of the parsed configuration file. The code for this lives in [[https://github.com/openbsd/src/blob/3c46ceeaef274bbef234dac63245c4b6567168d7/sbin/dhcpleased/printconf.c][printconf.c]]. Care should be taken to print a valid configuration, i.e. one that can be passed to dhcpleased(8) again.

The source code for the control process can be found in [[https://github.com/openbsd/src/blob/3c46ceeaef274bbef234dac63245c4b6567168d7/usr.sbin/dhcpleasectl/dhcpleasectl.c][usr.sbin/dhcpleasectl/dhcpleasectl.c]]. Communication is done using [[https://man.openbsd.org/unix.4][UNIX-domain sockets]] and connections are accepted by the /frontend/ process.
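
The client side of such a connection can be sketched as follows. This is only a minimal illustration of connecting to a UNIX-domain stream socket, not the code dhcpleasectl(8) actually uses: the helper names ~ctl_sockaddr()~ and ~ctl_connect()~ are made up for this example, and the socket path is left to the caller.

#+begin_src C
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>

#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: fill in a UNIX-domain socket address for path.
 * Returns 0 on success, -1 (errno = ENAMETOOLONG) if path does not fit.
 */
int
ctl_sockaddr(struct sockaddr_un *addr, const char *path)
{
	size_t len = strlen(path);

	if (len >= sizeof(addr->sun_path)) {
		errno = ENAMETOOLONG;
		return (-1);
	}
	memset(addr, 0, sizeof(*addr));
	addr->sun_family = AF_UNIX;
	memcpy(addr->sun_path, path, len + 1);	/* copies the NUL as well */
	return (0);
}

/*
 * Hypothetical helper: connect to a control socket at path.
 * Returns a connected file descriptor, or -1 with errno set.
 */
int
ctl_connect(const char *path)
{
	struct sockaddr_un addr;
	int fd, save_errno;

	if (ctl_sockaddr(&addr, path) == -1)
		return (-1);
	if ((fd = socket(AF_UNIX, SOCK_STREAM, 0)) == -1)
		return (-1);
	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
		save_errno = errno;	/* close() may overwrite errno */
		close(fd);
		errno = save_errno;
		return (-1);
	}
	return (fd);
}
#+end_src

On failure both helpers leave =errno= set, so a caller can report the problem with err(3) and exit.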
Communication uses imsgs, and messages from the control process are handled by [[https://github.com/openbsd/src/blob/3c46ceeaef274bbef234dac63245c4b6567168d7/sbin/dhcpleased/control.c#L221][control​_dispatch​_imsg()]] in =control.c=. The main difference from the dispatch functions used between the dhcpleased(8) processes is that messages with invalid data sizes are ignored instead of terminating the daemon. This is done because everyone in the =wheel= group can send messages to the daemon and we do not want them to be able to make the daemon crash[fn::Or even everyone on the system in case of unwind(8).]:

#+begin_src C
	case IMSG_CTL_LOG_VERBOSE:
		if (IMSG_DATA_SIZE(imsg) != sizeof(verbose))
			break;
#+end_src

Here dhcpleased(8) expects a single integer to set the verbosity of the daemon. If we get something of a different size, we just ignore the message using ~break~.

Finally, the network daemons come with a logging framework that handles logging to syslog, or to =stderr= when the daemon runs in the foreground using =-d=. The code for this can be found in [[https://github.com/openbsd/src/blob/3c46ceeaef274bbef234dac63245c4b6567168d7/sbin/dhcpleased/log.c][log.c]].

* Epilogue

Writing software in C with security in mind can be a lot of fun when standing on the shoulders of giants and having things like privilege separation and restricted-service operating mode in your toolbox.

I wrote two daemons, dhcpleased(8) and slaacd(8), that are enabled by default on every OpenBSD installation. We might eventually add a third one. Having the mitigations shown here, as well as all the other mitigations constantly being added and enabled by default, is what lets me sleep at night. I am cautiously optimistic that when a bug is found in dhcpleased(8) or slaacd(8), an attacker will have a hard time pivoting to arbitrary code execution as root.