Discussing Exploitation and Priv Escalation - Analysis of gVisor exploit

Written by Brandon Lum 05 Dec 2018

In this blog post, we take a look at the work that Max Justicz wrote about in his post “Privilege Escalation in gVisor, Google’s Container Sandbox”. We discuss the threat model of gVisor, what the exploit provides, and the differences between the LibOS security model between gVisor and Nabla containers. Finally, we explore some other threats.

Exploitation 102: Attacker capabilities

To understand a few concepts that we would like to talk about, let us go into a quick primer on how exploits are written. There are multiple reasons why an attacker would like to exploit an application. The most common reasons are that they want to leak some data from the application, or they want to be able to execute arbitrary code in the application (code exec).

Let’s assume that we are exploiting an application (this could be something like redis or nginx) and we want to achieve arbitrary code execution (code exec).

Starting with our goal, we want to describe how an attacker derives an exploit for arbitrary code execution. How do we get arbitrary code execution? One way of doing that is by hijacking the control flow of the application. This can be done via various methods, like overwriting function pointers, overwriting an entry in the .got (Global Offset Table entries contain function pointers of dynamically linked libraries), or taking advantage of architecture specific behaviors (like overwriting the return address in the stack used by the ret instruction), etc.

The main way of figuring out how to do this is by reverse engineering the application binary and finding bugs in the application. For example, a failure to check bounds may lead to a leak of stack data or heap data (Heartbleed). And a use-after-free bug in memory allocation may be able to provide the ability to write to an arbitrary memory address.

These type of bugs are too varied to identify and classify, which is why one way that we may think about them as attacker capabilities - what a bug gives us. For example, a use-after-free bug may provide us with an arbitrary write to any memory address, and the ability to leak some address information, we call a leak.

These usually give us the ability to exploit more bugs - by learning about the location of data structures and code pages, or by triggering code paths originally not reachable. For example, a leak counters Address Space Layout Randomization (ASLR) by showing us the significant bits of the randomized address space. An off-by-one error could allow us to write into adjacent data structures, giving us the ability to trigger other bugs (i.e. set an integer to a value it should never be).

In general, with attacker capabilities to leak and arbitrary write, it is almost a guarantee that we can get the capability to code exec. However, sometimes, we may only get limited versions of these capabilities. For example, instead of arbitrary read/write, we could only have limited read/write to a certain memory region.

Other examples of these are unreliable read/writes (does not always happen, depending on some randomness of the program), or read/writes that can include randomness or proximity. However, these type of limited capabilities can be overcome with additional work - performing actions more than once (rowhammer) or using certain techniques (i.e. heap spraying, nop slides).

Explaining “Privilege Escalation in gVisor”

+--------+ +--------+   +---------+
|  p1a   | |  p1b   |   |   p2a   |                      APP
+--------+ +--------+   +---------+
+-------------------+   +-------------------+
|      gVisor       |   |      gVisor       |            gVisor
+-------------------+   +-------------------+
+--------------------------------------------------+
|                    kernel                        |     KERNEL
+--------------------------------------------------+

Before we jump into the exploit, let us look at how gVisor is used. Instances of gVisor can be created independent of each other with a shared kernel. For each gVisor instance, there can be one or more application processes running on them (i.e. p1a, p1b).

Now, let’s take a look at Max’s work with gVisor. In his blog, he explains the bug that he has found in the gVisor implementation of the shmctl syscall, and an example of how he is able to perform targeted writes to a seperate process using the same gVisor userspace kernel.

The exploit assumes the following attacker capabilities: Code exec in an application process. This can be obtained via exploitation of the process p1a (i.e. redis/nginx). We annotate the diagram with attacker capabilities assumed in parentheses.

+--------+ +--------+   +---------+
|  p1a   | |  p1b   |   |   p2a   |                      APP
| (code) | |        |   |         |
| (exec) | |        |   |         |
+--------+ +--------+   +---------+
+-------------------+   +-------------------+
|      gVisor       |   |      gVisor       |            gVisor
+-------------------+   +-------------------+
+--------------------------------------------------+
|                    kernel                        |     KERNEL
+--------------------------------------------------+

The exploit abuses the shmctl bug to acheive the attacker capabilities to perform limited write to select memory region of which gVisor has access to. Max then uses this limited write capability to show that he has obtained capability of random writes to the adjacent application process p1b, by attempting to write ‘A’s (0x41) to the specific region/pages exposed by the bug.

    +---WRITE--+
    |          |
    |          V
+--------+ +--------+   +---------+
|  p1a   | |  p1b   |   |   p2a   |                      APP
| (code) | |(random)|   |         |
| (exec) | |( write)|   |         |
+--------+ +--------+   +---------+
+-------------------+   +-------------------+
|      gVisor       |   |      gVisor       |            gVisor
|  (limited write)  |   |                   |
+-------------------+   +-------------------+
+--------------------------------------------------+
|                    kernel                        |     KERNEL
+--------------------------------------------------+

Safe for now

This proof of concept has shown random writes into an adjacent process on the same instance of gVisor. However, we note that to get the most out of gVisor isolation, one would want to have each process be on seperate gVisor instances. Thus, there is no reason to panic today, based on this proof of concept alone.

Other implications

This is one of many direction that Max could have taken to build up his exploit - by showing random write into a adjacent application sharing the same gVisor instance. However, he mentions that for the limited write region in this bug:

The backing memory is then reclaimed and handed to another (potentially more privileged) process.

This may perhaps lead to possibilities to obtain more attacker capabilities in the gVisor instance - maybe even code exec? We’ll discuss this next.

Exploring differences in userspace kernel security model

In this section, we will outline some of the ideas that are brought up with the security that a userspace kernel provides. We will then see how some of those ideas relate to gVisor and Nabla.

What does a userspace kernel provide?

Before we delve deeper into specifics of gVisor and Nabla, let us first discuss what a userspace kernel implementation is used, and its goals.

A userspace kernel typically takes functionality out of the privileged host kernel and implements it in userspace. An example of this is taking the TCP/IP stack and implementing it in userspace. The userspace implementation then calls into the host kernel when it needs to talk to hardware via writing frames to a layer 2 TAP device.

In terms of security that a userspace kernel provides, there are several aspects that can be discussed:

Reduced Attack Surface: By moving functionality from the privileged host kernel to userspace, it reduces the amount of code executed in the kernel¹, thus reducing the attack surface of the privileged host kernel. (i.e. a bug in the TCP/IP stack that can be exploited for code exec would not be able to gain host privileges as the bug was in a userspace kernel instead of the host kernel).
Guarding host kernel calls: By using a userspace kernel in a pattern such that it sits inbetween the application and the host kernel, the userspace kernel acts as a guard that restricts the arguments of a syscall to the host kernel. One example to think of such an effect is the use of a TCP socket interface in the userspace kernel. Since the packets/frames are generated from the userspace kernel and not from the application code, all packets/frames sent to the host kernel should not be malformed (unless an attacker is able to trick the userspace kernel, i.e. Max’s work).
Safer kernel implementation: Through the use of safer languages, formal verification, and other techniques in the development of a userspace kernel, it is possible to have better security of the functions being implemented by the userspace kernel on behalf of the host kernel. Performing these techniques on existing commonly used monolithic host kernels like Linux usually require a lot more work due to several factors (working around an existing design, code base, etc.)

In this section, we will discuss the implications of (1) and (2) for gVisor and Nabla. We will discuss (3) in the following section.

The gVisor model

Let’s dive a little deeper into a hypothetical scenario where an attacker obtains code exec capabilities in gvisor. So assuming that we are able to exploit and get code exec in gVisor, what protections do we get in terms of isolation?

Syscalls are the way that an application interfaces with gVisor, and we can view it as an estimation of the attack surface² from the application to gVisor. We’ve seen here that the exposure of 300+ syscalls consists of shmctl, which allowed the application process to trigger a bug in gVisor.

So assuming that we’ve obtained code exec in gVisor, what is our exposure? We looked at the seccomp policy of gVisor, and it is assuring to see that the seccomp policy only allows 83 of the 300+ syscalls available. Therefore, given an exploit in the gVisor kernel, the attack surface is much reduced compared to regular containers.

+--------+              +---------+
|  p1a   |              |   p2a   |                   APP
+--------+              +---------+
<<< 300 syscalls  >>>   <<< 300 syscalls  >>>
+-------------------+   +-------------------+
|      gVisor       |   |      gVisor       |         USERSPACE KERNEL
|   (code exec)     |   |                   |
+-------------------+   +-------------------+
<<<  83 syscalls  >>>   <<<  83 syscalls  >>>    <--- Attack surface
+--------------------------------------------------+
|                    kernel                        |  KERNEL
+--------------------------------------------------+

Based on this, in order to break isolation between gVisor instances, an attacker would need to obtain code exec or something similar in the kernel via the 83 syscalls allowed. The reduction of allowed syscalls to the host from 300+ to 83 shows its (1) Reduced Attack Surface.

In addition, gVisor intercepts the syscalls from the application via a ptrace, preventing users from directly invoking host syscalls. Therefore, gVisor provides (2) Guarding of host kernel calls in this way. This is shown in the diagram by the additional 300 syscalls interface between the application process and gVisor.

Another model: LibOS with Nabla

Let’s compare the Nabla containers model with that of gVisor. The components are very similar, gVisor is a userspace kernel, and Nabla has a Library OS (LibOS), which is a userspace kernel linked as a library. This LibOS does a similar function - implementing the functionality of the kernel in userspace and only calling the host kernel when necessary.

In terms of (1) Reduced Attack Surface, Nabla provides a reduced attack surface of only 7 of the 300+ syscalls from the use of the solo5 interface. We enforce a seccomp policy that uses only 7 syscalls, with only two file descriptors (a block device and tap device).

However, in terms of (2) Guarding of host kernel calls, we currently do not force the application to make syscalls via the userspace kernel (this is depicted in the diagram by the lack of explicit segregation between the application and the userspace kernel). Because the available syscalls and access to file descriptors are already so limited, doing so may not net us that much benefit. However, if the cost to do so is low, it should be done!

+--------+              +---------+
|  p1a   |              |   p2a   |                   APP
+-------------------+   +-------------------+
|    nabla LibOS    |   |   nabla LibOS     |         USERSPACE KERNEL
|    (code exec)    |   |                   |
+-------------------+   +-------------------+
<<<   7 syscalls  >>>   <<<   7 syscalls  >>>    <--- Attack surface
+--------------------------------------------------+
|                    kernel                        |  KERNEL
+--------------------------------------------------+

To summarize, here’s a side-by-side comparison of the models:

      gVisor                   NABLA
      ------                   -----

+--------+
|  p1a   |                                            APP
+--------+              +---------+
<<< 300 syscalls  >>>   |   p2a   |              <--- Guard Calls (2)
+-------------------+   +-------------------+
|      gVisor       |   |   nabla LibOS     |         USERSPACE KERNEL
|   (code exec)     |   |                   |
+-------------------+   +-------------------+
<<<  83 syscalls  >>>   <<<   7 syscalls  >>>    <--- Atk. surface (1)
+--------------------------------------------------+
|                    kernel                        |  KERNEL
+--------------------------------------------------+

But it’s a memory safe langauge!

We will now talk about (3) Safer kernel implementation.

Some argue that gVisor is written in golang, and therefore memory safe. Golang is safer, but is not immune to the same class of bugs that we’ve discussed in this post.

I believe that golang is a safer language from the programming language standpoint (i.e. the common design pattern doesn’t allow writing to arbitrary pointers and performing malloc/free operations), but it is hard to guarantee that the implementation is true to the language semantics. Often, bugs in implementation or performance trade-offs (this is rampant in the hardware work with Spectre variant bugs) are the result of this.

As a bonus, here are two implementations of memory corruption POCs - one from an implementation bug, and one from a performance design standpoint.

P.S. I personally would love to see a formally verified userspace kernel!

Implementation bug

The security research and CTF group that I am part of, Plaid Parliament of Pwning (PPP), worked on showing a memory corruption attack leading to code exec in the golang playground. You may view the blog post by Alex Reece here.

Design bug

The bug here is due a data race of some data types using multiword values (i.e. a type is represented by multiple words in byte) and a race in the garbage collector.

The vulnerability is detailed on STALKR’s blog post “Golang data races to break memory safety” and discussion on the design and trade-offs are talked about in Russ Cox’s blogpost.

What’s next?

In conclusion, we’ve observed that gVisor and Nabla draw several similarities in the threat model, and yet they focus on different methods of providing security - maximize attack surface reduction (Nabla) vs guarding host syscalls and a safer implementation (gVisor). We would like to find ways to quantify the two additional security ideas from gvisor (guarding host syscalls and safer implementations), to include in our isolation metric.

As discussed, we think the guarding of host kernel calls provides security value add for Nabla, but we hypothesize that it may not add much to the security of the very limited attack surface. However, if it is possible to do this with little trade-off, it would be something that we would like to do in the future!

We hope to see more interesting proof of concepts like Max’s to help us continue to fine tune our isolation threat model!

Nabla Containers