The choices we make: Impact of using host filesystem interface for secure containers

The common theme behind all the secure container runtimes–Kata, gVisor and Nabla–is to improve container isolation. However, to achieve the same goal, each solution adopts a different approach. Kata employs virtual machine isolation, while Gvisor and Nabla use a libOS+seccomp combination. While attack surface minimization is present in all three solutions, Nabla is the only one that has put a primary focus on identifying, measuring and minimizing the attack surface to the kernel. This post will demonstrate what can go wrong if attack surface metrics are not used to guide the design of a secure container runtime. Specifically, we highlight the impact of a key design choice difference between Nabla and Kata containers–the amount of host kernel functionality accessible to a guest from a filesystem perspective.

Nabla uses a LibOS-powered filesystem interface, whereas Kata directly uses the host’s filesystem interface, exposed via the 9p protocol1. This means that from a filesystem perspective, today Kata permits access to more host kernel functions (filesystem and block device components), as compared to Nabla (block device only). And thus, in case of the former, there is a greater probability of hitting a kernel bug, or exploiting a vulnerability.

Kernel Oops!

We highlight this by reproducing an exploit of a recently discovered bug in the ext4 filesystem code- CVE-2018-10840. In this case, removing the extended attributes of a file on a crafted ext4 image leads to a kernel oops due to a buffer overflow. The culprit syscall which allows a userspace program to trigger this bug is removexattr(). When testing with a Ubuntu 18.042 host system, we were able to trigger this bug in the kernel, from inside a Kata container. The following video shows a demo of this.

fs-bug-demo (Click on the image to launch the demo gif)

The Bug

We used the original author’s crafted image to test this bug, which can be exposed with the following simple trigger code:

// mpoint is the path to the mounted crafted image
static void activity(char *mpoint) {
  char *xattr;
  int err = asprintf(&xattr, "%s/foo/bar/xattr", mpoint);
  char buf2[113];
  memset(buf2, 0, sizeof(buf2));
  listxattr(xattr, buf2, sizeof(buf2));
  removexattr(xattr, "user.mime_type");
}

The buffer overflow happens in the kernel’s fs/ext4/xattr.c:ext4_xattr_set_entry() function, because of a missed check on the size parameter of the memmove() function, which is negative when trying to remove an extended attribute (user.mime_type) for the file foo/bar/xattr in the crafted image. The fix is fairly straightforward in this case–adding a value check.

Which secure containers are safe?

Today, Kata Containers are vulnerable to this bug, due to the 9p protocol allowing to pass through too much of the filesystem-related system calls. On the other hand, Nabla’s philosophy of limiting access to the host kernel means that a container’s file access does not interface with the host’s filesystem. Specifically, Nabla does not use a 9p or a passthrough protocol, with syscalls on the filesystem handled at the LibOS (rumprun) level. Thus, this bug would never get triggered3, even if the host kernel is vulnerable, since removexattr() is not propagated to the host. Nabla’s minimal access into the host is facilitated by a fairly restrictive seccomp profile.

The bug does not surface in case of gVisor as well. The Sentry process does not support the syscall–”Removexattr, requires filesystem support”, which also does not feature in the set of syscalls executed by the Gofer. Now, we aren’t aware of gVisor’s methodology of identifying which parts of the kernel are safe to expose, but so far as this bug is concerned, gVisor isn’t affected4.

So what?

The way this bug is exploited in this example doesn’t present a serious security vulnerability in a container setting, since:

  1. it needs a crafted image to be exposed as rootfs or a volume to the containers,
  2. the end product of a successful exploit means only a killed process within the adversary container, and
  3. the kernel remains functional despite the bug being triggered.

But the underlying heap buffer overflow can certainly be exploited for deadlier attacks. It serves as an example of what’s possible when the host kernel functionality is available for (ab)use, and the need to minimize the attack surface. Even from a filesystem access perspective, we cannot eliminate the existence of bugs that do not require a crafted image, as well as the potential of bugs causing a full kernel panic, or serious side-effects such as privilege escalation.

In this blog, although we chose the filesystem interface to highlight Nabla’s attack surface minimization philosophy, the same is applicable to other components as well, including bypassing bugs in the virtualization interface itself5!

To summarize: even though the goal is the same (container isolation), the design choice differences matter–its about the ‘principles’ ;)

  1. Kata does use a block device interface for its containers, when Docker is configured to use the devicemapper storage driver. 

  2. Ubuntu’s latest kernel in use-4.15.0-38-generic contains the patch for this bug. We used Ubuntu’s 4.15.0-29-generic kernel to trigger this bug. 

  3. For complete transparency, removexattr() would not work in Nabla as of today, since support for writable filesystem is not fully complete. Contributors welcome! 

  4. Just like gVisor, it can be argued that Nabla also got “lucky” with its attack surface metrics. We hope this debate encourages the community to think about better ways of determining what’s safe to expose to the containers above. 

  5. Read more about our take on virtualization isolation in our HotCloud ‘18 paper Say Goodbye to Virtualization for a Safer Cloud