Rux
A Little Book about Rust Microkernel

Table of Contents

This is the design documentation of Rux. Rux is a microkernel written in Rust, featuring a L4-family microkernel design.

In the mean time, have a look at the source code.

Overview

Rux's goal is to become a safe general-purpose microkernel. It tries to take advantage of Rust's memory model – ownership and lifetime. While the kernel will be small, unsafe code should be kept minimal. This makes updating functionalities of the kernel hassle-free.

Rux uses a design that is similar to seL4. While there won't be formal verification in the short term, it tries to address some design issues of seL4, for example, capability allocation.

Quickstart

Currently, due to packaging problem, the kernel is only tested to compile and run on Linux with x86_64. Platforms with qemu and compiler target of x86_64 should all be able to run this kernel, but do it at your own risk.

To run the kernel, first install Rust, qemu, and cross-compiled GNU's binutils. The easiest way to do it is through the shell.nix file provided in the source code. Install Nix, then go to the source code root and run the following command:

nix-shell

After that, run:

make run

You should see the kernel start to run with a qemu VGA buffer. The buffer, after the kernel successfully booted, should show a simple command-line interface controlled by rinit program launched by the kernel. Several commands can be used to test things out.

echo [message]

Echo messages and print them back to the VGA buffer.

list

Print the current CPool slots into the kernel message buffer.

retype cpool [source slot id] [target slot id]

Retype an Untyped capability into a CPool capability. [source slot id] should be a valid slot index of an Untyped capability. [target slot id] should be an empty slot for holding the retyped CPool capability.

Source Code Structure

The development of Rux happen in the master branch in the source code tree. The kernel resides in the kernel folder, with platform-specific code in kernel/src/arch. For the x86_64 platform, the kernel is booted from kernel/src/arch/x86_64/start.S. The assembly code them jumps to the kinit function in kernel/src/arch/x86_64/init/mod.rs.

After the kernel is bootstrapped, it will initialize a user-space program called rinit, which resides in the rinit folder. The user-space program talks with the kernel through system calls, with ABI defined in the package abi, and wrapped in system.

Kernel Design

Capabilities

Capabilities are used in kernel to manage Kernel Objects. Those Capabilities are reference-counted pointers that provide management for object lifecycles.

Capabilities in user-space can be accessed using so-called CAddress, refered through the root capability of the user-space task. This helps to handle all permission managements for the kernel, and thus no priviliged program or account is needed.

Current implemented capabilities are:

  • Untyped memory capability (UntypedCap)
  • Capability pool capability (CPoolCap)
  • Paging capability
    • PML4Cap, PDPTCap, PDCap, PTCap
    • RawPageCap, TaskBufferPageCap
    • VGA buffer
  • CPU time sharing capability (TaskCap)
  • Inter-process communication capability (ChannelCap)

Example: Initialize a New Task

This example shows how to initialize a new task using the capability system.

  • Create an empty TaskCap.
  • Create an empty CPoolCap.
  • Initialize paging capabilities (One PML4Cap, Several PDPTCap, PDCap, PTCap and RawPageCap)
  • Assign the stack pointer in TaskCap.
  • Load the program into those RawPageCap.
  • Assign the PML4Cap to TaskCap.
  • Assign the CPoolCap to TaskCap.
  • Switch to the task!

Implementation

Implementing reference-counted object is a little bit tricky in kernel, as objects need to be immediately freed, and all weak pointers need to be cleared after the last strong pointer goes out. Rux's implementation uses something called WeakPool to implement this. The original reference counted object (called Inner), form a double-linked list into the nodes in multiple WeakPools.

Capability Pools

Capability Pools (or CPool) are used to hold multiple capability together. This is useful for programs to pass around permissions, and is essential for CPool addressing. In implementation, capability pools are implemented as a WeakPool.

Tasks

A task capability has a pointer to a capability pool (the root for CPool addressing), a task buffer (for kernel calls), and a top-level page table. When switching to a task, the kernel switches to the page table specified.

The switch_to function implemented uses several tricks to make it "safe" as in Rust's sense. When an interrupt happens in userspace, the kernel makes it as if the switch_to function has returned.

In kernel-space, interrupts are disabled.

Channels

Tasks communicate with each other through channels. A channel has a short buffer holding messages sent from a task, and will respond this to the first task that calls wait on the channel.

General

Design Decisions

Kernel calls are sync. However, they are not a lot of kernel calls (Send, Call, Wait, Reply, etc.) and most of them are really short. Nearly all kernel functionalities are implemented the same way as IPC. IPCs are one-way, thus async.

Functionalities to be included in the microkernel:

  • Task scheduling. This will be a simple Robin-round scheduler.
  • Frame allocation. However, paging is managed at user-mode.
  • IPC.
  • Other hardware primitives.

Kernel Address Mapping

Above 0xFFFF FFFF 8000 0000 is reserved by the kernel. For a page table capability, it cannot be mapped. Below that is mappable for a page table capability.

Kernel space is always available in any virtual address space. Currently to accommodate this, any PML4's last entry will always point to the kernel's PDPT address.

Each retyped sub-capability form a linked list. It is used to check the memory to be allocated is actually free.

New Page Table

The kernel maintains a static pointer to a Page Directory. Thus initial new page table would require a PML4, a PDPT, a PD, a PT for object pool, one or more (usually one) PT for kernel pages.

Kernel Stack

A guard page will be placed on the original PML4.

Accessing Kernel Objects

Paging is managed at user-mode. As a result, there's no way to dynamically map kernel objects to make it accessible. Thus we need ways to access kernel objects with no memory allocation.

At bootstrapping, a fixed number of virtual address is reserved for accessing kernel objects. The kernel then use a with_object<T>(addr, f(obj)) function to map certain physical address to the reserved virtual address. This allows us to access any physical address we want.

Bootstrapping

Bootstrapping is arch-specific. Usually it creates a virtual address space and set up interrupts according to the design above, and then hand over to kmain. Kmain will then use capability-related methods (the same as handling syscalls) to invoke an initial TCB, its related page frames and other system resources capabilities. It then hand over control to that user-space program.

About Unsafe

In Rust, safe code means totally safe, which prevents:

  • Dereferencing null or dangling pointers
  • Reading uninitialized memory
  • Breaking the pointer aliasing rules
  • Producing invalid primitive values:
    • dangling/null references
    • a bool that isn't 0 or 1
    • an undefined enum discriminant
    • a char outside the ranges [0x0, 0xD7FF] and [0xE000, 0x10FFFF]
    • A non-utf8 str
  • Unwinding into another language
  • Causing a data race

This behaviour should be preserved in Rux.

Capability

Reference to an area of memory that is used to support the kernel service that the capability provides.

All capability information, including its type, is invisible to user-mode. This is to make system call monitor, as well as IPC over network possible.

Invocation

cspace address, Send, Wait

Inter-process Communication

Using endpoint objects. A one-way channel/broadcaster. Wait and Send paired and perform message transfer.

  • Contains a queue of threads.
  • State flag whether Wait or Send.

Badged Messages

Endpoint cap –Mint–> Badged endpoint cap

Message Transfer

  • Registers + message buffer
  • Message buffer cap missing -> truncated

Capability Registers

Used to transfer capabilities

Non-blocking Invocations

Dropped if no one is waiting

Call

Creates a resume cap

Reply (non-blocking) or SaveCaller ReplyWait: reply followed by a wait

System Call Monitors

Replace the kernel object cap with an endpoint

Memory Management

Capability

A capability is a reference to a kernel object.

Untyped

Various-sized with maximum u64.

Retype
  • source: pointing to a CNode entry of the source Untyped.
  • newType: type of the new object.
  • newSizeBits: size bits for objects of various sizes.
  • slots: array of slots to be inserted into (only get the starting point from the system call).

The regionBase and freeRegionBase (which will be aligned) will be calculated. It actually do the frame allocation. Given we have a double linked list, this would be feasible.

Retyped cap will be a child of the original cap in the MDB.

CNode

Contains CTE, which is a capability with MDB entry.

MDB entry must be only inside CTE but not kernel stack, because there's no way it can be pointed to. This we ensure that all CTE are valid at all times (although newly created caps may not be revoked (which is seldom needed) if accidentally deleted by the kernel).

Revoke(slot)

Revoke a cap. This deletes the underlying kernel objects and all child cap. Updating MDB is required.

Delete(slot)

Delete the current cap. Revoke if it is the last reference. Updating MDB is required.

Recycle(slot)

(Implementing in a later iteration)

Reset a cap to its original state. Child caps will be revoked.

Insert(cap, srcSlot, destSlot)

(Not for system call, but helper function)

Insert cap to destSlot, in MDB as a child of srcSlot.

Mint and copy uses this.

P4/P3/P2/P1/Page

Four-level page tables and pages for x8664. Can be created by retyping Untype. P4 always has the last entry pointing as recursive mapping, and the second-half as kernel space.

Kernel needs to allocate its own needed P4/P3/P2/P1 when bootstrapping. Higher-half P3 is always shared.

P4/P3/P2/P1 and Page need to have their own cap because of MDB. Otherwise revoking Untyped would not work.

Each page table and page cap can only be mapped once. Copy or mint the cap to share.

Each cap contains its mapped address (if mapped) to allow revocation. Cap rights include map/unmap/switch.

This is exactly how seL4 handles it.

After revocation of a P4, if it makes a thread contain invalid P4 root, then its register is set to the global/idle P4, which contains only kernel mappings. (This is feasible because TCB acts like a CNode, and the VSpace attribute is an actual cap.)

P3/P2/P1/Page map/unmap

Map P3 to P4, P2 to P3, P1 to P2 and Page to P1.

Switch

Set thread's VTable attribute to switch.

Tasks

TCB

CTable and VTable can be treated as a CNode. Thus they are also with MDB entry. TCB contains:

  • CTable (thread's root CNode)
  • VTable (thread's page table)
  • State (thread's state)
  • Priority (for a Robin-round scheduler)
  • Queued
  • Time slice

To be implemented in a later iteration:

  • Reply
  • Caller
  • IPCBufferFrame
  • IPCBuffer's virtual address (for user-level reading)
  • Security domain
  • Falut state
  • User-level context

Memory Mapping Database

How to make capabilities always valid in kernel mode?

It's probablly better to keep the mapping database together with the reference (the capability).

Capability Functionality

Those that needs to be kept into the mapping database are untyped, and those that the untyped retypes into.

Deletion and Revocation

For multi-kernel structure, memory mapping database becomes troublesome. When one kernel is operating on a capability, another kernel might be trying to delete the resources under it (or capabilities in the parental chain). Simply locking all resources before deleting could result in dead locks. Also, recursive revoking capabilities could result in a long time in kernel mode, which could be bad for real-time systems. Here, we present a solution for this problem.

Revoking

When a capability is mint, the sub-capability becomes a child of the original capability, if they are still related. This includes retype operation (when the original capability is still preserved). An exception to this is Untyped's split operation, which splits one untyped into two. A multi-kernel design should take advantage of this, and obtain multiple untyped for every kernel.

When a capability is revoked, its underlying resources is deleted. However, for this operation, we only delete this single capability, and immediately return to user-space.

Before deletion, of course, we'll lock the capability, together with all its parental chain.

Retyping and Minting

When retyping and minting, the new capability becomes a child of the original capability. At this time, a pointer to the original capability will be created. Locks to the original capability and its parental chain is required.

Capability –> Resource Descriptor –parent–> Resource Descriptor

A resource descriptor needs to contain its children information. This creates static overhead for Untyped and Endpoints. CPool, TCB and many others do not contain children. Their capability can be mint or copied, but that only modifies the capability, not the resource descriptor.

A capability contains a validation path from top-level resource descriptor (one that is above kernel space) to the actual resource descriptor.

In order to create the pointer, we require the original capability not to be on the stack. That is, all capabilities that have children cannot be on the stack.

When a capability has children, it cannot be moved.

Weak Box

Instead of preserve a memory mapping database, we use weak boxes for obtaining capabilities. When obtaining a capability from a CPool or when a new capability is created, a weak box is returned. the kernel code need to call with_half function on the weak box to obtain the half that is actually needed.

When calling with_half, the weak box will first check whether all capabilities in the parental chain actually exist (if not, return None), and lock it if they do.

A Non-circular Capability Model

Functionality:

  • Smart pointers to kernel objects.
  • Referrable, movable, retypable, deletable by userspace.

Related Kernel Design